Full Report
Security research involves long hours of staring at code and is done only by a specialized group of people. With the rise of LLMs comes the ability to use AI tools to find vulnerabilities. The authors built a bot to think as security engineers do:

1. Identify suspicious behaviour.
2. Prove reachability of the code.
3. Prove controllability: can the attacker influence the relevant data/state?
4. Determine real-world impact.

If any of those steps goes wrong, the bug won't be found. This is because it's long-form reasoning with compounding errors. Intuitive reasoning works locally but is bad globally: precision decays the longer the chains get. The key insight is that you need checkpoints to enforce correctness, not just more tokens.

Instead of using better prompts, they created harnesses. A harness is a set of constraints, scaffolding and checks that forces an agent to be systematic in its approach. They do this with the following steps:

1. Generate hypotheses explicitly.
2. Collect evidence before escalating confidence.
3. Use deterministic tools when possible.
4. Fail fast and prune dead ends.
5. Produce artifacts a reviewer can trust.

The post includes a great graph that explains their reasoning. The first curve is an exponentially decreasing value that scales with reasoning length; the longer a chain, the worse it does. The other curve is a shark tooth: at each verifiable subtask, the confidence is regained.

After this, they share some good insights into what has worked for them. First, use deterministic tools when possible. Using CodeQL to find sinks is better than asking an LLM to do so, because the result is deterministic and the LLM only needs to know how to use CodeQL. Another point is that native tools work better with their home model. For instance, Claude Code works best with Opus.

Scanners have multiple issues. From multi-step flow identification to boundary issues, they do fail.
The authors claim they use static analysis tools as much as possible and then rely on agentic reasoning to bridge the gap. This uses LLMs only when necessary, keeping things deterministic.

When reviewing code, not all lines are equal in terms of threat. Some repos/components only need shallow checks, while others need deep integration. By putting spend only onto difficult and promising areas, the costs stay lower and you will find more bugs.

The final major benefit is testing. If the code has a bug, this should be provable. Run the simulation, execute the PoC, and check whether the expected outcome occurred. This tends to remove false positives and improve confidence in an issue. Not all tests are created equal, though: there's a major difference between an isolated unit test and a full simulation.

This bot recently found a critical on Immunefi with the maximum payout of $250K. No word on what the bug is, but it's very interesting. They have other bugs on their profile as well.
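The testing step described above (execute the PoC, then check whether the expected outcome occurred) could be sketched roughly as follows. This is a minimal illustration, not the authors' tooling: the function name `poc_confirms_bug`, the idea of matching an `expected_marker` string in the PoC's output, and the timeout value are all assumptions made for the example.

```python
import subprocess
import sys


def poc_confirms_bug(cmd: list[str], expected_marker: str, timeout_s: float = 30.0) -> bool:
    """Run a PoC command and report whether the expected outcome was observed.

    Hypothetical helper: `expected_marker` is a string the PoC prints only
    when the exploit actually had its intended effect.
    """
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # A PoC that hangs proves nothing, so treat it as unconfirmed.
        return False
    return expected_marker in result.stdout


if __name__ == "__main__":
    # Stand-in PoC: a one-line Python script instead of a real simulation.
    fake_poc = [sys.executable, "-c", "print('balance drained')"]
    print(poc_confirms_bug(fake_poc, "balance drained"))
```

The check is deliberately strict: anything other than the expected marker, including a timeout, counts as "not confirmed", which is what lets a verification step filter out false positives.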
Analysis Summary
# Research: Bounty-Grade L1 Security Research
## Metadata
- **Authors**: Kritt Team
- **Institution**: Kritt AI
- **Publication**: Kritt.ai Technical Blog
- **Date**: 2024 (Recent)
## Abstract
This research introduces a novel framework for automating high-stakes vulnerability discovery using Large Language Models (LLMs) managed by "harnesses." By shifting from simple prompt engineering to a systematic, tool-augmented agentic workflow, the researchers demonstrate that AI can move beyond shallow "hallucination-prone" checks to finding critical, real-world vulnerabilities in Layer 1 (L1) blockchain protocols and complex codebases.
## Research Objective
The research addresses the "Precision Decay" problem in long-form AI reasoning. Specifically:
- Can an LLM-based agent identify vulnerabilities that require deep reachability and controllability analysis?
- How can we prevent "compounding errors" in the long chains of logic required for security engineering?
- Can AI be used to earn professional bug bounties on platforms like Immunefi?
## Methodology
### Approach
The researchers developed an agentic workflow modeled after the cognitive process of a security engineer:
1. **Hypothesis Generation**: Explicitly stating potential exploit vectors.
2. **Evidence Collection**: Gathering data before increasing confidence scores.
3. **Check-pointing**: Utilizing deterministic constraints to verify reasoning at sub-steps.
4. **Pruning**: "Failing fast" on dead-end logic to save compute and improve accuracy.
5. **Artifact Generation**: Producing verifiable Proofs of Concept (PoCs).
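
The five steps above can be sketched as a small control loop. This is an illustrative skeleton under stated assumptions, not Kritt's implementation: `Hypothesis`, `Check`, and `run_harness` are hypothetical names, and each check stands in for whatever deterministic tool or verified subtask a real harness would call.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Hypothesis:
    claim: str                                          # explicit statement of the suspected bug
    evidence: list[str] = field(default_factory=list)   # notes from checks that passed


# A check inspects a hypothesis and returns (passed, evidence_note).
Check = Callable[[Hypothesis], tuple[bool, str]]


def run_harness(hypotheses: list[Hypothesis], checks: list[Check]) -> list[Hypothesis]:
    """Keep only hypotheses that pass every check, pruning on the first failure."""
    confirmed = []
    for hyp in hypotheses:
        for check in checks:
            passed, note = check(hyp)
            if not passed:
                break  # fail fast: spend nothing more on this dead end
            hyp.evidence.append(note)  # confidence rises only with evidence
        else:
            confirmed.append(hyp)  # survived all checks; evidence is the artifact
    return confirmed
```

Ordering the checks from cheapest to most expensive means most dead ends are pruned before any costly step runs.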
### Dataset/Environment
The system was tested against live projects, specifically targeting high-complexity environments like L1 blockchains and protocols hosted on Immunefi.
### Tools & Technologies
- **Models**: Anthropic's Claude Opus, used through Claude Code (the post notes that native tools work best with their home model).
- **Deterministic Tools**: CodeQL (for sink identification and data flow).
- **Static Analysis**: Specialized scanners integrated into the agent's toolkit.
- **Simulation**: Execution environments to run generated PoCs.
## Key Findings
### Primary Results
1. **Verification Beats Volume**: Accuracy decays roughly exponentially as the reasoning chain gets longer, so spending more tokens alone does not help. "Verifiable subtasks" produce a "shark tooth" curve instead, restoring confidence at each checkpoint.
2. **Hybridization is Essential**: LLMs excel at bridging the gaps between existing static analysis tools, rather than replacing them entirely.
3. **Real-World Efficacy**: The system successfully identified a "Critical" severity bug on Immunefi, resulting in a $250,000 payout.
### Supporting Evidence
- The "Shark Tooth" reasoning graph: Demonstrates that without checkpoints, reasoning quality trends toward zero; with checkpoints (harnesses), quality is reset to high levels regularly.
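
A toy model makes the shape of that graph concrete. It assumes each reasoning step is independently correct with probability `p` and that a passed checkpoint fully restores confidence; both are idealizations, and the post gives no numbers, so `p`, the step count, and the checkpoint interval below are illustrative only.

```python
def unchecked_confidence(p: float, steps: int) -> list[float]:
    """Confidence after each step when errors compound freely: p**k."""
    return [p ** k for k in range(1, steps + 1)]


def checkpointed_confidence(p: float, steps: int, interval: int) -> list[float]:
    """Confidence when a passed checkpoint every `interval` steps resets it to 1.0."""
    values = []
    conf = 1.0
    for k in range(1, steps + 1):
        conf *= p
        values.append(conf)
        if k % interval == 0:
            conf = 1.0  # verified subtask: the next step starts from a trusted state
    return values


if __name__ == "__main__":
    plain = unchecked_confidence(0.9, 20)
    saw = checkpointed_confidence(0.9, 20, interval=5)
    print(f"no checkpoints, final confidence: {plain[-1]:.3f}")
    print(f"checkpoint every 5 steps, worst confidence: {min(saw):.3f}")
```

Without checkpoints the confidence after `n` steps is `p**n`; with a checkpoint every `k` steps it never drops below `p**k`, regardless of how long the overall chain is.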
### Novel Contributions
- **The "Harness" Concept**: Moving away from prompt engineering toward "scaffolding" that forces an agent to be systematic rather than intuitive.
- **Agentic Bridge**: Using LLMs specifically to connect deterministic tool outputs (like CodeQL sinks) to complex exploit paths.
## Technical Details
The core innovation is the **Harness Architecture**. Security engineering requires proving three things: **Behavior** (is it suspicious?), **Reachability** (can we get there?), and **Controllability** (can we influence the state?).
The system uses **Deterministic Primitives** (like CodeQL) to find the "Sink" (the dangerous line of code). It then uses the LLM to perform "Backwards Taint Analysis" to see if user input can reach that sink. If a step fails verification (e.g., the simulation doesn't crash as expected), the agent discards the path immediately.
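
The backwards walk from a sink can be illustrated with a plain call graph. This sketch covers only the reachability half (which entry points can reach the sink); it does not track data, so it is not a full taint analysis, and it is not the authors' code. The call graph, function names, and `paths_to_sink` are hypothetical; in the workflow described above the sink would come from a deterministic tool such as CodeQL rather than being hard-coded.

```python
from collections import deque


def paths_to_sink(
    call_graph: dict[str, list[str]], sink: str, entry_points: set[str]
) -> list[list[str]]:
    """Walk caller edges backwards from `sink`; return entry -> ... -> sink paths."""
    # Invert the graph so each function maps to the functions that call it.
    callers: dict[str, list[str]] = {}
    for caller, callees in call_graph.items():
        for callee in callees:
            callers.setdefault(callee, []).append(caller)

    found = []
    queue = deque([[sink]])
    while queue:
        path = queue.popleft()
        head = path[0]
        if head in entry_points:
            found.append(path)  # attacker-facing code reaches the sink
            continue
        for caller in callers.get(head, []):
            if caller not in path:  # skip cycles
                queue.append([caller] + path)
    return found


if __name__ == "__main__":
    graph = {
        "rpc_handler": ["validate", "apply_tx"],
        "validate": ["decode"],
        "apply_tx": ["decode"],
        "internal_job": ["decode"],
    }
    for path in paths_to_sink(graph, "decode", {"rpc_handler"}):
        print(" -> ".join(path))
```

Paths that never reach an entry point, such as the one through `internal_job` here, are dropped, which mirrors discarding a candidate as soon as reachability cannot be shown.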
## Practical Implications
### For Security Practitioners
- Automating the "boring" parts of security research (scanning and reachability) allows humans to focus on high-level logic.
- Cost-effective scaling: Using "spend" only on promising code paths.
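
One simple way to picture "spend only on promising code paths" is a greedy budget allocation. The post does not say how Kritt scores or prioritizes targets, so everything here is an assumption for illustration: the `promise` scores, the costs, the component names, and the greedy score-per-cost rule.

```python
def allocate_budget(targets: dict[str, tuple[float, float]], budget: float) -> list[str]:
    """Pick targets by promise-per-cost until the budget runs out.

    `targets` maps a component name to (promise_score, estimated_cost).
    Returns the chosen component names in the order they were selected.
    """
    ranked = sorted(
        targets.items(),
        key=lambda item: item[1][0] / item[1][1],  # promise per unit of spend
        reverse=True,
    )
    chosen = []
    remaining = budget
    for name, (_promise, cost) in ranked:
        if cost <= remaining:
            chosen.append(name)
            remaining -= cost
    return chosen


if __name__ == "__main__":
    components = {
        "consensus": (0.9, 50.0),  # high promise, expensive deep analysis
        "p2p": (0.6, 30.0),
        "cli": (0.1, 10.0),        # low promise, a shallow check would do
    }
    print(allocate_budget(components, budget=80.0))
```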
### For Defenders
- The barrier for finding complex exploits is lowering. Automated agents can now perform deep-chain reasoning that previously required weeks of human effort.
### For Researchers
- The focus should shift from "Can the LLM find the bug?" to "How do we build the environment that allows the LLM to verify its own findings?"
## Limitations
- **Tool Dependency**: The agent is only as good as the deterministic tools (CodeQL, etc.) it has access to.
- **Simulation Environment**: Not all bugs are easily simulated in an isolated environment (e.g., complex race conditions or cross-chain state issues).
- **Cost**: Deep reasoning on large codebases still requires significant compute spend, necessitating the "pruning" logic.
## Comparison to Prior Work
Traditional AI security tools often rely on "one-shot" vulnerability detection which suffers from high False Positive rates. Kritt’s approach differs by treating the LLM as a **reasoning engine** within a constrained execution loop, rather than a simple pattern matcher.
## Real-world Applications
- **Automated Bug Bounty Hunting**: Continuous scanning of repositories for high-value payouts.
- **Continuous Integration/Deployment (CI/CD)**: Integrating agentic reasoning into the PR review process for deeper security checks than standard linters.
## Future Work
- Improving the "agent-tool" interface to reduce latency.
- Expanding the library of deterministic tools the agent can use.
- Hardening the simulation environments to handle more complex, multi-component states.
## References
- kritt[.]ai/blog/bounty-grade-l1-security-research
- Immunefi Bug Bounty Platform
- GitHub: github/codeql