Wiz Research teamed up with Irregular, a frontier AI security lab, to settle this once and for all.
# Research: AI Agents vs Humans: Who Wins at Web Hacking in 2026?
## Metadata
- Authors: Gal Nagli, Irregular
- Institution: Wiz Research and Irregular (Frontier AI Security Lab)
- Publication: Wiz Blog
- Date: January 29, 2026
## Abstract
This research evaluates how well leading Large Language Model (LLM) agents (specifically Claude Sonnet 4.5, GPT-5, and Gemini 2.5 Pro) handle real-world-inspired web vulnerabilities compared to human penetration testers. The study used a controlled Capture The Flag (CTF) framework with ten distinct, high-value vulnerability scenarios. The agents were highly proficient at directed tasks, but performance dropped and operational cost rose when they had to operate autonomously within a broad scope without explicit direction.
## Research Objective
To assess and benchmark the efficacy, success rate, and operational cost of state-of-the-art AI agents in solving complex, real-world inspired web hacking challenges against established human performance baselines.
## Methodology
### Approach
The researchers designed a custom set of ten lab environments mirroring high-value vulnerabilities found in enterprise networks. The evaluation followed a structured CTF format where the AI agent was directed to explore a target website, find a specific vulnerability, and exploit it to retrieve a unique "flag," which served as the unambiguous success metric. Two operational modes were tested: directed challenges and an unguided 'Broad Scope' scenario.
### Dataset/Environment
Ten distinct lab challenges were created, each representing a specific, real-world inspired vulnerability type, including: Authentication Bypass, Exposed API Documentation, Open Directory, Stored XSS, S3 Bucket Takeover, AWS IMDS SSRF, Exposed Secrets, and SpringBoot Actuator Heapdump Leak. These were presented in a standard CTF setup accessible via a website interface.
### Tools & Technologies
The AI models tested included:
* Claude Sonnet 4.5
* GPT-5
* Gemini 2.5 Pro
Evaluation was performed using **Irregular’s proprietary agentic harness**, specifically optimized for evaluating model performance on cyber CTF challenges. Performance was benchmarked against a human penetration tester who first solved all the assigned challenges.
## Key Findings
### Primary Results
1. **High Success Rate in Directed Tasks:** In directed mode, the tested AI models solved 9 of the 10 challenges.
2. **Model Homogeneity:** All three frontier models solved the same set of 9 challenges, suggesting a convergence in capability for these specific tasks around mid-2025.
3. **Cost Escalation in Unbounded Scenarios:** In the 'Broad Scope' scenario, where agents had to identify and prioritize targets across a wide domain on their own, performance decreased and operational cost rose by a factor of 2 to 2.5 compared to directed tasks.
### Supporting Evidence
- For each solved challenge, an **expected cost per success** was calculated, factoring in the success rate across multiple runs (including failures).
- Unambiguous "win conditions" (flags) were crucial; without them, agents tended to produce false positives, exaggerate severity, and struggle to distinguish meaningful exploitation from noise.
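One plausible reading of the "expected cost per success" metric is total spend across all runs, failures included, divided by the number of successful runs. The sketch below illustrates that reading only; the summary does not give the exact formula, and the `Run` type, the function name, and the dollar figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Run:
    cost_usd: float  # hypothetical cost of a single agent run
    solved: bool     # True if the run retrieved the flag


def expected_cost_per_success(runs):
    """Total spend over all runs (failed ones included) per successful run.

    Assumed formula, not taken from the study.
    """
    successes = sum(1 for r in runs if r.solved)
    if successes == 0:
        return float("inf")  # never solved: no finite cost per success
    return sum(r.cost_usd for r in runs) / successes


# Illustrative numbers only: four runs, two of which found the flag.
runs = [Run(0.5, True), Run(0.75, False), Run(0.25, True), Run(0.5, False)]
print(expected_cost_per_success(runs))  # prints 1.0
```

Under this reading, failed runs raise the cost of each success, so a challenge solved only half the time costs roughly twice the per-run price.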
### Novel Contributions
- The creation of a standardized, replicable CTF framework based on 10 specific, enterprise-level vulnerability types to objectively measure autonomous agent capability in offensive security.
- Quantification of the increased operational cost associated with shifting LLM agents from directed exploit execution to autonomous target prioritization and discovery.
## Technical Details
The use of a clear "flag" mechanism addressed a known limitation in AI agent evaluation: the tendency toward producing noisy, incomplete reports lacking a quantifiable success metric. This hard success rubric forced agents to continue working until definitive exploitation was achieved, playing to the AI's strengths in highly structured, goal-oriented tasks.
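The hard success rubric described above can be sketched as a binary check: a run counts only if the agent's output contains the exact flag planted in the challenge. This is a minimal illustration, not Irregular's harness; the `FLAG{...}` format, `FLAG_PATTERN`, and `run_succeeded` are assumptions made for the example.

```python
import hmac
import re

# Hypothetical flag format; the study does not specify one.
FLAG_PATTERN = re.compile(r"FLAG\{[A-Za-z0-9_]+\}")


def run_succeeded(agent_output: str, expected_flag: str) -> bool:
    """Return True only if the output contains the exact expected flag."""
    for candidate in FLAG_PATTERN.findall(agent_output):
        # Constant-time comparison avoids leaking the flag through timing.
        if hmac.compare_digest(candidate.encode(), expected_flag.encode()):
            return True
    return False


# A report that claims a finding but never retrieves the flag scores False.
print(run_succeeded("The site looks vulnerable to XSS.", "FLAG{abc_123}"))
print(run_succeeded("Heapdump contained FLAG{abc_123}", "FLAG{abc_123}"))
```

A check of this kind leaves no room for severity inflation or partial credit, which is why it suppresses the false positives noted under Supporting Evidence.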
## Practical Implications
### For Security Practitioners
The research confirms that LLM agents are rapidly approaching human proficiency in executing specific, well-defined cyber attack steps. They serve as powerful force multipliers for pre-scripted or well-understood attack vectors.
### For Defenders
Defenders must recognize that advanced AI agents can rapidly chain together steps for known, high-signal vulnerabilities (like an authentication bypass or S3 takeover) given the correct initial access or prompt direction. Defense strategies must account for this accelerated pace of tactical execution.
### For Researchers
This evaluation highlights that the current difficulty barrier for AI agents is not in executing the exploit, but in the **unguided discovery and prioritization phase** in complex, real-world environments. Future research should focus on improving agent reasoning and noise filtering in broad scopes.
## Limitations
The study acknowledges that its findings are a snapshot of capabilities in mid-to-late 2025. Furthermore, the CTF environment favored AI performance (clear goals, defined targets, flag-based success metrics), so the results likely overstate agent efficacy relative to ambiguous, real-world penetration testing or bug bounty work.
## Comparison to Prior Work
This study moves beyond simple code analysis or single-step vulnerability identification by testing full exploitation chains within a semi-realistic, multi-stage CTF setup. It specifically quantifies the performance drop when moving from directed execution (common in earlier agent benchmarks) to autonomous scoping, which is a critical factor for real-world deployment.
## Future Work
The authors and collaborators suggest continued monitoring of frontier models, especially those released after this study, as capabilities are expected to advance rapidly. It remains an open question how agents will handle the non-binary, gradual success metrics found in actual security testing.
## References
- Irregular: AI Evaluation Platform Cyber Use Case. `https://www.irregular.com/publications/ai-evaluation-platform-cyber-use-case`
- Irregular: Emerging Evidence of a Capability Shift. `https://www.irregular.com/publications/emerging-evidence-of-a-capability-shift`