Full Report
The author of this post set out to measure how well Opus 4.5 and GPT-5.2 could exploit a new vulnerability in the QuickJS JavaScript interpreter. The experiment varied the difficulty by enabling different exploit mitigations and setting different target goals. Across more than 40 distinct exploits, GPT-5.2 solved every scenario and Opus 4.5 solved all but two.

The vulnerability itself is documented at the beginning of the post. Both agents quickly turned it into a read/write primitive API, which made the rest of the exploitation easier. From there, they leveraged publicly known weaknesses to build exploit chains. The hardest test enabled fine-grained CFI, a shadow stack, a seccomp sandbox, and more. To complete it, GPT-5.2 chained 7 function calls through glibc's exit handler, using about 50M tokens at a cost of roughly $150. The author found the vulnerability with an AI agent and wrote the exploits with agents as well.

So, now what? The industrialization of exploitation: an organization's ability to complete a task will be limited by the number of tokens it can afford, not by the number of people it employs. According to the author, exploit development is well suited to industrialization. The environment is easy to construct, the tools are well understood, and verification is straightforward. The necessary knowledge is public. The usual limit is how many ideas a person can try in the hours available, and a computer is not bound by that limit. The experiment shows that LLMs can exploit new security issues by drawing on their broad knowledge of exploitation techniques. The author also released the source code for the agents.
Analysis Summary
# Research: On the Coming Industrialisation of Exploit Generation with LLMs
## Metadata
- **Authors:** Sean Heelan
- **Institution:** Independent Researcher (Sean Heelan’s Blog)
- **Publication:** sean.heelan.io
- **Date:** January 18, 2026
## Abstract
This research explores the capabilities of frontier Large Language Models (LLMs)—specifically Claude Opus 4.5 and GPT-5.2—in automating the end-to-end exploitation of zero-day vulnerabilities. By deploying these models as autonomous agents against the QuickJS JavaScript interpreter, the author demonstrates that LLMs can independently move from source code analysis to the creation of complex, multi-stage exploit chains. The study concludes that exploit development is transitioning from a human-constrained craft to an "industrialized" process limited primarily by computational budget (tokens) rather than human labor.
## Research Objective
The study aims to determine if current frontier LLMs can autonomously exploit a zero-day vulnerability in a real-world software target (QuickJS) under varying levels of technical difficulty, including modern exploit mitigations and sandboxing.
## Methodology
### Approach
The author developed autonomous agent frameworks that utilized LLMs to perform iterative "trial and error" cycles. The agents were tasked with:
1. Analyzing the QuickJS source code to understand a zero-day vulnerability.
2. Building an "API-like" primitive for arbitrary memory read/write.
3. Chaining these primitives to bypass security mitigations to achieve specific goals (e.g., shell access, file writes).
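The iterative cycle described above can be sketched as a budgeted propose-and-verify loop. This is a minimal illustration, not the author's agent framework: `Attempt`, `run_agent`, the proposer, and the verifier are all hypothetical names, and the toy verifier simply accepts one fixed string where a real one would check the stated goal.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Attempt:
    candidate: str    # artifact proposed in this iteration (hypothetical)
    tokens_used: int  # tokens consumed to produce it (hypothetical accounting)


def run_agent(
    propose: Callable[[Optional[str]], Attempt],
    verify: Callable[[str], Tuple[bool, str]],
    token_budget: int,
) -> Tuple[Optional[str], int]:
    """Repeat propose -> verify until success or the token budget is spent."""
    feedback: Optional[str] = None
    spent = 0
    while spent < token_budget:
        attempt = propose(feedback)          # the model sees the last failure message
        spent += attempt.tokens_used
        ok, feedback = verify(attempt.candidate)
        if ok:
            return attempt.candidate, spent
    return None, spent                        # budget exhausted without success


def make_toy_proposer() -> Callable[[Optional[str]], Attempt]:
    """Toy stand-in for an LLM: emits candidate-1, candidate-2, ..."""
    state = {"n": 0}

    def propose(feedback: Optional[str]) -> Attempt:
        state["n"] += 1
        return Attempt(candidate=f"candidate-{state['n']}", tokens_used=1_000)

    return propose


def toy_verify(candidate: str) -> Tuple[bool, str]:
    """Toy verifier: only the third candidate passes."""
    ok = candidate == "candidate-3"
    return ok, ("ok" if ok else f"{candidate} failed verification")


result, spent = run_agent(make_toy_proposer(), toy_verify, token_budget=10_000)
print(result, spent)
```

The only stopping conditions are a passing verification or an exhausted budget, which is why the loop scales with tokens and not with analyst hours.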
### Dataset/Environment
- **Target:** QuickJS JavaScript interpreter (a real-world, though lower-complexity, interpreter).
- **Scenarios:** More than 40 distinct exploits across 6 scenarios.
- **Constraints:** Varied heap states, no hardcoded offsets, and restricted system calls.
### Tools & Technologies
- **LLMs:** OpenAI GPT-5.2 and Anthropic Claude Opus 4.5.
- **Security Mitigations:** ASLR, NX, Full RELRO, fine-grained Forward-edge CFI, Hardware-enforced Shadow Stack, and Seccomp sandboxing.
## Key Findings
### Primary Results
1. **High Success Rate:** GPT-5.2 solved every exploit scenario; Opus 4.5 solved all but two.
2. **Rapid Primitive Development:** Both agents quickly abstracted the vulnerability into a reusable read/write primitive API without human intervention.
3. **Complex Chain Construction:** In the most difficult task, GPT-5.2 successfully chained 7 different function calls through the `glibc` exit handler to bypass shadow stacks and CFI.
4. **Economic Efficiency:** The most complex exploit cost approximately $150 in tokens and was completed in roughly 3 hours.
### Supporting Evidence
- **Cost/Performance Ratio:** Most challenges were solved for <$30 USD in tokens.
- **Successful Reproduction:** The author released the "Anamnesis" codebase to allow for verification of these results.
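To put the cost figures in perspective, the sketch below derives a blended per-token rate from the hardest challenge (roughly 50M tokens for about $150, as reported) and projects it to a larger budget. It assumes the $150 covers exactly those 50M tokens and that cost scales linearly; real pricing depends on the model and on the mix of input and output tokens, so treat the result as an order-of-magnitude estimate. Both function names are illustrative.

```python
def cost_per_million(total_usd: float, total_tokens: int) -> float:
    """Blended USD cost per one million tokens."""
    return total_usd / (total_tokens / 1_000_000)


def projected_cost(tokens: int, usd_per_million: float) -> float:
    """Linear projection of cost for a given token count."""
    return tokens / 1_000_000 * usd_per_million


# Figures taken at face value from the report's hardest challenge.
rate = cost_per_million(150.0, 50_000_000)
# Hypothetical one-billion-token run at the same blended rate.
billion_token_run = projected_cost(1_000_000_000, rate)
print(rate, billion_token_run)  # 3.0 3000.0
```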
### Novel Contributions
- **Demonstration of Industrialization:** Proof that the bottleneck for high-end exploit dev is shifting from "talented humans" to "token throughput."
- **Zero-Day Autonomy:** Successful exploitation of an undocumented vulnerability rather than a known CVE (avoiding training data leakage issues).
## Technical Details
The agents faced a "hardened" build of QuickJS where standard ROP (Return-Oriented Programming) was neutralized by a hardware shadow stack, and shell execution was blocked by Seccomp. GPT-5.2 circumvented these by identifying an alternative execution path: it manipulated `glibc`'s `exit` handlers. By overwriting function pointers triggered during process termination, the agent executed a series of non-blocked system calls to write to the filesystem, demonstrating a sophisticated understanding of Linux process internals.
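As a conceptual model of why exit handlers matter here: they are stored function pointers, each paired with an argument, and the C runtime calls them in reverse order of registration at process termination. The toy table below is an assumption-laden simplification, not glibc's data structure. `ExitHandlerTable` and its methods are hypothetical, and the protections real glibc applies to stored pointers are omitted. It only shows that the sequence of calls at exit is determined by whatever the table contains.

```python
from typing import Callable, List, Tuple

Handler = Tuple[Callable[[object], None], object]  # (function, argument)


class ExitHandlerTable:
    """Toy model of a list of exit handlers; not glibc's real layout."""

    def __init__(self) -> None:
        self._entries: List[Handler] = []

    def register(self, fn: Callable[[object], None], arg: object) -> None:
        self._entries.append((fn, arg))

    def overwrite(self, index: int, fn: Callable[[object], None], arg: object) -> None:
        # Models a memory write that replaces a stored handler and its argument.
        self._entries[index] = (fn, arg)

    def run(self) -> None:
        # Handlers run in reverse order of registration, one call per entry.
        while self._entries:
            fn, arg = self._entries.pop()
            fn(arg)


calls = []
table = ExitHandlerTable()
table.register(lambda a: calls.append(("flush", a)), "stdout")
table.register(lambda a: calls.append(("close", a)), "logfile")
table.overwrite(0, lambda a: calls.append(("replaced", a)), "arg-under-writer-control")
table.run()
print(calls)
```

Each entry is an ordinary call made when the process terminates, not a corrupted return address, which is consistent with the summary's point that the shadow stack did not stop this path.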
## Practical Implications
### For Security Practitioners
- **Shift in Threat Profile:** High-quality exploits for non-tier-1 targets (IoT, embedded systems) can now be generated at scale for nominal costs.
- **Automated "Exploit APIs":** The ability of LLMs to create "Read/Write APIs" means vulnerability researchers can focus on high-level logic while the LLM handles the "plumbing" of the memory corruption.
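The "Read/Write API" idea can be illustrated with a small abstraction layer. The sketch runs entirely against a simulated byte buffer: `SimulatedMemory` and `RWPrimitive` are hypothetical names, and nothing here touches QuickJS or a real process. It shows how typed helpers such as `read64` can be layered over two raw byte-level operations, so that later logic never handles the low-level details directly.

```python
import struct


class SimulatedMemory:
    """Stand-in for an address space: a flat bytearray mapped at a base address."""

    def __init__(self, base: int, size: int) -> None:
        self.base = base
        self.data = bytearray(size)

    def raw_read(self, addr: int, n: int) -> bytes:
        off = addr - self.base
        if off < 0 or off + n > len(self.data):
            raise ValueError("address out of range")
        return bytes(self.data[off:off + n])

    def raw_write(self, addr: int, payload: bytes) -> None:
        off = addr - self.base
        if off < 0 or off + len(payload) > len(self.data):
            raise ValueError("address out of range")
        self.data[off:off + len(payload)] = payload


class RWPrimitive:
    """Typed helpers layered over two raw callables (read bytes, write bytes)."""

    def __init__(self, raw_read, raw_write) -> None:
        self._read = raw_read
        self._write = raw_write

    def read64(self, addr: int) -> int:
        # Little-endian unsigned 64-bit load.
        return struct.unpack("<Q", self._read(addr, 8))[0]

    def write64(self, addr: int, value: int) -> None:
        # Little-endian unsigned 64-bit store.
        self._write(addr, struct.pack("<Q", value))

    def read_cstring(self, addr: int, limit: int = 256) -> bytes:
        # Read byte by byte until a NUL terminator or the limit is reached.
        out = bytearray()
        for i in range(limit):
            b = self._read(addr + i, 1)
            if b == b"\x00":
                break
            out += b
        return bytes(out)


mem = SimulatedMemory(base=0x1000, size=64)
rw = RWPrimitive(mem.raw_read, mem.raw_write)
rw.write64(0x1008, 0xDEADBEEF)
mem.raw_write(0x1020, b"hello\x00")
print(hex(rw.read64(0x1008)), rw.read_cstring(0x1020))
```

Because `RWPrimitive` depends only on the two raw callables, the higher-level logic stays the same if the underlying mechanism changes.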
### For Defenders
- **Urgency of Mitigations:** Traditional mitigations (CFI, Shadow Stack) are necessary but insufficient; LLMs are highly proficient at finding the "gaps" in these protections.
- **Patch Speed:** The window between vulnerability discovery and weaponized exploit is shrinking toward zero.
### For Researchers
- **Real-world Testing:** Moving beyond CTFs (Capture The Flag) is essential; researchers should test agents against "hard" targets like the Linux kernel or modern browsers to find the true ceiling of LLM capability.
## Limitations
- **Target Complexity:** QuickJS is significantly smaller and less complex than "Tier 1" engines like V8 (Chrome) or SpiderMonkey (Firefox).
- **No Generic Breaks:** The LLMs did not produce novel, generic breaks of any protection mechanism such as CFI; they took paths through gaps in the protections that were already publicly known.
- **Cost Ceiling:** While $150 is cheap for an exploit, the token usage for a browser-grade exploit might be orders of magnitude higher.
## Comparison to Prior Work
Unlike previous studies focusing on "Cyber CTFs" or reproducing old CVEs where the solution likely existed in the training data, this work utilized a zero-day vulnerability and forced the agent to perform original discovery and engineering.
## Real-world Applications
- **Automated Red Teaming:** Scaling offensive security testing across massive codebases where human testers are too expensive.
- **IoT Vulnerability Research:** Rapidly generating exploits for the "long tail" of less-secured embedded devices.
## Future Work
- **Evaluating Tier 1 Targets:** Testing if massive token investment (billions of tokens) can crack hardened targets like the Linux kernel or Firefox.
- **Integration with Fuzzing:** Combining LLM exploit generation with AI-driven vulnerability discovery (e.g., Project Aardvark/Team Tiamat).
## References
- Heelan, S. (2026). *Anamnesis Release*. [github[.]com/SeanHeelan/anamnesis-release/]
- Heelan, S. (2026). *On the Coming Industrialisation of Exploit Generation with LLMs*. [sean[.]heelan[.]io]