Full Report
The CCN project is co-financed by the European Regional Development Fund and the State Budget under the European Funds for Digital Development Programme 2021-2027. Fuzzing is an automated software testing technique that involves feeding random or deliberately malformed input data to detect bugs and security vulnerabilities. For years it has …
Analysis Summary
# Research: Autonomous fuzzing process under LLM supervision
## Metadata
- **Authors:** CERT Polska / NASK Cybersecurity Center
- **Institution:** Fuzzing and Malware Analysis Laboratory (FUMAL), NASK
- **Publication:** CERT Polska Expert Blog/Publications
- **Date:** May 21, 2026 (Projected/Reported)
## Abstract
This research introduces **fuzzlab**, an autonomous system designed to eliminate the manual bottlenecks of software fuzzing. By integrating Large Language Models (LLMs) and Machine Learning (ML) as supervisory layers, the system automates code analysis, test generation, crash classification, and report writing. The research demonstrates that LLMs, when constrained by structured procedures rather than left to improvise, can effectively manage high-scale vulnerability discovery in complex open-source projects.
## Research Objective
The study addresses the "human bottleneck" in fuzzing: executing fuzz tests is cheap, but preparing them (reviewing documentation, writing harnesses, configuring detectors, and triaging results) is resource-intensive. The research asks whether an LLM-supervised pipeline can maintain a continuous testing loop autonomously, without human intervention.
## Methodology
### Approach
The researchers developed a modular architecture where an LLM serves two primary functions:
1. **Specialized Operator:** Handles discrete tasks like filtering data, generating test harnesses, and classifying crashes.
2. **Pipeline Supervisor:** Monitors the process for anomalies (e.g., coverage drops, build errors) and applies self-correction or procedural improvements.
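
The supervisor's anomaly check can be sketched as follows. This is a minimal illustration, not fuzzlab's actual code: the `CycleMetrics` record, its field names, and the 10% relative drop threshold are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CycleMetrics:
    """Hypothetical per-cycle record; field names are illustrative."""
    cycle: int
    coverage: float   # fraction of code covered, 0.0 to 1.0
    build_ok: bool


def detect_anomalies(history: List[CycleMetrics],
                     drop_threshold: float = 0.10) -> List[str]:
    """Return anomaly labels for the most recent cycle.

    A coverage drop is flagged when the latest coverage falls more than
    `drop_threshold` (relative) below the best coverage seen earlier.
    """
    if not history:
        return []
    latest = history[-1]
    anomalies = []
    if not latest.build_ok:
        anomalies.append("build_error")
    earlier = [m.coverage for m in history[:-1]]
    if earlier:
        best = max(earlier)
        if best > 0 and (best - latest.coverage) / best > drop_threshold:
            anomalies.append("coverage_drop")
    return anomalies


history = [
    CycleMetrics(1, 0.42, True),
    CycleMetrics(2, 0.45, True),
    CycleMetrics(3, 0.30, True),
]
print(detect_anomalies(history))
```

In a design like this, the labels returned by the check would decide which corrective procedure the supervisor LLM is asked to run, so the model is only invoked for a named, bounded problem.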
### Dataset/Environment
- **Included Languages:** C, C++, Python, and Go.
- **Targets:** Widely used open-source software, including ModSecurity (WAF engine) and Oracle VirtualBox.
- **Scale:** 2,786 test programs in rotation across 2,057 completed cycles.
### Tools & Technologies
- **fuzzlab:** The core platform (4 Python modules).
- **LLM Integration:** Provider-agnostic interface (supports local or cloud models).
- **ML Models:** Local per-project and global models for risk prediction and ranking.
- **Data:** Over 16 million corpus files and 5 million samples for ML training.
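
One way a provider-agnostic LLM interface could look is sketched below. The source does not describe fuzzlab's real interface, so `LLMClient`, `CannedClient`, `UppercaseClient`, and `summarize_crash` are hypothetical names; the point is only that pipeline code depends on a small interface rather than on a specific local or cloud provider.

```python
from typing import Protocol


class LLMClient(Protocol):
    """Hypothetical minimal interface that every provider adapter implements."""

    def complete(self, prompt: str) -> str: ...


class CannedClient:
    """Stand-in provider that returns a fixed reply; useful for offline tests."""

    def __init__(self, reply: str) -> None:
        self.reply = reply

    def complete(self, prompt: str) -> str:
        return self.reply


class UppercaseClient:
    """Second stand-in with different behavior, to show the two are swappable."""

    def complete(self, prompt: str) -> str:
        return prompt.upper()


def summarize_crash(client: LLMClient, stack_trace: str) -> str:
    """Pipeline code talks to the interface only, never to a concrete provider."""
    prompt = f"Summarize this crash in one line:\n{stack_trace}"
    return client.complete(prompt).strip()


print(summarize_crash(CannedClient(" heap overflow in parse() \n"), "trace"))
```

Swapping a local model for a cloud model then means writing one new adapter class with a `complete` method; callers such as `summarize_crash` stay unchanged.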
## Key Findings
### Primary Results
1. **High Efficacy:** Identified real-world vulnerabilities in major projects (Oracle VirtualBox, ModSecurity).
2. **Autonomous Error Correction:** The system successfully self-repaired build errors and misconfigurations that would typically stall a human-led campaign.
3. **Scalable Classification:** Processed over 100,000 "raw" crashes to isolate 696 unique cases, of which 550 were classified as real bugs and 141 as false positives (the remaining 5 are not broken down in the reported figures).
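
Reducing a large volume of raw crashes to a small set of unique cases is commonly done by bucketing crashes on a stack signature. The source does not say which method fuzzlab uses, so the sketch below is an assumption: it hashes the top three stack frames, and the function names and sample frames are invented for illustration.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def crash_signature(frames: List[str], top_n: int = 3) -> str:
    """Hash the top `top_n` stack frames into a short bucket key."""
    key = "|".join(frames[:top_n])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def bucket_crashes(crashes: List[List[str]], top_n: int = 3) -> Dict[str, List[int]]:
    """Group crash indices by signature; each bucket is one candidate unique case."""
    buckets = defaultdict(list)
    for index, frames in enumerate(crashes):
        buckets[crash_signature(frames, top_n)].append(index)
    return dict(buckets)


# Invented stack traces, innermost frame first.
raw = [
    ["memcpy", "parse_header", "handle_request", "main"],
    ["memcpy", "parse_header", "handle_request", "worker"],
    ["free", "close_session", "main"],
]
buckets = bucket_crashes(raw)
print(len(raw), "raw crashes ->", len(buckets), "unique buckets")
```

The first two crashes share their top three frames and land in the same bucket, so only one representative per bucket would need to be passed on for classification as a real bug or a false positive.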
### Supporting Evidence
- **Global ML Model:** Achieved an average AUC (Area Under Curve) of 0.947 over 5 million samples.
- **Per-Project ML Model:** Reached an AUC of 0.981 across 16,805 training sessions.
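
To make the AUC figures concrete, the sketch below computes AUC on a toy example, assuming the reported metric is the area under the ROC curve (the source does not specify the curve). Under that assumption, AUC equals the probability that a randomly chosen positive sample is scored above a randomly chosen negative one. The labels and scores are invented.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [1, 0, 1, 0, 0]            # 1 = risky target, 0 = not risky (toy data)
scores = [0.9, 0.8, 0.7, 0.3, 0.1]  # model scores for the same samples
print(round(roc_auc(labels, scores), 4))  # 5 of 6 pairs ordered correctly -> 0.8333
```

The pairwise loop is quadratic and only suitable for small examples; at the scale of millions of samples a rank-based computation would be used instead.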
### Novel Contributions
- **Boundary-Based LLM Logic:** Rather than acting as general agents, LLMs operate within strict "standardized interfaces," which reduces hallucinations.
- **Self-Evolving Loop:** A feedback loop where the system retrains its own ML models based on the findings of previous iterations.
## Technical Details
The architecture separates the "execution" (the fuzzer) from the "logic" (the LLM/ML). The Python-based controller uses structured data (JSON/Protobuf) to communicate with the LLM. Constraining requests and replies to a fixed structure keeps the LLM on its assigned task and lets it act as a "specialized operator" rather than a free-form agent. For example, if code coverage drops, the supervisor LLM analyzes the logs, identifies a blocked execution path, and modifies the test harness to get past the blockage.
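
A minimal sketch of how a controller might enforce a structured reply is shown below. The field names (`action`, `reason`) and the set of allowed actions are assumptions made for this example; the source only states that JSON/Protobuf is used, not what the messages contain.

```python
import json
from typing import Any, Dict

# Hypothetical action vocabulary; the real set is not given in the source.
ALLOWED_ACTIONS = {"regenerate_harness", "fix_build", "no_action"}


def parse_supervisor_reply(raw: str) -> Dict[str, Any]:
    """Accept a reply only if it is a JSON object with a known action and a string reason."""
    try:
        reply = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(reply, dict):
        raise ValueError("reply must be a JSON object")
    action = reply.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    reason = reply.get("reason")
    if not isinstance(reason, str):
        raise ValueError("reason must be a string")
    # Unknown extra fields are dropped so they cannot influence later stages.
    return {"action": action, "reason": reason}


good = '{"action": "regenerate_harness", "reason": "parser path blocked by checksum"}'
print(parse_supervisor_reply(good))
```

Free-form text or an action outside the allowed set raises `ValueError`, so the controller can retry or fall back instead of executing an unexpected instruction.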
## Practical Implications
### For Security Practitioners
- **Reduced Lead Time:** The time from "new code" to "vulnerability report" is significantly shortened.
- **Automation of Triage:** Practitioners can focus on patching rather than sorting through thousands of redundant crash logs.
### For Defenders
- **Continuous Security:** Fuzzing can be integrated into CI/CD pipelines as a self-healing process that adapts as the codebase changes.
### For Researchers
- **Shift in Focus:** Research can move from "how to find bugs" to "how to build better supervisory logic" for autonomous agents.
## Limitations
- **Proof-of-Concept Status:** The project is currently in the PoC phase; long-term stability in diverse environments is still being tested.
- **Ranking vs. Absolute Quality:** Although AUC is high, the authors note that it measures ranking quality across the whole list and does not guarantee accuracy at the very top of the priority list, which is where the computational budget is allocated.
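
The gap between a high AUC and quality at the top of the list can be shown with a small synthetic example. The data below is invented and assumes ROC AUC; `precision_at_k` is a standard ranking metric used here for illustration, not one the source reports.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_at_k(labels: Sequence[int], scores: Sequence[float], k: int) -> float:
    """Share of true positives among the k highest-scored items."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k


# Synthetic ranking: one non-bug scored highest, then 10 real bugs, then 89 non-bugs.
labels = [0] + [1] * 10 + [0] * 89
scores = [1.0 - i / 100 for i in range(100)]  # strictly decreasing scores

print(round(roc_auc(labels, scores), 4))  # 89/90, about 0.9889
print(precision_at_k(labels, scores, 1))  # 0.0: the single top pick is wrong
```

Here the AUC is close to 0.99, yet a budget that only covers the top-ranked item would be spent entirely on a false lead, which is the kind of limitation the authors describe.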
## Comparison to Prior Work
Unlike "classic" fuzzing (e.g., AFL, libFuzzer), which requires manual harness writing, or Google's OSS-Fuzz, which still requires significant developer setup, **fuzzlab** aims for a "zero-touch" lifecycle in which the AI handles every step between raw source code and a final vulnerability report.
## Real-world Applications
- **Infrastructure Protection:** Hardening WAFs and Hypervisors (as seen with ModSecurity and VirtualBox).
- **Supply Chain Security:** Mass-scanning open-source libraries that underpin modern software stacks.
## Future Work
- Improving the "top-tier" ranking accuracy of the ML models.
- Expanding the library of supported languages and complex execution environments.
- Exploring fully closed-loop systems where the AI also proposes code patches.
## References
- CERT Polska / NASK FUMAL Research Documentation.
- Fuzz Introspector (OSS-Fuzz integration).
- Relates to: [https://cert.pl/en/tag/fuzzing/](https://cert.pl/en/tag/fuzzing/)