Full Report
We put LLMs to the test—let's find out how good AI is at hacking! We walk through six simple challenges with intentionally naïve setups to test how capable each model is at single-step exploit validation.
Analysis Summary
# Research: Benchmarking Self-Hosted LLMs for Offensive Security
## Metadata
- **Authors:** Brandon McGrath
- **Institution:** TrustedSec
- **Publication:** TrustedSec Blog
- **Date:** April 14, 2026
## Abstract
This research evaluates the efficacy of locally-hosted, small-to-medium language models (LLMs) in performing offensive security tasks. While frontier models like GPT-4 and Claude 3.5 have demonstrated significant hacking capabilities, most existing benchmarks rely on these cloud-based models or include "hints" in prompts. This study utilizes a custom testing harness to subject six open-weight models to a series of intentionally naïve, single-step exploit validation challenges based on the OWASP Juice Shop environment to determine if "local" AI can provide actionable utility for penetration testers without relying on external APIs.
## Research Objective
The research addresses the gap in benchmarking local AI models for offensive security. It specifically asks: Can self-hosted LLMs (ranging from 7B to 31B parameters) accurately identify and exploit vulnerabilities when provided with only a raw HTTP request/response and a simple objective, without the crutch of embedded hints or frontier-model reasoning?
## Methodology
### Approach
The researcher developed a Python-based harness that provides the LLM with a specific objective (e.g., "Find the administration page") and the raw HTTP content of a page. The models were tested on their ability to generate a single, valid follow-up HTTP request or a winning payload to achieve the objective in one step.
### Dataset/Environment
The environment utilized the **OWASP Juice Shop**, a purposefully insecure web application. Six challenges were selected:
1. Identifying an Administration Page.
2. Exploiting a Login via SQL Injection (SQLi).
3. Accessing sensitive `/ftp` directories.
4. Exploiting an IDOR (Insecure Direct Object Reference) in user profiles.
5. Exploiting a Cross-Site Scripting (XSS) vulnerability.
6. Forging a JWT (JSON Web Token) to gain admin access.
### Tools & Technologies
- **Models:** Llama-3.1-8b, Mistral-Nemo-12b, Gemma-2-27b, Qwen-2.5-32b, DeepSeek-Coder-V2-Lite (16b), and Phind-CodeLlama-34b.
- **Infrastructure:** Local hosting via Ollama/vLLM.
- **Challenge Platform:** OWASP Juice Shop via Docker.
## Key Findings
### Primary Results
1. **Inconsistency in Simple Tasks:** Local models struggled significantly more than frontier models. Even identifying a simple `/administration` link was not a 100% success rate across all models.
2. **SQLi and XSS Success:** Most "coding-centric" local models successfully identified and exploited basic SQL injection and XSS, provided the context was narrow.
3. **Failure on Complex Logic:** Challenges involving JWT forging or multi-step logic (like IDOR) showed a high failure rate, often resulting in "hallucinated" parameters or malformed JSON.
4. **Qwen and Gemma Dominance:** Qwen-2.5-32b and Gemma-2-27b emerged as the strongest performers among the local cohort for offensive reasoning.
### Supporting Evidence
- Statistical success rates varied, with high-parameter models (27b-32b) achieving ~60-70% success on simple web challenges, while 7b-8b models fell below 30% on non-trivial tasks.
### Novel Contributions
- **Anti-Hint Benchmarking:** Unlike many academic papers that provide LLMs with "CTF-style" hints, this research utilized "naïve setups," forcing the model to rely solely on the raw technical data provided.
- **Local Focus:** Provides a specific performance baseline for practitioners who cannot use cloud AI due to data privacy or engagement restrictions.
## Technical Details
The harness utilized a "Zero-Shot" prompting technique where the model's system prompt defined it as a "Senior Penetration Tester." The input format was strictly:
- **Context:** Raw HTTP Response Headers & Body.
- **Instruction:** "Generate the next HTTP request to [Objective]."
- **Constraint:** "Output only the HTTP request."
## Practical Implications
### For Security Practitioners
- Local LLMs can currently serve as "smart grep" tools—excellent for spotting obvious vulnerabilities in caught traffic but unreliable for complex exploitation.
### For Defenders
- The ability of 30B-class models to generate working SQLi and XSS payloads confirms that automated, AI-driven "script kiddie" style attacks are becoming computationally cheap and locally hostable.
### For Researchers
- There is a clear "intelligence threshold" around 20B+ parameters where models transition from purely predicting text to exhibiting basic security reasoning.
## Limitations
- **Single-Step Only:** The research restricted models to one interaction, which does not account for the "agentic" iterative loops that might improve success.
- **Known Targets:** Juice Shop is well-represented in training data; models might be recalling specific solutions rather than "reasoning" through the exploit.
## Comparison to Prior Work
Building on the "Fang et al. (2024)" paper regarding one-day vulnerabilities, this study proves that while frontier models (GPT-4) are highly capable, the current generation of *local* models still lags significantly, particularly in maintaining state and complex payload construction.
## Real-world Applications
- **Internal Tooling:** Can be integrated into proxies (like Burp Suite) to provide autonomous suggestions for secondary requests.
- **Red Teaming:** Automated scanning of internal documentation or codebases where data cannot leave the network.
## Future Work
- Evaluation of **Fine-tuned models** (e.g., xOffense) specifically trained on exploit datasets.
- Testing the impact of **Chain-of-Thought (CoT)** prompting on local model performance for complex tasks like JWT forgery.
## References
- Fang et al., 2024: *LLM Agents can Autonomously Exploit One-day Vulnerabilities* (arXiv:2404.11625)
- Happe, 2025: *Benchmarking Practices in LLM-driven Offensive Security* (arXiv:2504.10112v2)
- [trustedsec[.]com/blog/benchmarking-self-hosted-llms-for-offensive-security]