Following on from our preview, here's Ben Gelman and Sean Bergeron's research on enhancing command line classification with benign anomalous data
# Research: Sophos AI at Black Hat USA ’25: Anomaly detection betrayed us, so we gave it a new job – Enhancing command line classification with benign anomalous data
## Metadata
- Authors: Ben Gelman, Sean Bergeron
- Institution: Sophos AI (Sophos X-Ops)
- Publication: Sophos News (Presented at Black Hat USA 2025)
- Date: August 07, 2025
## Abstract
This research addresses the high false-positive rates that are endemic to traditional anomaly detection when it is applied to command line classification. The authors propose changing the role of anomaly detection: instead of using it primarily to flag malicious behavior, they use it to surface *rare and complex benign* command line variations. Pairing anomaly detection with high-precision Large Language Model (LLM) labeling yields a diverse dataset of non-malicious commands. This benign data is then added to the training set of supervised command-line classifiers, which reduces false positives and improves resilience without relying on scarce malicious examples.
## Research Objective
The primary objective is to overcome the main practical shortcoming of unsupervised anomaly detection in command line security, its tendency to generate excessive false positives, by repurposing it as a way to enhance the training data for supervised classification models. The research seeks to improve command line classification accuracy and reduce operational overhead by increasing coverage of the benign command space.
## Methodology
### Approach
The methodology pivots on changing the role of anomaly detection:
1. **Identify Anomalies:** Employ anomaly detection techniques on large streams of production command line telemetry to flag data points deviating from the established norm.
2. **LLM-based Benign Labeling:** Use an LLM (the authors name OpenAI’s o3-mini model) to label the detected anomalies automatically. The goal is near-perfect precision on the *benign* label.
3. **Data Augmentation:** Add the newly labeled benign anomalies to the training data of existing supervised command-line classifiers.
4. **Evaluation:** Assess the resulting decrease in false-positive rates for the enhanced supervised models.
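The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the token-rarity score is a stand-in for whatever anomaly detector is actually used, and `stub_labeler` is a hypothetical placeholder for the LLM labeling call. All function names and the toy telemetry are invented for the example.

```python
import math
from collections import Counter
from typing import Callable


def tokenize(cmd: str) -> list[str]:
    # Lowercased whitespace tokens; a real system would use richer features.
    return cmd.lower().split()


def rarity_scores(commands: list[str]) -> list[float]:
    """Mean negative log token frequency per command (stand-in anomaly score)."""
    counts = Counter(tok for cmd in commands for tok in tokenize(cmd))
    total = sum(counts.values())
    scores = []
    for cmd in commands:
        toks = tokenize(cmd)
        if not toks:
            scores.append(0.0)
            continue
        scores.append(sum(-math.log(counts[t] / total) for t in toks) / len(toks))
    return scores


def select_anomalies(commands: list[str], top_k: int) -> list[str]:
    """Step 1: keep the top_k highest-scoring (rarest-looking) commands."""
    scores = rarity_scores(commands)
    ranked = sorted(range(len(commands)), key=lambda i: scores[i], reverse=True)
    return [commands[i] for i in ranked[:top_k]]


def augment_benign(
    train_benign: list[str],
    anomalies: list[str],
    label_fn: Callable[[str], str],
) -> list[str]:
    """Steps 2-3: add only anomalies the labeler calls benign; skip the rest."""
    known = set(train_benign)
    added = [c for c in anomalies if c not in known and label_fn(c) == "benign"]
    return train_benign + added


def stub_labeler(cmd: str) -> str:
    # Hypothetical stand-in for an LLM call. Anything it is not sure about
    # is left out of the benign set, favoring precision over recall.
    return "benign" if cmd.startswith("robocopy") else "uncertain"


# Toy telemetry (invented): two frequent commands and two rare ones.
rare_admin = r"robocopy C:\src D:\dst /MIR /R:2"
rare_other = "powershell -enc AAAA"
telemetry = ["ls -la"] * 50 + ["git status"] * 30 + [rare_admin, rare_other]

anomalies = select_anomalies(telemetry, top_k=2)
augmented = augment_benign(["ls -la"], anomalies, stub_labeler)
print(augmented)
```

Step 4 (evaluation) would then compare classifiers trained on the original benign set and on `augmented`. The labeler discards anything it does not confidently call benign, which mirrors the emphasis on precision over recall described above.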
### Dataset/Environment
The research used over 50 million daily command lines drawn from real production telemetry. Anomaly detection was applied to this telemetry, followed by automated LLM-based labeling.
### Tools & Technologies
- **Anomaly Detection:** Used as a data discovery mechanism.
- **Large Language Models (LLMs):** Specifically, OpenAI's o3-mini model, employed for high-precision automated labeling ("near-perfect precision").
- **Supervised Classifiers:** The target models enhanced by the augmented training data.
## Key Findings
### Primary Results
1. **Re-purposing Anomaly Detection:** In this context, anomaly detection proved effective not at finding malicious commands but at surfacing a diverse set of benign command lines that traditional, frequency-based labeling methods miss.
2. **High-Precision Benign Labeling:** LLMs were able to label these anomalies with remarkably high precision, confirming the benign nature of the complex edge cases identified by anomaly detection.
3. **Significant False Positive Reduction:** Leveraging this diverse benign data substantially reduced the false-positive rates in the subsequent supervised command-line classification models.
4. **Efficiency Gain:** The method allows for leveraging plentiful existing production data, bypassing the "needles in a haystack" problem associated with waiting for rare malicious command lines to appear in the operational stream for labeling.
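The false-positive rate referred to in these findings is the fraction of benign command lines that a classifier flags as malicious. A minimal sketch of that metric follows. The label encoding (1 = malicious, 0 = benign) and the toy predictions are assumptions made for illustration and are not figures from the research.

```python
def false_positive_rate(y_true: list[int], y_pred: list[int]) -> float:
    """FPR = FP / (FP + TN), with 1 = malicious and 0 = benign."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    if fp + tn == 0:
        raise ValueError("y_true contains no benign samples")
    return fp / (fp + tn)


# Toy labels and predictions from two hypothetical classifiers.
y_true = [0, 0, 0, 0, 1, 1]
model_a_pred = [1, 1, 0, 0, 1, 1]  # flags two of four benign commands
model_b_pred = [0, 1, 0, 0, 1, 1]  # flags one of four benign commands

fpr_a = false_positive_rate(y_true, model_a_pred)
fpr_b = false_positive_rate(y_true, model_b_pred)
print(fpr_a, fpr_b)
```

Comparing this value for a classifier before and after benign augmentation, on the same held-out set, is one straightforward way to measure the effect described in finding 3.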
### Supporting Evidence
- The process achieved "near-perfect precision" when using LLMs to label the anomalies identified by the detection system as benign.
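Precision on the benign label can be checked by auditing a sample of LLM verdicts against human review. The sketch below assumes string labels and an audited sample that is invented for the example; it does not reproduce the authors' evaluation.

```python
def benign_label_precision(llm_labels: list[str], audited_labels: list[str]) -> float:
    """Of the items the LLM called benign, the fraction an auditor agrees are benign."""
    called_benign = [
        audit for llm, audit in zip(llm_labels, audited_labels) if llm == "benign"
    ]
    if not called_benign:
        raise ValueError("the LLM labeled nothing as benign")
    return sum(1 for audit in called_benign if audit == "benign") / len(called_benign)


# Toy audit sample: three benign verdicts, one of which the auditor rejects.
llm_labels = ["benign", "benign", "benign", "malicious"]
audited_labels = ["benign", "benign", "malicious", "malicious"]

precision = benign_label_precision(llm_labels, audited_labels)
print(round(precision, 3))
```

Precision is the metric that matters here because a malicious command mislabeled as benign would be written into the training set as a negative example.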
### Novel Contributions
- **Paradigm Shift in Anomaly Detection Use:** The core innovation is shifting anomaly detection from a primary, unsupervised generator of malicious alerts (which suffers from noise) to a data discovery mechanism that feeds supervised training with broader coverage of *benign* behavior.
- **Bridging the Benign Data Gap:** The research effectively addresses the problem where traditional benign labeling only captures simple, frequent behavior, thereby allowing sophisticated benign commands to be misclassified as malicious.
## Technical Details
The research highlights a specific tradeoff faced by practitioners: relying on costly labeled data versus noisy unsupervised detection. Traditional benign labeling favors high-frequency, low-complexity commands. This research introduced a pipeline in which anomaly detection identifies the complex, low-frequency benign commands (the "edge cases"). These edge cases are then vetted by an LLM to confirm that they are benign, creating a richer training set that better represents the true operational environment. This augmented data is then used to train a dedicated, supervised classifier.
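The coverage gap left by frequency-based labeling can be illustrated with a simple count. The sketch below is an assumption-laden toy: it uses exact-match counts as the notion of "frequent" and "rare", whereas a real anomaly detector would work on learned or engineered features. The thresholds and telemetry are invented.

```python
from collections import Counter


def frequent_commands(commands: list[str], min_count: int) -> set[str]:
    """Commands a frequency-based labeling pass would pick up."""
    counts = Counter(commands)
    return {cmd for cmd, n in counts.items() if n >= min_count}


def rare_commands(commands: list[str], max_count: int) -> set[str]:
    """Low-frequency commands, used here as a crude proxy for anomalies."""
    counts = Counter(commands)
    return {cmd for cmd, n in counts.items() if n <= max_count}


# Toy telemetry: two common commands and one rare but legitimate admin command.
edge_case = r"robocopy C:\src D:\dst /MIR /R:2"
telemetry = ["ls -la"] * 50 + ["git status"] * 30 + [edge_case]

covered = frequent_commands(telemetry, min_count=10)
missed = set(telemetry) - covered
candidates = rare_commands(telemetry, max_count=1)
print(missed == candidates)
```

In this toy, everything the frequency pass misses is exactly what the rarity pass surfaces, which is the set that would be sent on for LLM vetting.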
## Practical Implications
### For Security Practitioners
- It provides a validated strategy to significantly decrease the noise floor (false positives) associated with command line monitoring without requiring extensive manual review of benign activity.
- It offers a scalable method to improve the robustness of existing detection methodologies by diversifying the training data to include complex, rare benign activity.
### For Defenders
- **Actionable Insight:** Defenders should investigate using anomaly detection outputs, not just as final alerts, but as sources for high-value, diverse benign training data when combined with reliable automated labeling (like advanced LLMs).
- **Improved Resilience:** Classifiers trained on this diverse view of benign execution will be better equipped against novel but legitimate administrative or software deployment commands that might otherwise trigger alerts.
### For Researchers
- This work encourages future exploration into leveraging unsupervised methods to enrich supervised datasets, particularly where obtaining targeted malicious labels is inherently difficult or slow.
## Limitations
The provided text does not explicitly list limitations. Implicitly, reliance on external LLMs for labeling introduces vendor lock-in risk, cost, and latency in the labeling pipeline. The approach also depends on the LLM maintaining "near-perfect precision" across all evaluated anomalies.
## Comparison to Prior Work
Traditional cybersecurity approaches necessitate a choice between costly, high-precision supervised labeling or scalable but noisy unsupervised anomaly detection. This research diverges by integrating the two: anomaly detection (prioritizing novelty) is used to *find* data, while LLM-based analysis (prioritizing accuracy) is used to *label* that novelty correctly as benign, thereby strengthening the supervised system rather than operating as a standalone detector.
## Real-world Applications
- **Endpoint Detection and Response (EDR) Improvements:** Direct application to command line logging modules in EDR tools to lower alert fatigue.
- **Security Orchestration, Automation, and Response (SOAR):** Creating more reliable inputs for automated triage processes by reducing benign noise.
- **Implementation Considerations:** Requires integrating access to capable LLMs for the automated vetting stage, necessitating careful consideration of data flow and privacy constraints if proprietary command data is used for labeling.
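One way to handle the privacy constraint noted above is to mask obviously sensitive substrings before a command line leaves the environment for labeling. The following is a minimal sketch under assumptions: the three patterns (IPv4 addresses, Windows user profile names, and a few secret-bearing flags) and the placeholder tokens are illustrative choices, not a complete redaction policy and not something described in the source.

```python
import re

# Illustrative patterns only; a production policy would need far broader coverage.
_PATTERNS = [
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<IP>"),
    (re.compile(r"(?i)(C:\\Users\\)[^\\\s]+"), r"\1<USER>"),
    (re.compile(r"(?i)(--?(?:password|token|apikey)[=\s]+)\S+"), r"\1<SECRET>"),
]


def redact(cmd: str) -> str:
    """Apply each masking pattern in order and return the redacted command."""
    for pattern, replacement in _PATTERNS:
        cmd = pattern.sub(replacement, cmd)
    return cmd


print(redact("curl --token abc123 http://10.0.0.5/x"))
print(redact(r"type C:\Users\alice\notes.txt"))
```

Redaction trades some labeling context for privacy, so any such masking should be checked for its effect on label quality.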
## Future Work
The summary implies that future work would involve testing this augmented training methodology across different categories of command line usage, exploring the volume of benign anomalies needed before the gains saturate, and assessing performance against zero-day or heavily obfuscated attacks designed to resemble complex but legitimate activity.
## References
- Research linking anomaly detection in cybersecurity: [arXiv:2412.04259](https://arxiv.org/abs/2412.04259)
- Related presentation: Black Hat USA 2025 Briefing on the topic.