Full Report
Building an efficient small language model for cybersecurity, from data prep to deployment
Analysis Summary
# Tool/Technique: Fine-tuned Small Language Model (SLM) for Secret Detection
## Overview
A specialized small language model (SLM), fine-tuned from a Llama 3.2 1B base, that detects sensitive credentials (secrets) in code files. It offers better contextual understanding than traditional regex methods while staying efficient, and it avoids the high computational and privacy costs of deploying large language models (LLMs) at enterprise scale.
## Technical Details
- Type: Tool (Specialized AI Model)
- Platform: Not stated explicitly; the CPU-based inference requirement suggests general-purpose servers or pipelines that scan code repositories.
- Capabilities: High-accuracy secret detection with lower false positive rates compared to regex; enables on-premises deployment for privacy compliance.
- First Seen: Not specified; the research implies the model is being integrated into Wiz's solution while development continues.
## MITRE ATT&CK Mapping
This research focuses on **Defensive Security Measures and Threat Intelligence** rather than an adversary technique. However, the *target* of this tool relates to detecting common techniques used by threat actors:
- **Defense Evasion** (indirect, by detecting the preparatory stages of an attack)
  - T1027 - Obfuscated Files or Information (if secrets are hidden or obfuscated)
- **Credential Access** (exposed secrets in files are what adversaries harvest during this phase)
  - T1552 - Unsecured Credentials
    - T1552.001 - Credentials in Files
## Functionality
### Core Capabilities
- **Secret Identification:** Differentiating actual secrets from false positives using contextual understanding derived from code structure.
- **Efficiency:** Reaches 86% precision and 82% recall with a 1B-parameter model (Llama 3.2 1B) that can run on standard CPU hardware.
- **Scale:** Designed to process millions of code files daily without the latency and cost of large commercial LLM APIs.
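
The precision and recall figures above can be combined into a single F1 score. The following is a minimal worked sketch; the function names are illustrative, and the F1 value is derived here from the two reported numbers rather than quoted from the research.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Compute precision and recall from raw detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Derived from the reported 86% precision and 82% recall.
reported_f1 = f1_score(0.86, 0.82)
print(f"F1 = {reported_f1:.3f}")  # F1 = 0.840
```

Precision reflects how many flagged strings are real secrets (fewer false positives), while recall reflects how many real secrets are found.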
### Advanced Features
- **LLM-Assisted Labeling:** Uses larger LLMs (e.g., Sonnet 3.7) in a multi-agent workflow to generate high-quality, structured metadata for labeling training data, which significantly accelerates dataset creation.
- **Contextual Analysis:** Uses the surrounding code to separate actual secrets from non-sensitive strings more reliably than regex.
- **On-Premises Deployment:** Facilitates deployment within customer environments to satisfy strict data privacy and regulatory requirements.
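
To make the labeling idea concrete, the sketch below shows one way structured labels might be turned into a fine-tuning example. The `SecretLabel` schema, the prompt wording, and `to_training_example` are all hypothetical; the research does not publish its actual label format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class SecretLabel:
    # Hypothetical label schema; field names are assumptions for illustration.
    line: int
    value: str
    category: str
    is_secret: bool


def to_training_example(snippet: str, labels: list[SecretLabel]) -> dict:
    """Pair a code snippet with a JSON target that lists only the true secrets."""
    target = [asdict(label) for label in labels if label.is_secret]
    return {
        "prompt": "List the secrets in the following code as JSON.\n\n" + snippet,
        "completion": json.dumps(target, sort_keys=True),
    }


snippet = 'API_KEY = "your-api-key-here"\npassword = "hunter2hunter2"\n'
labels = [
    SecretLabel(line=1, value="your-api-key-here", category="placeholder", is_secret=False),
    SecretLabel(line=2, value="hunter2hunter2", category="password", is_secret=True),
]
example = to_training_example(snippet, labels)
print(example["completion"])
```

Keeping placeholders in the labeled data as explicit negatives is one plausible way to teach a model the distinction that regex alone misses.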
## Indicators of Compromise
This tool's purpose is detection, not compromise, so traditional IoCs are not generated by the tool itself. The *output* relates to detecting:
- File Hashes: N/A (Detects content *within* files)
- File Names: N/A
- Registry Keys: N/A
- Network Indicators: N/A
- Behavioral Indicators: Detection of sensitive strings resembling API keys, tokens, or passwords embedded in source code or configuration files.
## Associated Threat Actors
No threat actors are associated with this tool. **Wiz** develops and uses it for defensive purposes as part of its Data Security Posture Management (DSPM) solution, to thwart actors who rely on *stolen credentials* (cited as present in 31% of breaches).
## Detection Methods
The fine-tuned small language model itself is the detection engine.
- Signature-based detection: Replaced by advanced model inference, though the training data reflects millions of signatures/patterns previously used in regex.
- Behavioral detection: The model analyzes the structural context of the code surrounding the potential secret.
- YARA rules: Not applicable; this is a machine learning inference system.
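
The contrast between signature matching and contextual analysis can be sketched as follows. A simple regex flags both a documentation placeholder and a real-looking password; attaching the surrounding lines gives a downstream model the context needed to tell them apart. The pattern, the window size, and `candidates_with_context` are illustrative assumptions, not the rules or pipeline used by the tool.

```python
import re

# Illustrative pattern only: a keyword, an assignment, and a quoted value.
CANDIDATE = re.compile(
    r"""(?i)(api[_-]?key|token|password|secret)\s*[:=]\s*["']([^"']{8,})["']"""
)


def candidates_with_context(source: str, window: int = 2):
    """Yield each regex hit together with `window` lines of surrounding code."""
    lines = source.splitlines()
    for i, line in enumerate(lines):
        match = CANDIDATE.search(line)
        if match:
            start = max(0, i - window)
            end = min(len(lines), i + window + 1)
            yield {
                "line": i + 1,
                "value": match.group(2),
                "context": "\n".join(lines[start:end]),
            }


sample = '''# example config for docs
API_KEY = "your-api-key-here"
def connect():
    password = "hunter2hunter2"
    return login(password)
'''

for hit in candidates_with_context(sample):
    print(hit["line"], hit["value"])
```

The regex reports both values as findings; only the context (a comment marking the file as an example) indicates that the first is a placeholder.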
## Mitigation Strategies
The implementation of this technology serves as a robust mitigation strategy against credential compromise stemming from code exposure.
- **Prevention Measures:** Proactive scanning of codebases (including full Git history) to identify and remediate exposed secrets before exploitation.
- **Hardening Recommendations:** Leveraging specialized, efficient on-premises AI to maintain continuous, comprehensive secret monitoring without violating compliance boundaries.
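
Scanning full Git history means inspecting lines that were added in any commit, even if later deleted. The sketch below is one possible way to collect those lines from `git log -p` output so they can be handed to a scanner; `added_lines` and `history_patch` are hypothetical helpers, not part of the described product.

```python
import subprocess


def added_lines(patch_text: str):
    """Yield (commit, path, line) for each line added in `git log -p` output.

    Simplification: an added line whose content itself starts with "++ "
    would be mistaken for a file header.
    """
    commit, path = None, None
    for raw in patch_text.splitlines():
        if raw.startswith("commit "):
            commit = raw.split()[1]
        elif raw.startswith("+++ "):
            # "+++ b/<path>" names the post-change file; "/dev/null" means deletion.
            target = raw[4:]
            path = target[2:] if target.startswith("b/") else None
        elif raw.startswith("+") and path is not None:
            yield commit, path, raw[1:]


def history_patch(repo_dir: str) -> str:
    """Return the patch text for every commit on every ref of a local repository."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "log", "-p", "--all", "--no-color"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

A typical use would be `for commit, path, line in added_lines(history_patch("path/to/repo")): ...`, passing each line (with context) to the detection step.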
## Related Tools/Techniques
- **Traditional Regex Secret Scanning:** The method the small language model is explicitly designed to outperform.
- **Large Language Models (e.g., GPT-4o, Claude Sonnet 4):** Larger models whose inference cost and latency are prohibitive for scalable security scanning, but which were used for data labeling.
- **Wiz's Existing Secrets Scanning:** The small language model augments existing Wiz capabilities, which include automated validity checks and contextualization of findings with cloud/runtime permissions.