Full Report
Statistical analysis is used all the time in computer science to solve hard problems, and machine learning in particular has boomed lately. Sometimes, though, simple statistical analysis can solve a hard problem without the insanity of LLMs. This post is one of those cases.

n-gram analysis is common in linguistics. Simply put, it takes a grouping of n consecutive tokens, such as words, and estimates how likely that grouping is to occur. From this, it's possible to predict text by picking the most likely next word. The author applies the same technique to binary analysis of machine code. From testing, they found that 3-grams work well without overfitting. I'm guessing they tried several different values of n. Previous work has shown that this kind of analysis can both identify anomalies in code and find patterns that help reverse engineer unknown ISAs.

To do the analysis, the author lifted each binary into a Binary Ninja intermediate language. Additionally, they removed registers and memory addresses to make the instructions more general. With this in place, they analyzed a large number of binaries to get a ground truth. Now they can analyze new binaries and look for anomalies!

In malware, they were able to identify control-flow flattening obfuscation. Every function identified by the heuristic is either obfuscated or is a helper function that manages the obfuscated state. In the Windows kernel, they analyzed the Warbird virtual machine. By finding an obscure pattern of code in the asm, they were able to locate the VM's obfuscated handlers. They also analyzed a mobile DRM that plays encrypted multimedia content. There, the heuristic identified areas of arithmetic obfuscation via Mixed Boolean-Arithmetic, as well as uses of hardware encryption. This was enough to demonstrate they were looking in the proper area.

Stats don't lie! Statistics is useful for many things, including binary analysis.
Great post on using techniques from other disciplines in the realm of security.
Analysis Summary
# Research: Statistical Analysis to Detect Uncommon Code
## Metadata
- **Authors:** Tim Blazytko
- **Institution:** Independent Researcher (Synthesis.to)
- **Publication:** [Synthesis.to Blog](https://synthesis.to/2023/01/26/uncommon_code.html)
- **Date:** January 26, 2023
## Abstract
This research explores the application of n-gram statistical analysis, a technique rooted in computational linguistics, to binary analysis. By treating short sequences of normalized instructions as n-grams, the author develops an architecture-agnostic heuristic to identify uncommon, anomalous code patterns. The method effectively pinpoints obfuscated code, such as control-flow flattening and Mixed Boolean-Arithmetic (MBA), without relying on traditional complexity metrics.
## Research Objective
The research aims to determine if simple statistical patterns (n-grams) can be used to identify interesting or "weird" code segments in a binary. Specifically, it seeks to automate the detection of obfuscation and domain-specific logic (like DRM or VM handlers) by highlighting code that deviates from "natural" assembly frequency distributions.
## Methodology
### Approach
The researcher utilizes **n-gram analysis**, specifically focusing on **3-grams** (sequences of three tokens). To ensure the analysis focuses on logic rather than specific addresses:
1. **Lifting:** Binaries are lifted into the **Binary Ninja Intermediate Language (BNIL)**.
2. **Normalization:** Registers, immediate values, and memory addresses are stripped or generalized to create abstract instruction tokens.
3. **Sliding Window:** A sliding window of size $n$ moves through the instructions to count occurrences.
4. **Scoring:** Functions are scored based on the rarity of the n-grams they contain compared to a pre-defined "ground truth" distribution.
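
The sliding-window and scoring steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's plugin code: the token lists are hand-written stand-ins for normalized IL instructions, the helper names (`ngrams`, `build_baseline`, `rarity_score`) are hypothetical, and the score used here (the fraction of a function's 3-grams missing from the set of most common baseline 3-grams) is just one plausible reading of "rarity".

```python
from collections import Counter


def ngrams(tokens, n=3):
    """Slide a window of size n over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_baseline(corpus, n=3):
    """Count every n-gram across a corpus of token sequences (one per function)."""
    counts = Counter()
    for tokens in corpus:
        counts.update(ngrams(tokens, n))
    return counts


def rarity_score(tokens, common, n=3):
    """Fraction of a function's n-grams that are not in the common set."""
    grams = ngrams(tokens, n)
    if not grams:
        return 0.0  # function too short to contain a single n-gram
    return sum(1 for g in grams if g not in common) / len(grams)


# Hypothetical, hand-written token sequences standing in for normalized IL.
corpus = [
    ["push", "mov", "sub", "mov", "call", "add", "pop", "ret"],
    ["push", "mov", "sub", "mov", "mov", "call", "add", "pop", "ret"],
    ["push", "mov", "mov", "call", "add", "pop", "ret"],
]
baseline = build_baseline(corpus)
# Assumption: keep only the most frequent n-grams as the "ground truth" set.
common = {gram for gram, _ in baseline.most_common(1000)}

functions = {
    "plain": ["push", "mov", "sub", "mov", "call", "add", "pop", "ret"],
    "weird": ["xor", "and", "add", "xor", "or", "sub", "not", "ret"],
}
# Rank functions from most to least unusual.
ranked = sorted(
    functions,
    key=lambda name: rarity_score(functions[name], common),
    reverse=True,
)
```

Sorting by score puts the functions an analyst would likely want to look at first at the top of the list, which is the triage workflow the summary describes.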
### Dataset/Environment
- **Ground Truth:** A large corpus of "clean" binaries across multiple architectures to establish baseline instruction frequencies.
- **Test Samples:** Included malware with control-flow flattening, Windows Kernel modules (Warbird VM), Anti-cheat software, and Mobile DRM systems.
- **Architectures:** x86, x86-64, ARM32, and AARCH64.
### Tools & Technologies
- **Binary Ninja:** As the primary reversing platform and IL provider.
- **Obfuscation Detection Plugin:** An architecture-agnostic plugin developed by the author to implement these heuristics.
## Key Findings
### Primary Results
1. **3-Grams are the "Sweet Spot":** Testing revealed that $n=3$ provides sufficient context to identify meaningful patterns without overfitting (which occurs at higher $n$ values).
2. **Obfuscation Detection:** The heuristic successfully flagged every function involving control-flow flattening or associated state-management helper functions.
3. **Pattern Identification:** The tool identified Mixed Boolean-Arithmetic (MBA) and hardware encryption calls in mobile DRM by flagging their rare arithmetic sequences.
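
For readers unfamiliar with MBA, the following sketch shows why such code produces unusual instruction sequences. It uses a well-known textbook identity, `x + y == (x ^ y) + 2 * (x & y)`, and is not taken from the DRM sample discussed in the post. The function names and the 32-bit mask are assumptions chosen for illustration.

```python
import random

MASK = 0xFFFFFFFF  # emulate 32-bit wraparound arithmetic


def plain_add(x, y):
    """What a compiler would normally emit: a single add."""
    return (x + y) & MASK


def mba_add(x, y):
    """The same addition rewritten as a Mixed Boolean-Arithmetic expression.

    XOR adds the bits without carries; AND finds the carry bits,
    which are then shifted into place by the multiplication by 2.
    """
    return ((x ^ y) + 2 * (x & y)) & MASK


# Check the identity on a batch of random 32-bit inputs.
rng = random.Random(0)
samples = [(rng.getrandbits(32), rng.getrandbits(32)) for _ in range(1000)]
all_equal = all(plain_add(x, y) == mba_add(x, y) for x, y in samples)
```

A single `add` becomes a chain of `xor`, `and`, `mul` (or a shift), and `add`. Stacking many such rewrites yields the dense mix of Boolean and arithmetic operations that a frequency-based heuristic can flag.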
### Supporting Evidence
- **Warbird VM:** In the Windows kernel, the tool identified obscure ASM patterns that pinpointed VM handlers despite heavy obfuscation.
- **Consistency:** Rare code sequences consistently correlated with either highly optimized "hot paths" or intentionally obfuscated code.
### Novel Contributions
- **Linguistic Logic applied to Binaries:** Moving away from complexity-based metrics (like cyclomatic complexity) toward frequency-based anomaly detection.
- **Architecture Agnosticism:** By using Intermediate Language (IL) normalization, the same statistical model works across different CPU architectures.
## Technical Details
The core innovation lies in the **Normalization Layer**. By converting `mov eax, 0x10` and `mov ebx, 0x20` into a generic `mov (reg, imm)` token, the researcher reduces the "alphabet" of the language. This allows the statistics to focus on the *functional flow* of instructions. Rare 3-gram sequences, such as chains that interleave bitwise operations (and, or, xor) with arithmetic (add, sub, mul) in ways rarely seen in compiler-generated code, serve as "signatures" for Mixed Boolean-Arithmetic (MBA).
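
The normalization idea can be illustrated with a toy example. The sketch below works on plain x86 assembly strings rather than on Binary Ninja's IL, so it is only an approximation of the step described above. The register list, the `classify` and `normalize` helpers, and the output format are assumptions made for this example.

```python
import re

# Illustrative subset of 32-bit x86 registers; a real tool would get operand
# types from the IL instead of matching names.
REGISTERS = {"eax", "ebx", "ecx", "edx", "esi", "edi", "esp", "ebp"}
IMMEDIATE = re.compile(r"^(0x[0-9a-fA-F]+|\d+)$")


def classify(operand):
    """Map a concrete operand to an abstract operand class."""
    operand = operand.strip()
    if operand.startswith("[") and operand.endswith("]"):
        return "mem"
    if operand.lower() in REGISTERS:
        return "reg"
    if IMMEDIATE.match(operand):
        return "imm"
    return "other"


def normalize(instruction):
    """Turn e.g. 'mov eax, 0x10' into the abstract token 'mov (reg, imm)'."""
    mnemonic, _, rest = instruction.strip().partition(" ")
    if not rest.strip():
        return mnemonic.lower()  # no operands, e.g. 'ret'
    operands = [classify(op) for op in rest.split(",")]
    return f"{mnemonic.lower()} ({', '.join(operands)})"


tokens = [normalize(i) for i in ("mov eax, 0x10", "mov ebx, 0x20", "ret")]
```

Both `mov` instructions collapse to the same token, so the n-gram counts reflect the shape of the code rather than the registers and constants a particular build happened to use.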
## Practical Implications
### For Security Practitioners
- **Faster Triage:** Analysts can immediately jump to the "weirdest" functions in a large binary to find protection mechanisms or core logic.
- **Tooling Integration:** The method can be integrated into CI/CD pipelines to flag unexpected code changes that resemble malware-style obfuscation.
### For Defenders
- **Detection Bypass Identification:** Defenders can use this to identify where malware authors have used custom packers or "junk code" insertion that breaks standard signature-based detection.
### For Researchers
- **Cross-Architecture Patterns:** The results suggest that, while ISAs differ, the "linguistics" of compiled code look remarkably similar across architectures when viewed through n-grams over a normalized IL.
## Limitations
- **Overfitting:** Higher n-gram values (n > 5) become too specific to a single compiler version or optimization flag.
- **Data Dependency:** The effectiveness relies on a robust "ground truth" dataset; if the baseline is too small, benign code might be flagged as an anomaly.
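
Both limitations come down to data sparsity, which a small synthetic experiment can make concrete. The sketch below draws uniformly random tokens from a six-word alphabet. Real instruction streams are far more structured, so the numbers only illustrate the trend and say nothing about the author's dataset. All names here (`unseen_fraction`, `stream`, the alphabet) are hypothetical.

```python
import random


def ngrams(tokens, n):
    """Slide a window of size n over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def unseen_fraction(baseline_tokens, test_tokens, n):
    """Fraction of the test n-grams that never occur in the baseline."""
    known = set(ngrams(baseline_tokens, n))
    grams = ngrams(test_tokens, n)
    return sum(1 for g in grams if g not in known) / len(grams)


rng = random.Random(0)
ALPHABET = ["mov", "add", "sub", "cmp", "jmp", "call"]


def stream(length):
    """Synthetic stand-in for a normalized instruction stream."""
    return [rng.choice(ALPHABET) for _ in range(length)]


large_baseline = stream(5000)
small_baseline = large_baseline[:50]
benign = stream(500)  # "new" code drawn from the same distribution

# Effect of n: longer n-grams are increasingly likely to be unseen,
# even though the test data is statistically identical to the baseline.
by_n = {n: unseen_fraction(large_baseline, benign, n) for n in (1, 2, 3, 5, 8)}

# Effect of baseline size at n = 3: a tiny baseline flags far more benign code.
small_vs_large = (
    unseen_fraction(small_baseline, benign, 3),
    unseen_fraction(large_baseline, benign, 3),
)
```

In this toy setup nearly every 8-gram in the benign stream counts as "rare", which is the overfitting effect: the baseline has memorized exact sequences instead of general patterns. Shrinking the baseline has a similar effect even at n = 3.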
## Comparison to Prior Work
Unlike previous heuristics that focused on **Complexity Metrics** (e.g., number of branches, instruction density), this work focuses on **Statistical Frequency**. It complements earlier work on control-flow flattening detection by adding a "probabilistic" dimension to the analysis.
## Real-world Applications
- **Malware Analysis:** Identifying custom crypters or obfuscation layers.
- **Vulnerability Research:** Finding "uncommon" code paths that may not have been as heavily tested as standard compiler output.
- **DRM/Anti-Cheat:** Locating the core "check" logic and VM-based protection layers in commercial software.
## Future Work
- Exploring whether **Skip-grams** (n-grams where some tokens are ignored) could better handle "junk code" insertion.
- Automating the classification of *types* of obfuscation based on the specific "uncommon" patterns found.
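
To make the skip-gram idea concrete, here is a minimal sketch of how skip-grams could be extracted and why they might tolerate inserted junk instructions. This is a speculative illustration of the future-work item, not something the author has published. The `skipgrams` helper and the token sequences are hypothetical.

```python
from itertools import combinations


def skipgrams(tokens, n=3, k=1):
    """Return all n-token subsequences that skip at most k tokens in total.

    With k=0 this reduces to ordinary contiguous n-grams.
    """
    grams = set()
    for start in range(len(tokens)):
        window = tokens[start:start + n + k]
        if len(window) < n:
            break
        # Keep the first token fixed so each gram is anchored at `start`,
        # then choose the remaining n-1 tokens from the rest of the window.
        for rest in combinations(range(1, len(window)), n - 1):
            grams.add((window[0],) + tuple(window[i] for i in rest))
    return grams


original = ["push", "mov", "call"]
junked = ["push", "nop", "mov", "call"]  # one junk instruction inserted

contiguous = skipgrams(junked, n=3, k=0)
with_skips = skipgrams(junked, n=3, k=1)
```

Here the original pattern `("push", "mov", "call")` no longer appears among the contiguous 3-grams of the junked sequence, but it is recovered once a single skipped token is allowed. The trade-off is that the number of grams per window grows quickly with `k`.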
## References
- Blazytko, T. (2021). *Heuristics for Control-Flow Flattening Detection*. [synthesis[.]to]
- Blazytko, T. (2021). *Practical MBA Deobfuscation*. [synthesis[.]to]
- Binary Ninja Documentation. [binary[.]ninja]