Full Report
Input validation is a crucial part of web application security. However, with all of the data parsing involved, there are many ways it can go wrong: finding a different endpoint, bypassing the regex, and plenty of others. This post covers a technique built around normalization, the process of translating data into a standard, easier-to-handle format. For instance, changing capitalization is one form of normalization. Some steps are obviously normalization, while others happen as a side effect of general string handling. For instance, calling unidecode on a string in Python can change it in unexpected ways. Between regex matching, string parsing, and everything else, alternate representations of the same data slip through the cracks.

For instance, take the regex ^(?:https?:\/\/)?(?:[^\/]+\.)?example\.com(?:\.*)?$. This is meant to match URLs whose host is example.com or one of its subdomains. According to the post, the text https://example՟com is accepted by the regex as a domain argument, then translated into something entirely different in Punycode, causing a crazy bypass.

How did they find this out? Using their new tool, REcollapse, a black-box regex fuzzer. The tool seems pretty rad for finding regex parsing issues. To use it, choose separator points and normalization points in the input, then mutate the input at those points until something gets past the regex.

They give some real-world examples from a talk. The first interesting one was a redirect URI for OAuth. Using anything besides the standard URL caused issues. However, by fuzzing away at the API, they found that %3b%40 (;@) bypassed the redirect link parsing but STILL went to our endpoint. They used this to cause cache confusion, a Shopify account takeover, and many other bugs. The tool looks pretty easy to use as well, which is awesome. Parsing differences between two different systems will always be a problem!
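To make the validate-then-normalize problem concrete, here is a minimal Python sketch. It is not taken from the post: the blocklist regex, the `validate` and `backend_normalize` helpers, and the use of NFKC via the standard library's `unicodedata` (rather than unidecode) are all assumptions chosen for illustration.

```python
import re
import unicodedata

# Hypothetical filter: reject anything containing an ASCII "<script".
BLOCKLIST = re.compile(r"<script", re.IGNORECASE)


def validate(raw: str) -> bool:
    """Return True when the raw input passes the (naive) filter."""
    return BLOCKLIST.search(raw) is None


def backend_normalize(raw: str) -> str:
    """Stand-in for a later processing step that normalizes with NFKC."""
    return unicodedata.normalize("NFKC", raw)


# Fullwidth "<" (U+FF1C) and ">" (U+FF1E) are not matched by the filter...
payload = "\uff1cscript\uff1e"
passes_filter = validate(payload)

# ...but NFKC folds them to their ASCII equivalents afterwards.
stored = backend_normalize(payload)

# Case changes are normalization too: the Kelvin sign lowercases to ASCII "k".
kelvin_lower = "\u212a".lower()
```

The filter and the normalizer disagree about what the string is, which is the kind of gap the post describes.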
Analysis Summary
# Tool/Technique: REcollapse
## Overview
REcollapse is a black-box fuzzer/generator designed to identify vulnerabilities in web applications caused by improper input validation, weak regular expressions, and inconsistent normalization. It aids researchers in finding bypasses for security controls like Web Application Firewalls (WAFs), URL whitelists, and email validation logic by generating permutations of input that "collapse" into a target string after being processed by the backend.
## Technical Details
- **Type**: Tool / Attack Technique (Fuzzing/Normalization Bypass)
- **Platform**: Web-based applications and APIs (backend-agnostic; affected stacks include Python and PHP, among others)
- **Capabilities**: Regex fuzzing, normalization testing, Punycode generation, and encoding manipulation.
- **First Seen**: November 2022 (Official public release)
## MITRE ATT&CK Mapping
- **TA0001 - Initial Access**
- **T1190 - Exploit Public-Facing Application**: Using REcollapse to find bypasses in validation logic to exploit application vulnerabilities.
- **TA0006 - Credential Access**
- **T1550 - Use Alternate Authentication Material**: Facilitating account takeovers (ATO) via normalization confusion in password reset or OAuth flows.
- **TA0005 - Defense Evasion**
- **T1564 - Hide Artifacts**: Using obscure Unicode characters that "normalize" to standard characters to bypass WAF signatures.
## Functionality
### Core Capabilities
- **Regex Fuzzing**: Generates inputs specifically designed to break common regular expressions (e.g., email or URL validators), such as those copied from Stack Overflow or suggested by GitHub Copilot.
- **Normalization Analysis**: Identifies how backends transform special characters (e.g., `Ãéï°` to `Aeideg`) using libraries like Python's `unidecode` or PHP's `iconv`.
- **Separator/Normalization Point Identification**: Users define points in a string where a regex might be vulnerable (like a dot in a domain) to generate targeted payloads.
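The separator-point idea can be sketched in a few lines of Python. This is only an illustration of the concept, not REcollapse's actual implementation or command-line interface; the function name `fuzz_separators`, its parameters, and the default byte range are hypothetical.

```python
from typing import Iterable, Iterator


def fuzz_separators(
    text: str,
    separators: str = ".",
    byte_range: Iterable[int] = range(0x00, 0x80),
) -> Iterator[str]:
    """Yield variants of `text` with one percent-encoded byte inserted
    immediately before, and immediately after, each separator character."""
    candidates = list(byte_range)
    for i, ch in enumerate(text):
        if ch not in separators:
            continue
        for b in candidates:
            encoded = f"%{b:02x}"
            yield text[:i] + encoded + text[i:]          # before the separator
            yield text[: i + 1] + encoded + text[i + 1:]  # after the separator


# Example: try only byte 0x3b (";") around the dot in a short string.
variants = list(fuzz_separators("a.b", ".", range(0x3B, 0x3C)))
```

Each variant would then be sent to the target (for example with a generic HTTP fuzzer) to see which ones the validator accepts.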
### Advanced Features
- **Punycode Exploitation**: Generates malicious domains (e.g., `example՟com`) that bypass filters but resolve to attacker-controlled infrastructure when normalized.
- **Account Takeover (ATO) Vectors**: Creates collisions where an attacker’s email (e.g., `hil°[email protected]`) normalizes to a victim’s email (`[email protected]`), potentially hijacking password recovery links.
- **Cache Confusion**: Uses specific character sequences (e.g., `%3b%40`, which decodes to `;@`) to bypass redirect parsers while maintaining functionality, leading to web cache deception or Shopify account takeovers.
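As a rough illustration of the Punycode step, the sketch below uses Python's built-in `punycode` codec to show how a label containing a non-ASCII character turns into a pure-ASCII form. It assumes the character in the example above is U+055F; the variable names are illustrative, and real resolvers and IDNA libraries apply extra mapping and validation rules that this sketch skips.

```python
# Label containing U+055F in place of the dot (assumed from the example above).
label = "example\u055fcom"

# The raw Punycode codec copies the ASCII characters first, then appends
# a "-" delimiter and an encoding of the non-ASCII characters.
encoded = label.encode("punycode")

# An IDNA "A-label" is the Punycode form with the "xn--" prefix.
a_label = b"xn--" + encoded

# Decoding recovers the original Unicode label.
roundtrip = encoded.decode("punycode")

# A check that only looks for an ASCII dot never sees a separator here.
has_ascii_dot = "." in label
```

The point is that the string a validator inspects and the hostname that is eventually resolved can be two different things.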
## Indicators of Compromise
- **File Names**: `recollapse` (binary/script name).
- **Network Indicators**:
- Request patterns containing sequences like `%3b%40` or unusual Unicode characters (e.g., `՟`).
- Access to Punycode domains: `xn--examplecom-ehl[.]com`.
- **Behavioral Indicators**:
- High-volume requests to `/signup`, `/login`, or `/password-reset` with slight variations in character encoding (typical of `ffuf` or `Burp Intruder` usage with REcollapse payloads).
## Associated Threat Actors
- **Security Researchers/Bug Bounty Hunters**: Primarily used by the ethical hacking community (e.g., 0xacb).
- **Generic Web Attackers**: The tool is open-source and can be used by any actor targeting web application logic.
## Detection Methods
- **Signature-based detection**: Monitor for common REcollapse payload patterns in HTTP parameters, specifically Punycode (`xn--`) labels and non-ASCII Unicode characters in email/URL fields.
- **Behavioral detection**: Identify "Normalizer Collisions" where two different raw input strings result in the same database entry or session state.
- **WAF Rules**: Implement rules that flag suspicious URL separators or unexpected normalization-prone characters in critical input fields.
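One way to look for normalizer collisions offline is to group raw inputs by their normalized form and flag any group with more than one member. The sketch below assumes the backend's normalization can be approximated by NFKC plus case folding; that is an assumption, and a real deployment should reuse whatever function the backend actually applies. The helper names and sample addresses are hypothetical.

```python
import unicodedata
from collections import defaultdict


def canonical(value: str) -> str:
    """Assumed stand-in for the backend's normalization routine."""
    return unicodedata.normalize("NFKC", value).casefold()


def find_collisions(values):
    """Map each normalized form to the distinct raw inputs that produce it,
    keeping only forms reached by more than one raw input."""
    groups = defaultdict(set)
    for value in values:
        groups[canonical(value)].add(value)
    return {key: sorted(raw) for key, raw in groups.items() if len(raw) > 1}


observed = [
    "admin@example.com",
    "ADMIN@example.com",
    "\uff41dmin@example.com",  # fullwidth "a" (U+FF41)
    "bob@example.com",
]
collisions = find_collisions(observed)
```

Running this over signup or password-reset logs would surface accounts whose identifiers differ only before normalization.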
## Mitigation Strategies
- **Consistent Normalization**: Ensure that data is normalized **before** validation occurs, and that the same normalization library is used across all microservices/endpoints.
- **Robust Regex**: Avoid copy-pasting regex; use well-maintained, standard libraries for URL and email parsing.
- **Whitelisting**: Use strict alphanumeric whitelists where possible and reject inputs containing non-standard Unicode characters in sensitive fields.
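The normalize-before-validate ordering can be sketched as follows. The allowlist pattern, the choice of NFKC, and the function name `normalize_then_validate` are assumptions for illustration; the key idea is that validation runs on the same string that downstream code will use.

```python
import re
import unicodedata
from typing import Optional

# Hypothetical strict allowlist for a hostname-like field.
ALLOWED = re.compile(r"[a-z0-9.-]+")


def normalize_then_validate(raw: str) -> Optional[str]:
    """Normalize first, then validate the normalized value.

    Returns the normalized string on success, or None when the input is
    rejected. Callers should use only the returned value from here on.
    """
    normalized = unicodedata.normalize("NFKC", raw).lower()
    if not normalized.isascii():
        return None  # reject anything still non-ASCII after normalization
    if ALLOWED.fullmatch(normalized) is None:
        return None
    return normalized


accepted = normalize_then_validate("Example.COM")
folded = normalize_then_validate("\uff45xample.com")  # fullwidth "e"
rejected = normalize_then_validate("example\u055fcom")
```

Returning the normalized value, rather than a boolean, makes it harder for later code to accidentally reuse the raw input.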
## Related Tools/Techniques
- **ffuf**: Often used in conjunction with REcollapse to send generated payloads.
- **Burp Intruder**: Used to automate the delivery of REcollapse-generated fuzzing strings.
- **Punycode Phishing**: A related technique using similar character confusion for social engineering.