Full Report
The author discusses how different parsers handling different syntaxes can lead to security issues. URLs, URIs, Content-Disposition headers, and Unicode are great examples of this. In Python, for instance, the urlopen function can read local files. CVE-2023-24329 showed that a space at the beginning of a URL could trigger an SSRF when the application relies on blocklisting. The point is that parser differentials can lead to serious security issues, and the author has several examples from their bug bounty work.

They had a cache poisoning issue where only the URL port was being cached. When sending specific ports, like 80 or 443, the application removed the port. When using a huge port number, though, the port was kept on the domain. The goal was to get the server-side parser to treat the port as invalid before normalization while the client/browser treated it as valid. When using leading zeros on the port, they noticed some odd effects. For instance, the server would take http://example.com:000123:443 and parse out http://example.com:000123, and the browser would then interpret this as http://example.com:123. The difference here was between the browser and the PHP backend.

The next vulnerability took three months of work to exploit. They had control over a URL, and the application would return a response from a PHP cURL request. They learned that providing the @ character and a path that started with /tmp allowed them to read files from the file system in the file upload code. However, the read was blind, since the file contents were being added to the $_FILES global variable. If sent with multipart/form-data, the contents go into the $_POST variable, but with no control of the file name. They experimented with the Content-Disposition header to make this possible. They had the source code for this application, so they were able to see the sinks. The confusion happens in the second request: by adding a double quote to the name in the request, it reads the contents of /etc/passwd. Since the username parameter was the closest thing to the file contents, the file was added to that variable and returned by PHP. The rest of the data is effectively ignored because the parser is very forgiving. This would eventually return the contents of /etc/passwd to the user, demonstrating a full file read via SSRF. The key was bypassing the $_FILES variable restriction to inject the file contents directly into the $_POST parameter.

To mitigate these types of issues, they had a few suggestions. First, have a single consistent parser for handling input. Realistically, this is impossible to do: some companies may use Python for one thing and NodeJS for another, and then the parsing will differ. Any time a check and a use happen in different components, it is really hard to get right. Another suggestion is to error out when parsing fails. Things should NOT fail open; if the syntax is wrong, a failure should occur. A final good one is input validation. If you have a file name, only allow alphabetic characters and an extension, nothing else. Good post!
Analysis Summary
# Research: The Minefield Between Syntaxes: Exploiting Syntax Confusion in the Wild
## Metadata
- **Authors:** Alex Brumen (aka Brumens)
- **Institution:** YesWeHack
- **Publication:** YesWeHack Blog / NahamCon 2025
- **Date:** October 17, 2025
## Abstract
This research explores "syntax confusion"—a class of vulnerabilities arising when multiple components in a system (browsers, proxies, frameworks, and libraries) interpret the same input differently due to inconsistent parsing rules. By identifying obscure syntax variations (such as named Unicode escapes, overlong port numbers, and extended header formats), the researcher demonstrates how to bypass security filters, achieve cache poisoning, and escalate Server-Side Request Forgery (SSRF) into full Local File Read (LFR).
## Research Objective
The research aims to identify lesser-known syntaxes in modern technologies and weaponize the resulting "parser differentials" to bypass security controls. It addresses the fundamental question: *How can disagreement between two or more parsers in an execution stack be leveraged to turn "sanitized" input into an exploit?*
## Methodology
### Approach
- **Differential Analysis:** Comparing how different languages (Python, PHP, Perl) and components (Browsers vs. Backends) process the same input string.
- **Specification Mining:** Deep-diving into RFCs and documentation (e.g., C Trigraphs, Unicode naming) to find "semantically equivalent" but syntactically different variants.
- **Weaponization:** Transforming quirky parser behaviors into documented exploit chains (SSRF, Cache Poisoning, LFR).
### Dataset/Environment
- Web application stacks involving Python (urllib), PHP (cURL and multipart parsers), and various browser engines.
- Real-world Bug Bounty scenarios where normalization differences exist between the edge (CDN/Proxy) and the origin (Application).
### Tools & Technologies
- Python (specifically `urlopen` and Unicode handling)
- PHP (cURL, `$_FILES`, and `$_POST` globals)
- HTTP Content-Disposition headers
- Burp Suite / Manual request manipulation
## Key Findings
### Primary Results
1. **Normalization Bypasses:** Small syntax deviations (like a leading space in a URL: `[space]http://...`) can bypass blocklist-based SSRF filters in Python (CVE-2023-24329).
2. **State-Inconsistent Normalization:** A "check-then-use" flaw exists where a security filter validates a normalized URL, but the backend processes an un-normalized version, leading to differential interpretation.
3. **Variable Injection via Multipart Confusion:** It is possible to redirect file contents from the restricted `$_FILES` global into the accessible `$_POST` variable in PHP by manipulating `Content-Disposition` quoting and naming.
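
The blocklist bypass in the first result can be illustrated with a minimal sketch. The two filter functions, the `BLOCKED_HOSTS` set, and the host-extraction pattern below are hypothetical and not taken from the research; they only contrast a filter that fails open when it cannot extract a host with one that fails closed. Patched Python releases strip leading whitespace in `urlsplit`, so the naive filter uses its own pattern to keep the behavior independent of the interpreter version.

```python
import re
from urllib.parse import urlsplit

BLOCKED_HOSTS = {"localhost", "127.0.0.1", "169.254.169.254"}  # illustrative only

# Hypothetical naive extractor: expects the scheme at the very first character.
_HOST_RE = re.compile(r"^[a-z][a-z0-9+.-]*://([^/:?#]+)", re.IGNORECASE)


def naive_filter_allows(url: str) -> bool:
    """Blocklist check that fails open when no host can be extracted."""
    match = _HOST_RE.match(url)
    host = match.group(1).lower() if match else None
    # An unparseable URL yields None, which is never on the blocklist.
    return host not in BLOCKED_HOSTS


def strict_filter_allows(url: str) -> bool:
    """Same blocklist, but any irregular input is rejected outright."""
    # Fail closed on whitespace or control characters anywhere in the URL.
    if any(ord(ch) <= 0x20 or ord(ch) == 0x7F for ch in url):
        return False
    try:
        parts = urlsplit(url)
    except ValueError:
        return False
    if parts.scheme not in {"http", "https"} or not parts.hostname:
        return False
    return parts.hostname not in BLOCKED_HOSTS


payload = " http://127.0.0.1/admin"  # note the leading space
print(naive_filter_allows(payload))   # True: the filter never saw a host
print(strict_filter_allows(payload))  # False: rejected before parsing
```

The naive filter is only dangerous if the component that later fetches the URL tolerates the leading space, which is exactly the kind of disagreement the research describes.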
### Novel Contributions
- **The "Overlong Port" Attack:** Demonstrating a parser differential in which the server treats a port as invalid or normalizable (e.g., `:000123:443`) while the browser treats it as valid, leading to cache poisoning.
- **Blind SSRF to LFR Escalation:** A novel method using `@` symbols and `/tmp` paths in PHP cURL fetches to populate the `$_FILES` global, combined with parser confusion to extract that data.
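
The overlong port differential can be sketched in a few lines. Both functions below are hypothetical stand-ins: `server_strip_default_port` approximates the described backend behavior of removing a trailing `:80` or `:443`, and `browser_like_port` approximates a browser reading the remaining port as a decimal number. Neither reproduces the actual PHP backend or any real browser code.

```python
import re


def server_strip_default_port(url: str) -> str:
    """Hypothetical backend step: remove one trailing default port (:80 or :443)."""
    return re.sub(r":(?:80|443)$", "", url)


def browser_like_port(url: str) -> int:
    """Approximate a browser: the port is read as a decimal integer,
    so leading zeros disappear."""
    authority = url.split("://", 1)[1].split("/", 1)[0]
    _host, _sep, port_text = authority.rpartition(":")
    return int(port_text)


sent = "http://example.com:000123:443"
stored = server_strip_default_port(sent)
print(stored)                     # http://example.com:000123
print(browser_like_port(stored))  # 123
```

The server never validated `000123` as a port because it only looked at the trailing `:443`, while the client ends up connecting to port 123.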
## Technical Details
A significant portion of the research focuses on the **PHP Multipart Parser**. In a scenario where an attacker controls a URL passed to a PHP cURL request:
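- The attacker uses the `@/path/to/file` syntax (honored by older versions of PHP's cURL extension, where a POST field value beginning with `@` is treated as a file to upload) to force the local file into a multipart upload.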
- Normally, the file content is trapped in `$_FILES`.
- By injecting a double quote into the name parameter (e.g., `name="username"; filename="...`), the researcher confused the parser into treating the file contents as the value of the `username` field in the `$_POST` global.
- Because the application returned the `username` value in the UI, a blind SSRF was converted into a full file read.
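
To show how quoting in `Content-Disposition` can split two parsers, here is a toy illustration. It does not reproduce PHP's multipart parser or the researcher's payload; the two parsers, the `destination` helper, and the sample header are all assumptions made for the example. One parser splits on every `;`, the other treats a quoted value as running until the next double quote, and they disagree on whether the part has a `filename` (and so whether it would be handled as a file or as an ordinary field).

```python
import re


def naive_params(header: str) -> dict:
    """Toy parser A: split on every ';' and ignore quoting rules."""
    params = {}
    for piece in header.split(";")[1:]:
        key, sep, value = piece.strip().partition("=")
        if sep:
            params[key.lower()] = value.strip('"')
    return params


def quote_aware_params(header: str) -> dict:
    """Toy parser B: a quoted value runs until the next double quote."""
    return {k.lower(): v for k, v in re.findall(r'(\w+)="([^"]*)"', header)}


def destination(params: dict) -> str:
    """Hypothetical routing rule: parts with a filename are files, others are fields."""
    return "files" if "filename" in params else "post"


# A stray double quote makes '; filename=' part of a quoted value for parser B.
header = 'form-data; name="username"; x="; filename="passwd"'

print(destination(naive_params(header)))        # files
print(destination(quote_aware_params(header)))  # post
```

When the component that enforces the "files stay in the file store" rule and the component that populates request fields disagree like this, file contents can surface in a field the application echoes back.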
## Practical Implications
### For Security Practitioners
- **Don't Trust Normalization:** Assume that every "hop" in a network (Proxy -> App -> Database) sees a slightly different version of the data.
- **Test Semantic Equivalents:** When testing an input, try `param`, `param[]`, and extended encodings (UTF-8 binary vs. hex).
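
As a starting point for this kind of testing, a small generator can emit syntactically different spellings of the same URL. The `url_variants` helper below is hypothetical and covers only the variants mentioned in this summary (leading space, zero-padded port, overlong port followed by a default port); a real test list would be much longer.

```python
def url_variants(scheme: str, host: str, port: int, path: str = "/") -> list[str]:
    """Return a few alternative spellings of one URL for differential testing."""
    base = f"{scheme}://{host}:{port}{path}"
    return [
        base,                                       # canonical form
        " " + base,                                 # leading space
        f"{scheme}://{host}:{port:06d}{path}",      # zero-padded port
        f"{scheme}://{host}:{port:06d}:443{path}",  # overlong port plus default port
    ]


for candidate in url_variants("http", "example.com", 123):
    print(repr(candidate))
```

Each variant is sent through every hop under test, and any hop that resolves it differently from the others is worth a closer look.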
### For Defenders
- **Fail Closed:** If a parser encounters a syntax error or an "impossible" value (like a port with leading zeros), it should reject the request entirely rather than attempting to "fix" or normalize it.
- **Single Parser Architecture:** Where possible, use a single, well-defined library for parsing throughout the entire stack.
- **Strict Input Validation:** Use narrow allowlists (e.g., `[a-zA-Z0-9]`) rather than blocklists for filenames and URLs.
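
A minimal sketch of the fail-closed and allowlist recommendations, assuming a Python component that receives a port string and a file name. The function names, the length limits, and the decision to reject leading zeros are choices made for this example, not rules stated in the research.

```python
import re

_PORT_RE = re.compile(r"[1-9][0-9]{0,4}")                     # no leading zeros, no signs
_FILENAME_RE = re.compile(r"[A-Za-z0-9]{1,64}\.[A-Za-z0-9]{1,8}")  # name plus one extension


def parse_port(text: str) -> int:
    """Return the port, or raise instead of trying to repair odd input."""
    if not _PORT_RE.fullmatch(text):
        raise ValueError(f"rejected port: {text!r}")
    port = int(text)
    if port > 65535:
        raise ValueError(f"port out of range: {text!r}")
    return port


def validate_filename(name: str) -> str:
    """Accept only alphanumeric names with a single alphanumeric extension."""
    if not _FILENAME_RE.fullmatch(name):
        raise ValueError(f"rejected file name: {name!r}")
    return name


print(parse_port("443"))                # 443
print(validate_filename("report.pdf"))  # report.pdf
```

Inputs such as `000123`, `443 `, `../etc/passwd`, or a name containing a double quote raise `ValueError` rather than being normalized, so later components never see a "fixed" value that they might read differently.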
## Limitations
- **Environment Specificity:** Many exploits rely on specific versions of libraries (e.g., specific CVEs in Python or older PHP behaviors).
- **Control Requirements:** Some chains require control over specific headers or multiple parameters simultaneously, which may not be present in every application.
## Real-world Applications
- **Bypassing WAFs:** Using obscure Unicode escapes or C trigraphs to hide payloads from signature-based detection.
- **Cache Poisoning:** Exploiting differences in how CDNs and Origin servers calculate cache keys versus how they process the URL port.
## Future Work
- Exploring syntax confusion in newer protocols like HTTP/3 and QUIC.
- Automating the discovery of parser differentials using "differential fuzzing" (comparing the output of two different parsers for the same random input).
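
Differential fuzzing can be sketched with two deliberately small parsers. Here the "lenient" parser is Python's built-in `int()`, which accepts surrounding whitespace, signs, underscores, and leading zeros, and the "strict" parser is a hypothetical digits-only check. The alphabet, sample count, and seed are arbitrary choices for the example; a real harness would compare the actual components in the stack.

```python
import random
import re


def lenient_port(text: str):
    """Parser A: int() tolerates whitespace, signs, underscores, leading zeros."""
    try:
        return int(text)
    except ValueError:
        return None


def strict_port(text: str):
    """Parser B: plain ASCII digits with no leading zeros, otherwise reject."""
    return int(text) if re.fullmatch(r"0|[1-9][0-9]*", text) else None


def disagree(text: str) -> bool:
    """True when the two parsers produce different results for the same input."""
    return lenient_port(text) != strict_port(text)


def fuzz(samples: int = 500, seed: int = 0, alphabet: str = "0123456789 +-_"):
    """Generate random short strings and keep those the parsers disagree on."""
    rng = random.Random(seed)
    found = []
    for _ in range(samples):
        candidate = "".join(rng.choice(alphabet) for _ in range(rng.randint(1, 5)))
        if disagree(candidate):
            found.append(candidate)
    return found


for text in fuzz()[:5]:
    print(repr(text), lenient_port(text), strict_port(text))
```

Every string the loop reports is an input that one parser accepts and the other rejects, which is the raw material for the check-then-use flaws described earlier.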
## References
- CVE-2023-24329 (Python SSRF)
- RFC 7578 (Returning Values from Forms: multipart/form-data)
- Related research: *The Absolute Arbitrary File Access* (via PHP cURL)
- Defanged URL Example: `hXXp://example[.]com:000123:443`