Full Report
The authors of this post are porting a significant amount of networking code in EdgeDB from Python to Rust. While doing this, they have run into a lot of interesting issues, including the one in this post. During the port, they noticed that it always failed on ARM64 CI runners but nowhere else. The CI runner appeared to hang for a while and then stop. Upon logging onto the CI box, they found that the program had actually crashed, but this was not detected by the runner. They found a coredump, which indicated something weird had happened. They loaded the coredump into GDB and saw that the crash was inside the libc getenv() function. The function crashed while loading a byte from the environment variables: it was attempting to load data from an invalid memory location.

Why is libc crashing!? One of their co-workers dropped a line: getenv isn't threadsafe. From the crash dump, it was clear that while one thread was reading the environment variables, another had written to them. In all likelihood, the memory space for the env vars was too small, so it was reallocated to be bigger, while the other code was still reading from the old location. The variables associated with the crash belonged to OpenSSL. Their code used openssl-probe to probe for certificates, which was the offending code. Since they are using a combination of Python and Rust, Rust had no way to know that an unsafe operation was happening.

To fix the bug, they moved from rust-native-tls to rustls. Alternatively, calling try_init_ssl_cert_env_vars from Python would let a global lock prevent this race condition. Looking forward, Rust is marking the environment-setter functions unsafe and glibc has tried making getenv more thread-safe.

Why does this only happen on ARM? The crash is triggered by a call to realloc within setenv. To hit this code path, the environment variables need to line up just right for the realloc to cause issues in getenv(). Given this, they're pretty lucky that they found it at all.
Personally, a really good read. Learning about debugging techniques and interesting bugs is fun!
Analysis Summary
# Research: C stdlib isn't threadsafe and even safe Rust didn't save us
## Metadata
- **Authors**: Matt Mastracci, Michael J. Sullivan
- **Institution**: Gel (formerly EdgeDB)
- **Publication**: Gel Blog
- **Date**: January 22, 2025
## Abstract
This technical analysis details a non-deterministic crash occurring during the porting of EdgeDB’s networking stack from Python to Rust. The investigation reveals a critical thread-safety vulnerability in the C standard library’s environment management functions (`getenv`/`setenv`). Despite Rust’s memory safety guarantees, the interaction between Rust, Python, and shared C libraries (specifically OpenSSL) bypassed these safeguards, leading to a Use-After-Free (UAF) condition triggered by environment variable reallocation on ARM64 Linux systems.
## Research Objective
The research addresses why a seemingly stable Rust/Python hybrid application experienced intermittent, silent crashes and apparent deadlocks specifically on ARM64 CI runners, and investigates the underlying thread-safety of the C standard library in multi-language environments.
## Methodology
### Approach
- **Empirical Debugging**: Utilizing AWS SSM to access uncontainerized ARM64 nodes for real-time process monitoring.
- **Post-mortem Analysis**: Examining core dumps using GDB to identify the exact instruction and memory address causing the crash.
- **Source Code Auditing**: Tracing calls through the `reqwest`, `openssl`, and `openssl-probe` crates.
- **Architectural Analysis**: Comparing memory models (TSO vs. Weakly-ordered) and glibc implementation details across x86_64 and ARM64.
### Dataset/Environment
- **Hardware**: AWS ARM64 (Graviton) CI runners and x86_64 local/CI environments.
- **Software Stack**: Rust (utilizing `reqwest` and `native-tls`), Python (EdgeDB core), and `glibc`.
- **Target OS**: Linux (Ubuntu-based containers).
### Tools & Technologies
- **GDB**: For coredump analysis.
- **Docker**: For environment replication.
- **AWS SSM**: For direct hardware access.
- **Rust Compiler**: Versioning considerations for the 2024 Edition safety changes.
## Key Findings
### Primary Results
1. **`getenv` is Not Thread-Safe**: The C standard library’s environment functions are inherently thread-unsafe. A `setenv` call can trigger a `realloc`, moving the environment array to a new memory location while another thread is reading the old location via `getenv`.
2. **FFI Safety Gap**: Rust’s safety guarantees do not extend to external C libraries. While Rust locks its own environment access, it cannot prevent a C library (like OpenSSL) from calling `getenv` concurrently with a Rust `setenv`.
3. **Platform Sensitivity**: The crash was prevalent on ARM64 due to how `glibc` handles memory alignment and `realloc`. Specific memory patterns (writing `0x220` over a null terminator) created "invalid pointer landmines" unique to the ARM64 execution environment.
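
The gap described in finding 2 can be sketched in a few lines. This is an illustration under stated assumptions, not code from the original post: `APP_ENV_LOCK`, `set_var_locked`, and `get_var_locked` are hypothetical names. The lock serializes environment access for every Rust caller that opts in, but C code that calls `getenv` directly never takes it.

```rust
use std::env;
use std::sync::Mutex;

/// Hypothetical application-level lock around environment access.
static APP_ENV_LOCK: Mutex<()> = Mutex::new(());

/// Sets a variable while holding the application lock.
fn set_var_locked(key: &str, value: &str) {
    let _guard = APP_ENV_LOCK.lock().unwrap();
    // SAFETY (assumption): every environment access in this process goes
    // through APP_ENV_LOCK. A C library calling getenv() directly does not,
    // and that is exactly the gap: the lock cannot protect that caller.
    // The `unsafe` block is required in the Rust 2024 edition; older
    // toolchains may only warn that it is unneeded.
    unsafe { env::set_var(key, value) };
}

/// Reads a variable while holding the same lock.
fn get_var_locked(key: &str) -> Option<String> {
    let _guard = APP_ENV_LOCK.lock().unwrap();
    env::var(key).ok()
}
```

A lock like this only helps when every reader and writer in the process cooperates, which a hybrid Rust, Python, and C process generally cannot guarantee.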
### Supporting Evidence
- **Coredump Analysis**: Pointed to a crash within `libc.so.6` during a byte-load operation from an address that had been previously freed and overwritten with a metadata value (`0x220`).
### Novel Contributions
- Identification of the specific interaction between `openssl-probe` and modern Rust async runtimes that triggers this race condition.
- Detailed explanation of why weak memory models and specific `glibc` heap metadata layouts make this bug more reproducible on ARM64 than x86.
## Technical Details
The crash occurred because `openssl-probe` (used by `native-tls`) sets the environment variables that tell OpenSSL where to find SSL certificates, which ends up in `setenv`, while another thread in the same process may be inside `getenv` at that moment. In `glibc`, if the environment list grows, `realloc` is called and the array can move, so the pointers held by a concurrent `getenv` call become dangling. On ARM64, the "free" slots in memory were being tagged with values that looked like valid pointers but pointed to unmapped or protected memory, causing a SIGSEGV.
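
The growth step can be modeled with a toy table. This is a sketch of the idea only, not glibc's implementation: `ToyEnv` and its methods are hypothetical names, and a `Vec<String>` stands in for the `environ` array. `set` reports whether the append had to grow the backing array, which is the step that may move it to a new address.

```rust
/// Toy stand-in for the process environment: a growable table of
/// "KEY=value" entries. Models the idea only; it is not glibc's code.
struct ToyEnv {
    entries: Vec<String>,
}

impl ToyEnv {
    fn with_capacity(cap: usize) -> Self {
        ToyEnv { entries: Vec::with_capacity(cap) }
    }

    /// Number of entries the backing array can hold before it must grow.
    fn capacity(&self) -> usize {
        self.entries.capacity()
    }

    /// Address of the backing array, as a reader would have observed it.
    fn table_addr(&self) -> usize {
        self.entries.as_ptr() as usize
    }

    /// Like setenv: replaces an existing entry or appends a new one.
    /// Returns true when the append had to grow the backing array.
    fn set(&mut self, key: &str, value: &str) -> bool {
        let prefix = format!("{key}=");
        if let Some(slot) = self.entries.iter_mut().find(|e| e.starts_with(&prefix)) {
            *slot = format!("{key}={value}");
            return false;
        }
        let grew = self.entries.len() == self.entries.capacity();
        self.entries.push(format!("{key}={value}"));
        grew
    }

    /// Like getenv: returns a borrowed view into the table.
    fn get(&self, key: &str) -> Option<&str> {
        let prefix = format!("{key}=");
        self.entries.iter().find_map(|e| e.strip_prefix(prefix.as_str()))
    }
}
```

In safe Rust the borrow checker rejects holding the result of `get` across a call to `set`, because `set` takes `&mut self`. A C caller of `getenv` holds a raw pointer with no such check, so an address observed through something like `table_addr` before the growth may no longer be valid afterwards.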
## Practical Implications
### For Security Practitioners
- **FFI Boundary Safety**: Recognize that "Safe" languages are only as safe as their FFI boundaries. Data shared with C-based libraries remains vulnerable to classic memory corruption.
- **Environment Manipulation**: Modifying environment variables in multi-threaded applications is a high-risk operation.
### For Defenders
- **Library Swapping**: Replacing `native-tls` (which relies on system C OpenSSL) with a pure-language implementation like `rustls` eliminates the dependency on thread-unsafe C library calls.
- **Initialization Control**: Ensure all environment-related probing is performed during a single-threaded "setup" phase before spawning workers.
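
The "setup phase" advice can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `init_environment`, `lookup`, `run_workers`, and `ENV_SNAPSHOT` are hypothetical names. The process environment is mutated only before any worker thread exists, and workers then read an immutable snapshot instead of calling into the environment again.

```rust
use std::collections::HashMap;
use std::env;
use std::sync::OnceLock;
use std::thread;

/// Read-only snapshot of the environment, taken once during startup.
static ENV_SNAPSHOT: OnceLock<HashMap<String, String>> = OnceLock::new();

/// Hypothetical setup step: call it before any other thread exists.
fn init_environment(overrides: &[(&str, &str)]) {
    for (key, value) in overrides {
        // SAFETY (assumption): the process is still single-threaded here,
        // so no other thread can be inside getenv/setenv.
        unsafe { env::set_var(key, value) };
    }
    // Keep only entries that are valid Unicode; skip the rest.
    let snapshot = env::vars_os()
        .filter_map(|(k, v)| Some((k.into_string().ok()?, v.into_string().ok()?)))
        .collect();
    let _ = ENV_SNAPSHOT.set(snapshot);
}

/// Workers read the snapshot and never touch the live environment.
fn lookup(key: &str) -> Option<&'static str> {
    ENV_SNAPSHOT.get()?.get(key).map(String::as_str)
}

/// Spawns `n` workers that each look up `key` in the snapshot.
fn run_workers(key: &'static str, n: usize) -> Vec<Option<String>> {
    let handles: Vec<_> = (0..n)
        .map(|_| thread::spawn(move || lookup(key).map(str::to_owned)))
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("worker panicked"))
        .collect()
}
```

This only helps with code the application controls. A C library that calls `getenv` or `setenv` on its own schedule is unaffected by the snapshot.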
### For Researchers
- **Cross-Language Analysis**: Further study is needed on how language-specific safety runtimes (Rust) interact with "the world" (POSIX/C).
## Limitations
- The bug is highly dependent on the version of `glibc` and the specific memory layout of the environment variables.
- Reproduction is non-deterministic, making it difficult to catch in standard unit testing without high-concurrency CI.
## Comparison to Prior Work
While the thread-unsafety of `setenv` is known in the C community, this research highlights how modern memory-safe languages like Rust provide a false sense of security when they abstract away these low-level C behaviors via dependencies like OpenSSL.
## Real-world Applications
- **Infrastructure Porting**: Critical guidance for teams migrating Python/C++ stacks to Rust.
- **System Hardening**: Validates the move by the Rust project to mark environment setters such as `set_var` as `unsafe` in the 2024 Edition.
## Future Work
- **Rust 2024 Edition**: Implementation of `unsafe` markers for environment manipulation.
- **Glibc Updates**: Monitoring the adoption of recent `glibc` commits that trade memory leaks for thread safety by avoiding `realloc` for environments.
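
The leak-for-safety trade-off mentioned for glibc can be modeled in safe Rust. This is a toy model under stated assumptions, not the glibc change itself: `CURRENT`, `snapshot`, `toy_setenv`, and `toy_getenv` are hypothetical names. Each update builds a new table and deliberately leaks the old one, so a reader that grabbed the old table earlier never sees freed memory.

```rust
use std::sync::RwLock;

type Table = &'static [(&'static str, &'static str)];

/// Current table. Readers copy the `&'static` slice out, then read it
/// without holding the lock.
static CURRENT: RwLock<Table> = RwLock::new(&[]);

/// Returns the table as it exists right now.
fn snapshot() -> Table {
    *CURRENT.read().unwrap()
}

/// Builds a new table with `key` set to `value` and publishes it.
fn toy_setenv(key: &'static str, value: &'static str) {
    let mut current = CURRENT.write().unwrap();
    let mut next: Vec<(&'static str, &'static str)> = current
        .iter()
        .copied()
        .filter(|(k, _)| *k != key)
        .collect();
    next.push((key, value));
    // Intentional leak: the previous table is never freed, so any snapshot
    // taken earlier stays valid for the life of the process.
    *current = Box::leak(next.into_boxed_slice());
}

/// Looks up `key` in a previously taken snapshot.
fn toy_getenv(table: Table, key: &str) -> Option<&'static str> {
    table.iter().find(|(k, _)| *k == key).map(|(_, v)| *v)
}
```

The cost is memory that grows with every update, which is acceptable only when environment changes are rare.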
## References
- [17] [Rust Issue #124866: Mark set_var as unsafe](https://github.com/rust-lang/rust/issues/124866)
- [18] [Glibc commit 7a61e7f: Thread-safe getenv/setenv](https://github.com/bminor/glibc/commit/7a61e7f557a97ab597d6fca5e2d1f13f65685c61)