Full Report
Cloudflare recently had an outage, and this is the post-mortem explaining it. The team initially suspected a hyperscale DDoS attack, but the cause was not a malicious cyber attack of any kind.

Cloudflare routes all requests through its Bot Management infrastructure, which uses machine learning to generate a bot score for every request. Customers control which bots are allowed to access their sites. The machine learning model reads a feature configuration file that is refreshed every few minutes and published to the entire network. A recent change to ClickHouse query behaviour caused the query that builds this file to return duplicate feature rows: the query did not filter by database name, so when a new database with the same name was added, it returned multiple values. That changed the file size and caused some Bot Management modules to error.

Bot Management preallocates memory with a limit of 60 features, but the change pushed usage to about 200 features. The feature-parsing code, written in Rust, contains an unwrap() that panics when there are more than 60 features. Bad! Still, a good post-mortem. Cloudflare claims this is the first time most web traffic has gone down since 2019. It took about 2 hours to fix for most people, which is a pretty great turnaround time!
Analysis Summary
# Incident Report: Cloudflare Bot Management Configuration Outage
## Executive Summary
Cloudflare experienced a significant global outage caused by a software failure within its Bot Management infrastructure rather than a cyber attack. A database query error resulted in an oversized configuration file that exceeded preallocated memory limits, causing the Rust-based parsing engine to panic and crash. The incident resulted in a widespread disruption of web traffic for approximately two hours before service was restored.
## Incident Details
- **Discovery Date:** Not explicitly stated (Initial suspicion of DDoS)
- **Incident Date:** Recent (Post-mortem date)
- **Affected Organization:** Cloudflare
- **Sector:** Technology / Content Delivery Network (CDN) & Security
- **Geography:** Global
## Timeline of Events
### Initial Access
- **Date/Time:** N/A (Internal Configuration Change)
- **Vector:** Internal Database Query Logic Error
- **Details:** After a change to ClickHouse query behavior, the query that builds the feature configuration file began returning duplicate rows because it did not filter by database name. Once a new database with an identical name was added, the query returned multiple values, ballooning the file.
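
The duplication described above can be illustrated with a small in-memory simulation. This is only a sketch: the `ColumnRow` type, the database names `db_a` and `db_b`, and the table name `features` are hypothetical and do not come from the post-mortem. It assumes the same table is visible under two databases and compares a lookup by table name alone with one that also filters by database.

```rust
/// One row of column metadata, as a metadata query might return it.
/// All field values used below are hypothetical.
#[derive(Debug, Clone, PartialEq)]
struct ColumnRow {
    database: String,
    table: String,
    column: String,
}

/// Unfiltered lookup: matches on table name only.
fn columns_for_table(rows: &[ColumnRow], table: &str) -> Vec<String> {
    rows.iter()
        .filter(|r| r.table == table)
        .map(|r| r.column.clone())
        .collect()
}

/// Filtered lookup: matches on both database and table name.
fn columns_for_table_in_db(rows: &[ColumnRow], database: &str, table: &str) -> Vec<String> {
    rows.iter()
        .filter(|r| r.database == database && r.table == table)
        .map(|r| r.column.clone())
        .collect()
}

/// Sample metadata in which the same table appears under two databases.
fn sample_rows() -> Vec<ColumnRow> {
    let mut rows = Vec::new();
    for db in ["db_a", "db_b"] {
        for col in ["feature_1", "feature_2", "feature_3"] {
            rows.push(ColumnRow {
                database: db.to_string(),
                table: "features".to_string(),
                column: col.to_string(),
            });
        }
    }
    rows
}
```

With this sample data, `columns_for_table(&sample_rows(), "features")` returns every column twice, while `columns_for_table_in_db(&sample_rows(), "db_a", "features")` returns each column once.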
### Lateral Movement
- **N/A:** This was a non-malicious system failure. The "movement" in this context refers to the automated distribution of the corrupted configuration file across the entire global network.
### Data Exfiltration/Impact
- **System Failure:** The feature configuration file grew to approximately 200 features, exceeding the 60-feature limit for which the Bot Management module preallocates memory.
### Detection & Response
- **Detection:** Systems began failing as the `unwrap()` function in the Rust code triggered a program panic upon encountering the unexpected number of features.
- **Initial Assessment:** Analysts initially misidentified the surge in errors/dropped traffic as a hyperscale DDoS attack.
- **Resolution:** The root cause was traced to the configuration generation logic; service was restored for most users in about 2 hours.
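
The failure mode in this section can be sketched in a few lines of Rust. This is not Cloudflare's code: the function names, the `FeatureError` type, and the use of a `Vec` as the preallocated buffer are assumptions made for illustration, and `MAX_FEATURES` simply reuses the 60-feature figure from this report. The sketch shows a loader that reports an oversized input as an `Err`, and a wrapper whose `unwrap()` turns that `Err` into a panic.

```rust
/// Limit used for preallocation; the value reuses the figure quoted in
/// this report and is otherwise an assumption.
const MAX_FEATURES: usize = 60;

#[derive(Debug, PartialEq)]
enum FeatureError {
    TooManyFeatures { found: usize, limit: usize },
}

/// Copies feature names into a buffer preallocated for MAX_FEATURES
/// entries, returning an error instead of growing past the limit.
fn load_features(names: &[&str]) -> Result<Vec<String>, FeatureError> {
    if names.len() > MAX_FEATURES {
        return Err(FeatureError::TooManyFeatures {
            found: names.len(),
            limit: MAX_FEATURES,
        });
    }
    let mut buf = Vec::with_capacity(MAX_FEATURES);
    buf.extend(names.iter().map(|n| n.to_string()));
    Ok(buf)
}

/// Mirrors the failure mode: unwrap() converts the Err into a panic,
/// so an oversized file takes down the caller instead of being handled.
fn load_features_or_panic(names: &[&str]) -> Vec<String> {
    load_features(names).unwrap()
}
```

Calling `load_features_or_panic` with 200 names panics, whereas calling `load_features` with the same input gives the caller an error value it can log, reject, or recover from.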
## Attack Methodology
*Note: This incident was a technical failure, not a malicious attack. The following fields reflect the technical "kill chain" of the failure.*
- **Initial Access:** Automated refresh of a feature configuration file via a flawed ClickHouse query.
- **Persistence:** The corrupted file was published to the entire network every few minutes.
- **Impact:** The feature count exceeded the preallocated limit, and the Rust `unwrap()` call panicked instead of handling the error, terminating the affected Bot Management modules.
## Impact Assessment
- **Financial:** Not stated in the post-mortem summary; presumably high given Cloudflare's scale and SLA commitments.
- **Data Breach:** None.
- **Operational:** Massive global business disruption; most web traffic routed through Cloudflare was inaccessible.
- **Reputational:** Significant; by Cloudflare's own account, this was the first time since 2019 that most web traffic through its network went down.
## Indicators of Compromise
- **Network indicators:** Global drop in successful HTTP/HTTPS request processing.
- **File indicators:** Bot Management configuration file exceeding 60 feature rows (~200 features detected).
- **Behavioral indicators:** Rust "panic" logs in Bot Management modules related to feature parsing.
## Response Actions
- **Containment:** Identification that the issue was internal rather than an external DDoS.
- **Eradication:** Rollback/Correction of the ClickHouse query logic to ensure database name filtering and removal of duplicate rows.
- **Recovery:** Global deployment of a corrected configuration file; service was restored for most users in about 2 hours.
## Lessons Learned
- **Implicit Error Handling:** Over-reliance on `unwrap()` in Rust can turn a minor data inconsistency into a catastrophic system crash.
- **Input Validation:** Systems should validate the schema and size of configuration files before global distribution.
- **Query Specificity:** Database queries used for critical infrastructure should always use explicit filters (like database names) to prevent collisions during schema changes.
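
The input-validation lesson above could look roughly like the following check, run before a feature file is published. It is a minimal sketch under stated assumptions: the `validate_feature_file` function and `ValidationError` type are hypothetical, and the feature file is modelled as a plain list of feature names.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
enum ValidationError {
    DuplicateFeature(String),
    TooManyFeatures { found: usize, limit: usize },
}

/// Checks a candidate feature list before it is distributed:
/// rejects duplicate names first, then lists longer than `limit`.
fn validate_feature_file(features: &[String], limit: usize) -> Result<(), ValidationError> {
    let mut seen = HashSet::new();
    for name in features {
        // HashSet::insert returns false when the value was already present.
        if !seen.insert(name.as_str()) {
            return Err(ValidationError::DuplicateFeature(name.clone()));
        }
    }
    if features.len() > limit {
        return Err(ValidationError::TooManyFeatures {
            found: features.len(),
            limit,
        });
    }
    Ok(())
}
```

A publisher would call `validate_feature_file(&candidate, 60)` and distribute the file only on `Ok(())`. Either check alone would have flagged a file made of duplicated rows; keeping both also covers an oversized file with no duplicates.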
## Recommendations
- **Code Audit:** Replace high-risk `unwrap()` calls with proper error handling (e.g., `match` or `if let`) to allow for graceful degradation rather than total crashes.
- **Pre-distribution Testing:** Implement a "canary" deployment for configuration files where they are validated against a staging environment before hitting the production global network.
- **Monitoring:** Add alerts for significant deviations in configuration file sizes or row counts.
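
Two of the recommendations above can be sketched together: handling a bad refresh with `match` so the last good configuration stays active, and flagging a large change in row count. The `ConfigHolder` type, the `refresh` and `row_count_deviates` functions, and the tolerance value are all hypothetical; this is one possible shape, not a description of Cloudflare's remediation.

```rust
/// Holds the feature configuration currently in use (hypothetical type).
struct ConfigHolder {
    active: Vec<String>,
    limit: usize,
}

impl ConfigHolder {
    /// Accepts the candidate only if it fits within the limit.
    fn try_apply(&mut self, candidate: Vec<String>) -> Result<(), String> {
        if candidate.len() > self.limit {
            return Err(format!(
                "{} features exceeds limit of {}",
                candidate.len(),
                self.limit
            ));
        }
        self.active = candidate;
        Ok(())
    }
}

/// Handles a refresh with `match` instead of `unwrap()`: on error the
/// previous configuration stays active and the problem is logged.
fn refresh(holder: &mut ConfigHolder, candidate: Vec<String>) -> bool {
    match holder.try_apply(candidate) {
        Ok(()) => true,
        Err(reason) => {
            eprintln!("config refresh rejected, keeping last good config: {}", reason);
            false
        }
    }
}

/// Returns true when the row count changed by more than `tolerance`,
/// expressed as a fraction of the previous count (0.5 means 50%).
fn row_count_deviates(previous: usize, current: usize, tolerance: f64) -> bool {
    if previous == 0 {
        return current != 0;
    }
    let change = (current as f64 - previous as f64).abs() / previous as f64;
    change > tolerance
}
```

With a tolerance of 0.5, a jump from 60 to 200 rows is flagged, while a change from 60 to 62 is not. The right threshold depends on how much the file normally varies between refreshes.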