Cloudflare outage on November 18, 2025

Full Report

On November 18, 2025, Cloudflare suffered a major outage, and this post-mortem explains what went wrong. The team initially suspected a hyperscale DDoS attack, but the outage was not caused by a cyber attack or malicious activity of any kind.

Cloudflare routes requests through its Bot Management infrastructure, which uses a machine learning model to generate a bot score for every request. Customers use these scores to control which bots are allowed to access their sites. The model takes a feature configuration file as input; this file is refreshed every few minutes and published to the entire network. A recent permissions change in ClickHouse altered the behaviour of the query that generates the file, so it began returning duplicate feature rows. The query did not filter by database name, so when tables with the same name in a second database became visible to it, each column was returned more than once. This increased the file size and caused the Bot Management module to error.

The Bot Management module preallocates memory for its features. It has a limit of 200 features, well above the roughly 60 normally in use, but the duplicated rows pushed the file past that limit. The feature-parsing code, written in Rust, calls unwrap() on the result of that check, so the program panics when the file contains more features than the limit allows. Bad!

A good post-mortem. Cloudflare calls this its worst outage since 2019, the last time most of the traffic through its network stopped flowing. Core traffic was largely restored about three hours after the incident began, which is a pretty great turnaround time!
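
To make the failure mode concrete, here is a minimal Rust sketch of the two problems described above: a metadata lookup that does not filter by database, and a hard feature limit whose error is either unwrapped or handled. This is not Cloudflare's code. Every name in it (ColumnRow, feature_names, load_features, refresh, MAX_FEATURES) is hypothetical, and the row counts are invented so that the duplicated set crosses the limit.

```rust
/// Hypothetical cap on preallocated feature slots (an assumption for this sketch).
const MAX_FEATURES: usize = 200;

/// One row of column metadata, standing in for what the metadata query returns.
#[derive(Debug, Clone)]
struct ColumnRow {
    database: String,
    name: String,
}

#[derive(Debug, PartialEq)]
enum FeatureError {
    TooMany { found: usize, limit: usize },
}

/// Build made-up metadata: the same feature columns visible in each database.
fn sample_rows(databases: &[&str], features_per_db: usize) -> Vec<ColumnRow> {
    databases
        .iter()
        .flat_map(|db| {
            (0..features_per_db).map(move |i| ColumnRow {
                database: db.to_string(),
                name: format!("feature_{i}"),
            })
        })
        .collect()
}

/// Collect feature names. With `database = None` nothing is filtered, so
/// same-named columns from every visible database are all returned.
fn feature_names(rows: &[ColumnRow], database: Option<&str>) -> Vec<String> {
    rows.iter()
        .filter(|row| database.map_or(true, |wanted| row.database == wanted))
        .map(|row| row.name.clone())
        .collect()
}

/// Enforce the limit and report a violation as an error value.
/// Calling `.unwrap()` on this result would panic on an oversized file.
fn load_features(names: Vec<String>) -> Result<Vec<String>, FeatureError> {
    if names.len() > MAX_FEATURES {
        return Err(FeatureError::TooMany {
            found: names.len(),
            limit: MAX_FEATURES,
        });
    }
    Ok(names)
}

/// One possible alternative to panicking: keep the last known good
/// configuration when the candidate file is rejected.
fn refresh(current: Vec<String>, candidate: Vec<String>) -> Vec<String> {
    match load_features(candidate) {
        Ok(features) => features,
        Err(err) => {
            eprintln!("rejecting feature file: {err:?}");
            current
        }
    }
}
```

With sample_rows(&["default", "r0"], 120), the unfiltered lookup returns 240 names and load_features rejects them, while filtering on "default" returns 120 names and passes. Falling back to the previous file in refresh is just one option for handling the error; the point is that an error value gives the caller a choice that unwrap() does not.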

Analysis Summary