Full Report
Cloudflare has confirmed that the massive service outage yesterday was not caused by a security incident and no data has been lost. [...]
Analysis Summary
# Incident Report: Cloudflare Service Outage Due to KV Storage Failure
## Executive Summary
Cloudflare experienced a widespread service outage affecting numerous core services, including Workers, AI, and Realtime, due to a failure in the Workers Key-Value (KV) storage backend. Contrary to initial concerns, the incident was confirmed not to be a security breach, and customer data was reported as safe. Response efforts focused on mitigating the internal storage dependency, leading to plans to migrate KV backends from a third-party provider to Cloudflare's internal R2 storage solution.
## Incident Details
- Discovery Date: Unknown (Implied concurrent with the outage)
- Incident Date: Unknown (Date context refers to a widespread outage that occurred)
- Affected Organization: Cloudflare
- Sector: Content Delivery Network (CDN) / Internet Infrastructure Services
- Geography: Global
## Timeline of Events
### Initial Access
* **Date/Time:** N/A (This was an internal infrastructure failure, not an intrusion)
* **Vector:** Internal infrastructure failure impacting their Workers Key-Value (KV) storage.
* **Details:** A failure in the KV storage layer caused cascading failures across dependent services.
### Lateral Movement
* **Details:** Not applicable; the incident was a failure within the core storage system, causing services relying on that storage to fail or operate incorrectly.
### Data Exfiltration/Impact
* **Details:** No data exfiltration occurred as the incident was an operational failure, not a breach. The impact was service unavailability (up to 100% failure rates for some services).
### Detection & Response
* **How it was discovered:** Monitoring systems detected widespread service degradation and high failure rates across dependent Cloudflare products.
* **Response actions taken:** Cloudflare engineers worked to restore services and immediately prioritized long-term resilience changes to mitigate the dependency on the third-party KV backend.
## Attack Methodology
- Initial Access: Infrastructure failure (No threat actor involved).
- Persistence: N/A
- Privilege Escalation: N/A
- Defense Evasion: N/A
- Credential Access: N/A
- Discovery: N/A
- Lateral Movement: N/A
- Collection: N/A
- Exfiltration: N/A
- Impact: Service failure due to reliance on a single point of failure (KV storage).
## Impact Assessment
- Financial: Not disclosed, but likely significant due to global service disruption.
- Data Breach: Zero customer data breach confirmed.
- Operational: Severe. Services like Workers AI & AutoRAG were completely unavailable. Durable Objects, D1, and Queues suffered up to 22% error rates or complete unavailability. CDN and Workers builds experienced increased latency and 100% failure for new builds.
- Reputational: Significant public attention due to the widespread nature of the outage impacting a critical internet infrastructure provider.
## Indicators of Compromise
- **Network indicators:** N/A (Internal storage failure)
- **File indicators:** N/A
- **Behavioral indicators:** Widespread HTTP 5XX errors across Cloudflare-dependent services.
## Response Actions
- **Containment measures:** Engineers focused on stabilizing the affected components dependent on the failing KV service.
- **Eradication steps:** No malicious code had to be removed; the work consisted of isolating or bypassing the failing storage layer where possible.
- **Recovery actions:** Gradually restoring services by addressing the root cause (KV storage failure) and implementing short-term fixes.
## Lessons Learned
- **Key takeaways:** Over-reliance on a single third-party cloud provider for critical internal functions (like Workers KV backend storage) creates a massive single point of failure.
- **What could have been done better:** There was no immediate internal redundancy for the KV backend to prevent cascading failures across core services.
## Recommendations
- **Prevention measures for similar incidents:** Accelerate the migration of the Workers **KV** central store from the third-party cloud provider to Cloudflare’s own **R2** object storage to eliminate external dependency.
- Implement cross-service safeguards to prevent traffic surges from overwhelming recovering systems during storage outages.
- Develop new tooling to allow for gradual service restoration during storage outages, reducing immediate system re-spikes.