Full Report
Cloudflare has announced that its R2 object storage and dependent services experienced an outage lasting 1 hour and 7 minutes, causing 100% write and 35% read failures globally. [...]
Analysis Summary
# Incident Report: Cloudflare R2 Service Outage Due to Password Rotation Error
## Executive Summary
Cloudflare experienced a significant service outage affecting its R2 storage, Cache Reserve, Images, Stream, and other related services due to a human error during a credential rotation process. The rotation left the R2 Gateway service unable to authenticate with the R2 storage infrastructure, causing 100% write failures and partial read failures across impacted services. Cloudflare contained the issue by identifying the incorrect token usage, and it updated its procedures to prevent recurrence.
## Incident Details
- **Discovery Date:** Not explicitly stated, but occurred during the credential rotation process.
- **Incident Date:** Not explicitly stated, but related to a recent credential rotation event.
- **Affected Organization:** Cloudflare
- **Sector:** Cloud Services / Infrastructure
- **Geography:** Global
## Timeline of Events
### Initial Access
- **Date/Time:** During credential rotation maintenance.
- **Vector:** Human operational error during password/credential rotation.
- **Details:** Cloudflare rotated credentials, and authentication issues began once the deletion of the *previous* credential propagated to the storage infrastructure.
### Lateral Movement
- Not applicable. This was an internal operational failure, not a malicious intrusion.
### Data Exfiltration/Impact
- **Details:** No customer data loss or corruption occurred. The primary impact was service degradation:
  - R2: 100% write failures, 35% read failures (cached objects accessible).
  - Cache Reserve: Increased origin traffic due to failed reads.
  - Images and Stream: All uploads failed; Stream delivery degraded by 94%.
  - Email Security, Vectorize, Log Delivery, Billing, Key Transparency Auditor: Various levels of degradation.
### Detection & Response
- **How it was discovered:** A delay in initial discovery occurred because checks relied on availability metrics instead of explicitly validating which token was being used by the R2 Gateway service to authenticate with R2's storage infrastructure.
- **Response actions taken:** Acknowledged the reliance on outdated metrics, identified the failure in token verification, and implemented remediation steps.
## Attack Methodology
This incident was **not a cyber attack** but an operational failure. Therefore, most standard ATT&CK categories are not applicable.
- **Initial Access:** Operational error during credential management.
- **Persistence:** N/A
- **Privilege Escalation:** N/A
- **Defense Evasion (Internal):** Failure to validate the correct token configuration.
- **Credential Access:** N/A
- **Discovery:** N/A
- **Lateral Movement:** N/A
- **Collection:** N/A
- **Exfiltration:** N/A
- **Impact:** Service unavailability and degradation due to failed authentication calls to storage resources.
## Impact Assessment
- **Financial:** Not specified, but operational downtime costs are implied.
- **Data Breach:** None. No customer data loss or corruption reported.
- **Operational:** Significant service degradation impacting R2, Images, Stream, and associated services across all regions.
- **Reputational:** Potential short-term reputational damage due to high-profile service degradation. (Note: The article also mentioned a prior, separate, 1-hour outage in February caused by human error blocking an entire service instead of a malicious URL.)
## Indicators of Compromise
N/A - This was a configuration/operational error, not a result of malicious threat actor activity.
## Response Actions
- **Containment measures:** Implicitly, stopping the faulty rotation context or quickly reverting/correcting the credentials being presented by the R2 Gateway service.
- **Eradication steps:** Identified the logic failure where the old credential set was still being referenced or where new credentials were not successfully propagated/validated.
- **Recovery actions:** Verified the correct token was being used by the R2 Gateway service to authenticate with storage infrastructure.
## Lessons Learned
- **Key takeaways:** Over-reliance on availability metrics during high-impact operational changes (like credential rotation) is insufficient. Explicit validation of the active credentials/tokens is critical.
- **What could have been done better:** Cloudflare noted they should have explicitly validated *which token* was being used by the R2 Gateway service, rather than relying on standard availability metrics after updating the old credential set.
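
The difference between the two checks described above, as an added example that is not Cloudflare's code: an availability gate stays green while the old token is still valid, whereas a gate on the credential each gateway instance actually presented blocks deletion of the old one. The instance names, token IDs and report fields are hypothetical, and the sample reports are constructed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GatewayReport:
    instance: str        # hypothetical gateway instance name
    credential_id: str   # ID of the token the instance last authenticated with
    auth_ok: bool        # whether that authentication succeeded
    success_rate: float  # availability metric over the same window

def availability_gate(reports, threshold=0.999):
    """Metric-only check: passes as long as requests succeed, whichever token is used."""
    return all(r.success_rate >= threshold for r in reports)

def credential_gate(reports, expected_instances, new_id):
    """Allow deletion of the old credential only when every expected instance
    has authenticated successfully with new_id. Returns (ok, blockers)."""
    seen = {r.instance: r for r in reports}
    blockers = []
    for name in sorted(expected_instances):
        report = seen.get(name)
        if report is None:
            # A silent instance is treated as unverified, not as healthy.
            blockers.append((name, "no report"))
        elif report.credential_id != new_id:
            blockers.append((name, f"still using {report.credential_id}"))
        elif not report.auth_ok:
            blockers.append((name, "new credential failing"))
    return (not blockers, blockers)

# Constructed sample: gw-b still presents the old token, gw-c has not reported.
EXPECTED = {"gw-a", "gw-b", "gw-c"}
REPORTS = [
    GatewayReport("gw-a", "tok-new", True, 1.0),
    GatewayReport("gw-b", "tok-old", True, 1.0),
]

if __name__ == "__main__":
    print("availability gate passes:", availability_gate(REPORTS))
    ok, blockers = credential_gate(REPORTS, EXPECTED, "tok-new")
    print("safe to delete old credential:", ok)
    for name, reason in blockers:
        print(f"  blocked by {name}: {reason}")
```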
## Recommendations
- **Prevention measures for similar incidents:** Implement improved and enhanced credential logging and verification specifically for critical infrastructure authentication processes.
- Mandate the use of automated deployment tooling to minimize human error in high-risk configuration changes.
- Update Standard Operating Procedures (SOPs) to require dual validation (or multi-party approval) for high-impact actions like credential rotation.
- Enhance health checks to ensure faster, root-cause detection when authentication tokens fail.