Full Report
AWS had its worst outage in 10 years. It lasted 16 hours and took down 140 services, including EC2. The author of this post doesn't work at Amazon, but they are an experienced developer who aggregated information from several sources.
DynamoDB had a service failure at about 7am. AWS dogfoods much of its own stuff internally, so when DynamoDB went down, so did everything else that was using it.
So, what was the issue? A race condition in DNS registration. DynamoDB has an automated DNS load balancing system. It has three DNS Enactors (the components that apply DNS rules), one in each of the three availability zones, and they don't coordinate with each other. The Enactor in us-east-1a was running very, very slowly, likely 10x-100x slower than normal. The DNS service gets rid of old plans automatically using a "keep last N" strategy. The DNS plan that the slow Enactor was enacting fell outside the safety window of N, which meant that an active DNS plan was removed! This time-of-check vs. time-of-use issue led to everything going down, and the system didn't have a fallback to fix itself once the DNS broke. There are two issues here: a TOCTOU bug and a missing check for whether a stale plan is still active.
They mention the Swiss Cheese model: the more holes there are in the system, the more likely something is to happen. Much of the time, several things need to go wrong.
The DynamoDB outage itself lasted about 3 hours, but it caused further damage. The EC2 service uses DynamoDB for metadata management, so current instances could keep running but you couldn't create new ones or stop current ones. Once DynamoDB came back up, EC2 still didn't work. The DropletWorkflow Manager (DWFM) holds a large list of active leases (10^6), of which 10^2 are broken, meaning that a connection needs to be reestablished. The number of heartbeat timeouts rose to 10^5, creating a gigantic queue that led to a congestive collapse. This was only fixed by manual intervention.
The author claims that software is much buggier than we realize. If AWS, the giant of the cloud industry, has these types of issues lying around, then there are many more to be discovered. Overall, a good post on the outage.
Analysis Summary
# Incident Report: AWS us-east-1 Regional Outage (October 20th)
## Executive Summary
A latent "Time-of-Check to Time-of-Use" (TOCTOU) race condition in DynamoDB’s automated DNS management system triggered a massive 14-hour outage in the us-east-1 region. The failure of DynamoDB, a foundational service, caused a cascading "congestive collapse" that disabled 140 AWS services, including EC2 and NLB. The incident was only resolved through manual intervention to clear massive internal request queues within the EC2 control plane.
## Incident Details
- **Discovery Date:** October 20th, 06:48 AM UTC
- **Incident Date:** October 20th
- **Affected Organization:** Amazon Web Services (AWS)
- **Sector:** Cloud Computing / Infrastructure as a Service (IaaS)
- **Geography:** us-east-1 Region (North Virginia, USA)
## Timeline of Events
### Initial Access
- **Date/Time:** 06:48 AM UTC
- **Vector:** Internal System Logic Failure (Race Condition)
- **Details:** A DNS Enactor in one Availability Zone (us-east-1a) became extremely slow. Due to a lack of coordination between Enactors, a garbage collection process ("keep last N") deleted an active DNS plan before the slow Enactor could finish using it.
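
The race described above can be reproduced in a few lines. This is a minimal, deterministic sketch and not AWS's actual implementation: `PlanStore`, the generation numbers, the IP addresses, and the exact ordering of steps are all assumptions made for illustration. It shows a slow Enactor choosing a plan (check), newer plans being published while it stalls, the stale plan being applied late (use), and a "keep last N" cleanup then deleting the plan that is now live.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PlanStore:
    """Hypothetical stand-in for the DNS plan store; not AWS's real data model."""
    plans: Dict[int, List[str]] = field(default_factory=dict)  # generation -> IPs
    active_generation: Optional[int] = None

    def publish(self, generation: int, ips: List[str]) -> None:
        self.plans[generation] = ips

    def apply(self, generation: int) -> None:
        # "Use" step: point the live DNS record at this plan.
        self.active_generation = generation

    def cleanup_keep_last_n(self, n: int) -> List[int]:
        # Buggy cleanup (n >= 1): keeps the newest n generations and never
        # asks whether one of the older ones is currently active.
        stale = sorted(self.plans)[:-n]
        for generation in stale:
            del self.plans[generation]
        return stale

    def resolve(self) -> List[str]:
        # What a DNS client would see: the active plan's IPs, or nothing.
        return self.plans.get(self.active_generation, [])


def simulate_race(keep_last: int = 3) -> List[str]:
    store = PlanStore()
    store.publish(1, ["10.0.0.1"])
    # Slow Enactor: the "check" passes here, plan 1 is the newest plan.
    slow_enactor_choice = max(store.plans)
    # While it stalls, faster Enactors publish and apply newer plans.
    for generation in range(2, 6):
        store.publish(generation, [f"10.0.0.{generation}"])
        store.apply(generation)
    # Slow Enactor finally performs the "use" step with its stale choice.
    store.apply(slow_enactor_choice)
    # Cleanup deletes plan 1 because it falls outside the last N.
    store.cleanup_keep_last_n(keep_last)
    return store.resolve()


print(simulate_race())  # prints []
```

With `keep_last=5` nothing is deleted and the stale plan still resolves, which is why the bug stays hidden until one Enactor falls far enough behind.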
### Lateral Movement
- **Mechanism:** Internal Service Dependencies (Dogfooding)
- **Details:** As DynamoDB is a "Layer 1" foundational service, its failure immediately propagated to EC2 (which uses DynamoDB for metadata) and subsequently to the other services that depend on EC2 or DynamoDB directly, 140 services in total.
### Data Exfiltration/Impact
- **Operational Damage:** 140 services degraded or offline.
- **Specifics:** Users could not create new EC2 instances or stop current ones. Revenue loss for AWS is estimated in the eight-figure range.
### Detection & Response
- **Discovery:** Automated monitoring flagged DynamoDB service failure at 06:48 AM UTC.
- **Response Actions:** Engineers identified a congestive collapse in the DropletWorkflow Manager (DWFM). Recovery required manual intervention to manage a backlog of roughly 1,000,000 ($10^6$) active leases and a massive queue of heartbeat timeouts ($10^5$).
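
To see why a large backlog of timed-out work does not drain on its own, here is a toy queue model. It is only a sketch: the function `simulate`, its parameters, and all the numbers are illustrative assumptions, not DWFM's real design or scale. A worker serves `capacity` requests per tick; a request that has waited longer than `timeout` ticks is wasted work, because the client has already given up and will retry.

```python
from collections import deque
from typing import Optional


def simulate(backlog: int, capacity: int, timeout: int, ticks: int,
             admit_limit: Optional[int] = None) -> int:
    """Return how many leases are re-established within `ticks` ticks."""
    waiting = deque(range(backlog))   # leases not yet in the work queue
    queue = deque()                   # entries are (lease_id, tick_enqueued)
    completed = 0
    for tick in range(ticks):
        # Admission: unlimited by default, capped when admit_limit is set.
        while waiting and (admit_limit is None or len(queue) < admit_limit):
            queue.append((waiting.popleft(), tick))
        # Service: the worker handles up to `capacity` requests per tick.
        for _ in range(min(capacity, len(queue))):
            lease, enqueued = queue.popleft()
            if tick - enqueued > timeout:
                waiting.append(lease)  # wasted work; the client retries
            else:
                completed += 1
    return completed


print(simulate(1000, 10, 5, 200))                  # prints 60
print(simulate(1000, 10, 5, 200, admit_limit=60))  # prints 1000
```

Without a cap, every retry joins the back of a queue that takes longer than the timeout to drain, so the worker stays busy yet completes almost nothing. Capping admission at what can be served within the timeout lets the same worker finish everything, which is roughly what manually clearing and throttling the queue achieves.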
## Attack Methodology
*Note: This was a non-adversarial system failure, not an external attack.*
- **Impact:** Systemic collapse via resource exhaustion and logic error.
- **Root Cause:** TOCTOU (Time-of-Check to Time-of-Use) bug in DNS plan enactment.
- **Failure Mode:** "Swiss Cheese Model"—multiple layers of safety (redundancy, independent AZs, garbage collection limits) failed simultaneously due to extreme latency and a lack of cross-zone coordination.
## Impact Assessment
- **Financial:** Estimated eight-figure revenue reduction.
- **Data Breach:** None (Availability failure, not a Confidentiality breach).
- **Operational:** 14-hour total duration; near-total paralysis of the us-east-1 control plane.
- **Reputational:** Significant; impacted the perceived reliability of AWS's oldest and largest region.
## Indicators of Compromise
- **Behavioral Indicators:**
  - DNS queries for `dynamodb.us-east-1.amazonaws[.]com` returning empty records (`[]`).
  - Extreme latency (10x-100x) in regional DNS Enactor processes.
  - Massive spike in `DWFM` (DropletWorkflow Manager) heartbeat timeouts.
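
The first indicator can be checked from the client side. The following is a minimal sketch using only Python's standard library; the helper names `system_resolver` and `endpoint_has_addresses` are hypothetical, and the resolver is injectable so the check can be exercised without network access.

```python
import socket
from typing import Callable, List


def system_resolver(hostname: str) -> List[str]:
    """Resolve through the OS resolver; return [] when no address comes back.

    Note: socket.gaierror also covers other failures (for example NXDOMAIN
    or a resolver timeout), so an empty result here means "no usable
    address", not specifically "empty record set".
    """
    try:
        infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address.
    return sorted({info[4][0] for info in infos})


def endpoint_has_addresses(
    hostname: str,
    resolver: Callable[[str], List[str]] = system_resolver,
) -> bool:
    """True when the endpoint resolves to at least one address."""
    return len(resolver(hostname)) > 0
```

Calling `endpoint_has_addresses("dynamodb.us-east-1.amazonaws.com")` from a monitoring job would have returned `False` during the window in which the record was empty, assuming the local resolver was not serving a cached answer.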
## Response Actions
- **Containment:** Automated DNS repairs failed, necessitating manual oversight.
- **Recovery:** Manual clearing of work queues within the EC2 control plane to resolve the congestive collapse.
- **Verification:** System was monitored until DNS propagation stabilized across all three AZs.
## Lessons Learned
- **The "Dogfooding" Trap:** When a high-level service (DynamoDB) is a dependency of a low-level service (EC2), the resulting circular dependency can prevent self-healing.
- **Race Conditions:** Even highly mature systems are susceptible to TOCTOU bugs if automated cleanup strategies ("keep last N") do not verify the active status of the resource being deleted.
- **Coordination vs. Resilience:** AWS intentionally avoided coordination between DNS Enactors for resilience, but this lack of a "locking" mechanism allowed the race condition to occur.
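
The race-condition lesson amounts to one extra condition in the cleanup logic. As a hedged sketch (the function `plans_to_delete` and its data shapes are assumptions, not AWS's code), a plan is deletable only if it is both outside the newest N generations and not the plan currently being served:

```python
from typing import Dict, List, Optional


def plans_to_delete(
    plans: Dict[int, List[str]],
    active_generation: Optional[int],
    keep_last: int,
) -> List[int]:
    """Return the generations that are safe to delete.

    A plan qualifies only if it is older than the newest `keep_last`
    generations AND it is not the active plan.
    """
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    newest = set(sorted(plans)[-keep_last:])
    return [
        generation
        for generation in sorted(plans)
        if generation not in newest and generation != active_generation
    ]


plans = {g: [f"10.0.0.{g}"] for g in range(1, 6)}
print(plans_to_delete(plans, active_generation=1, keep_last=3))  # prints [2]
print(plans_to_delete(plans, active_generation=5, keep_last=3))  # prints [1, 2]
```

This check alone does not remove the race: if the active plan can change between computing this list and performing the deletes, the same TOCTOU window reopens. The check and the delete would need to happen atomically, for example as a conditional write in the plan store.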
## Recommendations
- **Coordination Mechanisms:** Implement distributed locking or fencing tokens for critical DNS mutations to prevent conflicting operations.
- **State Verification:** Update garbage collection logic to explicitly check if a plan is "Active" rather than relying on a numerical buffer ("Last N").
- **Backpressure Management:** Implement better circuit breakers and admission limits in the DropletWorkflow Manager to prevent congestive collapse when a dependency (like DynamoDB) recovers and backlogged work floods in.
- **Regional Diversification:** Customers should avoid over-reliance on us-east-1 due to its unique complexity and historical outage frequency.
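
The fencing-token recommendation can be illustrated with a small in-process sketch. Everything here is an assumption for illustration: `FencedRecord` and `StaleWriteError` are hypothetical names, the plan generation number doubles as the fencing token, and a `threading.Lock` stands in for what would have to be a conditional write enforced by the real record store.

```python
import threading
from typing import List


class StaleWriteError(Exception):
    """Raised when a writer presents a token older than the last accepted one."""


class FencedRecord:
    """A DNS record that only accepts writes with a strictly newer token."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._generation = 0
        self._ips: List[str] = []

    def write(self, generation: int, ips: List[str]) -> None:
        with self._lock:
            # Compare and update under one lock so the check cannot go stale.
            if generation <= self._generation:
                raise StaleWriteError(
                    f"generation {generation} is not newer than {self._generation}"
                )
            self._generation = generation
            self._ips = list(ips)

    def read(self) -> List[str]:
        with self._lock:
            return list(self._ips)


record = FencedRecord()
record.write(5, ["10.0.0.5"])      # a fast Enactor applies plan 5
try:
    record.write(1, ["10.0.0.1"])  # the slow Enactor arrives late with plan 1
except StaleWriteError:
    pass                           # rejected; plan 5 stays live
print(record.read())               # prints ['10.0.0.5']
```

Because the slow Enactor's write is rejected at the record itself, no coordination between Enactors is needed, which preserves most of the independence that the uncoordinated design was meant to provide.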