Full Report
American cybersecurity company SentinelOne revealed over the weekend that a software flaw triggered a seven-hour-long outage on Thursday. [...]
Analysis Summary
# Incident Report: SentinelOne 7-Hour Platform Outage Due to Software Flaw
## Executive Summary
SentinelOne experienced a 7-hour outage affecting programmatic access to its services, including customer management consoles. The cause was a software flaw in an outgoing control system that was still active during the transition to a new cloud architecture. The flaw caused that control system to overwrite network settings by restoring an empty backup of the AWS Transit Gateway route table, which interrupted services widely while endpoint protection remained active. Response actions focused on restoring the correct configuration state and migrating fully to the new Infrastructure-as-Code (IaC) environment.
## Incident Details
- **Discovery Date:** May 29 (Date of Outage)
- **Incident Date:** May 29 (Occurred during outage window)
- **Affected Organization:** SentinelOne
- **Sector:** Cybersecurity / Managed Detection and Response (MDR)
- **Geography:** Not explicitly disclosed; the incident involved AWS cloud infrastructure.
## Timeline of Events
### Initial Access
- **Date/Time:** Prior to May 29 (Relates to system deployment/migration)
- **Vector:** Internal Software/Configuration Error during cloud migration.
- **Details:** A soon-to-be-deprecated outgoing control system was triggered by the creation of a new account during the ongoing transition to an IaC production system.
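
The failure mode described in this report can be illustrated with a toy model. The sketch below is not SentinelOne's control system: `RouteTable`, `reconcile`, and the attachment names are hypothetical, and the route table is reduced to a plain dictionary. It only shows why a reconciler that trusts an empty backup as its source of truth wipes every live route.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class RouteTable:
    """Toy stand-in for a transit gateway route table: CIDR -> attachment."""

    routes: dict[str, str] = field(default_factory=dict)


def reconcile(live: RouteTable, source_of_truth: RouteTable) -> list[str]:
    """Naively force `live` to match `source_of_truth`.

    Returns the CIDRs that were removed from the live table.
    """
    removed = sorted(set(live.routes) - set(source_of_truth.routes))
    live.routes = dict(source_of_truth.routes)
    return removed


# Live state, as maintained by the (hypothetical) new IaC pipeline.
live = RouteTable({"10.0.0.0/16": "attach-a", "10.1.0.0/16": "attach-b"})

# Assumption for illustration: the legacy system's backup is empty.
stale_backup = RouteTable()

removed = reconcile(live, stale_backup)
print(removed)      # ['10.0.0.0/16', '10.1.0.0/16']
print(live.routes)  # {}
```

With no routes left in the table, traffic between attached networks has nowhere to go, which is consistent with the loss of programmatic access described in this report.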
### Lateral Movement
*Not applicable. This was an infrastructure failure, not a traditional adversary intrusion.*
### Data Exfiltration/Impact
- **Impact:** 7-hour outage resulting in loss of programmatic access to company services. Unified Asset Management/Inventory and Identity services were down. Customers could not view vulnerabilities or access identity consoles. Data ingestion from third-party services and MDR alerts were potentially impacted.
### Detection & Response
- **How it was discovered:** The immediate operational impact (service failure) signaled the issue.
- **Response actions taken:** The team identified that the outgoing control system had wrongly restored an empty AWS Transit Gateway route table. They initiated the process of restoring the correct configuration from a valid backup.
## Attack Methodology
- **Initial Access:** Flaw activated within a control system governing AWS network configurations during a migration phase.
- **Persistence:** N/A (System configuration error)
- **Privilege Escalation:** N/A
- **Defense Evasion:** N/A
- **Credential Access:** N/A
- **Discovery:** N/A
- **Lateral Movement:** N/A
- **Collection:** N/A
- **Exfiltration:** N/A
- **Impact:** Network connectivity failure caused by the overwriting of the AWS Transit Gateway route table.
## Impact Assessment
- **Financial:** Not disclosed.
- **Data Breach:** No indication of external data breach; impact was on service availability.
- **Operational:** Significant interruption (7 hours) to customer access to management portals (Unified Asset Management, Identity Consoles) and backend data processing (MDR alerts). Endpoints remained protected.
- **Reputational:** Required public communication to explain the cause and duration of the outage.
## Indicators of Compromise
- **Network indicators - defanged:** Overwritten/empty AWS Transit Gateway route table configuration.
- **File indicators:** N/A
- **Behavioral indicators:** Control system incorrectly applying configurations based on an outdated source of truth during a state comparison operation.
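
An empty or sharply reduced route table is a signal that can be monitored for. The following is a minimal sketch, assuming route counts are sampled periodically by some external collector; the function name `route_collapse_alerts` and the 50% threshold are illustrative choices, not values from the incident.

```python
from __future__ import annotations

from typing import Sequence


def route_collapse_alerts(
    route_counts: Sequence[int], max_drop_ratio: float = 0.5
) -> list[int]:
    """Return indices of snapshots whose route count fell sharply.

    A snapshot is flagged when it is empty after a non-empty snapshot, or
    when it lost more than `max_drop_ratio` of the previous snapshot's routes.
    """
    alerts = []
    for i in range(1, len(route_counts)):
        prev, curr = route_counts[i - 1], route_counts[i]
        if prev == 0:
            # Already empty: the transition into this state was flagged earlier.
            continue
        drop = (prev - curr) / prev
        if curr == 0 or drop > max_drop_ratio:
            alerts.append(i)
    return alerts


# Hypothetical samples: the table is wiped at index 3 and restored at index 5.
counts = [120, 121, 119, 0, 0, 120]
print(route_collapse_alerts(counts))  # [3]
```

Flagging only the transition into the empty state keeps the alert from repeating on every sample while the table stays empty.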
## Response Actions
- **Containment measures:** The immediate containment involved isolating or disabling the flawed outgoing control system.
- **Eradication steps:** Identifying the erroneous configuration comparison that triggered the overwrite.
- **Recovery actions:** Restoring the correct network settings via restoration of the AWS Transit Gateway route table from a valid backup.
## Lessons Learned
- The primary lesson concerns configuration management during major infrastructure transitions. The "soon-to-be-deprecated (i.e. outgoing) control system" was incorrectly treated as the source of truth, so it overwrote established settings with its outdated configuration state.
- Continued reliance on deprecated control systems during an IaC rollout leaves a path for stale state to overwrite production configuration.
## Recommendations
- Rigorously validate "sources of truth" for network and configuration states before decommissioning legacy control systems.
- Ensure that state validation logic in the new IaC systems compares against the current, active production state, not against outdated reference configurations.
- Accelerate the full transition to the new cloud architecture based on Infrastructure-as-Code principles to eliminate reliance on legacy control mechanisms.
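
The first two recommendations can be sketched as a pre-restore guard. This is a minimal illustration under stated assumptions: routes are modeled as a plain CIDR-to-attachment dictionary, and `validate_restore`, `UnsafeRestoreError`, and the 20% threshold are hypothetical names and values rather than part of any AWS or SentinelOne tooling.

```python
from __future__ import annotations


class UnsafeRestoreError(Exception):
    """Raised when a backup looks too destructive to apply automatically."""


def validate_restore(
    live_routes: dict[str, str],
    backup_routes: dict[str, str],
    max_removed_ratio: float = 0.2,
) -> None:
    """Refuse a restore that would empty or largely wipe the live table.

    Raises UnsafeRestoreError if the backup is empty, or if applying it
    would remove more than `max_removed_ratio` of the live routes.
    """
    if not backup_routes:
        raise UnsafeRestoreError("backup is empty; refusing to restore")
    if live_routes:
        removed = set(live_routes) - set(backup_routes)
        ratio = len(removed) / len(live_routes)
        if ratio > max_removed_ratio:
            raise UnsafeRestoreError(
                f"restore would remove {len(removed)} of "
                f"{len(live_routes)} live routes"
            )


live = {"10.0.0.0/16": "attach-a", "10.1.0.0/16": "attach-b"}

try:
    validate_restore(live, backup_routes={})
except UnsafeRestoreError as err:
    print(f"blocked: {err}")  # blocked: backup is empty; refusing to restore
```

A check like this compares the proposed state against the live production state before anything is written, so a destructive change requires explicit human approval.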