Wiz Research found a data exposure incident on Microsoft’s AI GitHub repository, including over 30,000 internal Microsoft Teams messages – all caused by one misconfigured SAS token
# Incident Report: Accidental Exposure of 38TB of Private Data via Misconfigured Azure SAS Token
## Executive Summary
Microsoft's AI research team accidentally exposed 38 terabytes of private data, including employee workstation backups containing secrets and internal communications, by including an overly permissive, non-expiring Azure Shared Access Signature (SAS) token in a public GitHub repository intended for open-source models. The incident highlights significant risks associated with managing large datasets in cloud environments, particularly within AI/ML workflows, where simple misconfigurations can lead to massive data leaks. The issue was detected and reported by external researchers, leading to rapid token invalidation and subsequent internal investigation.
## Incident Details
- **Discovery Date:** June 22, 2023
- **Incident Date (Initial Exposure):** July 20, 2020 (when the token was first committed)
- **Affected Organization:** Microsoft (AI research division)
- **Sector:** Technology / AI & Cloud Services
- **Geography:** Global (Data hosted on Azure, researchers varied)
## Timeline of Events
### Initial Access (Exposure Creation)
- **Date/Time:** July 20, 2020
- **Vector:** Misconfiguration of an Azure Shared Access Signature (SAS) token used to share data for an open-source AI model repository (`robust-models-transfer`).
- **Details:** An Account SAS token was committed to the public GitHub repository. The token was configured to grant "full control" permissions across the **entire Azure Storage Account**, not just the intended data file. The token's expiry was initially set to October 5, 2021.
### Configuration Escalation
- **Date/Time:** October 6, 2021
- **Vector:** Manual token renewal with increased lifetime.
- **Details:** The SAS token's expiry date was updated to October 6, 2051, effectively creating a non-expiring, full-control credential for the entire storage account.
### Discovery & Reporting
- **Date/Time:** June 22, 2023
- **Vector:** External scanning of the internet for misconfigured storage containers by the Wiz Research Team.
- **Details:** Wiz Research detected the overly permissive SAS link embedded in the Microsoft GitHub repository, revealing 38TB of private data. The issue was reported to MSRC.
### Detection & Response
- **Date/Time:** June 24, 2023
- **Vector:** Security team intervention based on external report.
- **Details:** Microsoft invalidated the compromised SAS token two days after receiving the report.
- **Date/Time:** July 7, 2023
- **Details:** The SAS token in the GitHub repository was replaced.
- **Date/Time:** August 16, 2023
- **Details:** Microsoft completed its internal investigation into the scope of the potential impact.
- **Date/Time:** September 18, 2023
- **Details:** Public disclosure of the incident.
## Attack Methodology
This was not a result of an active attacker compromise, but rather a severe misconfiguration leading to **accidental exposure**. If an external party had discovered the token first, the methods would align with:
- **Initial Access:** Leveraging a publicly exposed, high-privilege, non-expiring external sharing credential (SAS token).
- **Discovery/Collection:** Direct access to the entire storage account contents, including backups and internal communications.
- **Impact:** Potential for arbitrary code execution via model files formatted with `pickle`, if users downloaded and ran AI models that had been maliciously modified in the exposed source beforehand.
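
Added example, not code from the report or from Microsoft: it shows that loading a `pickle` stream can invoke a callable, and an allowlisting loader built on `pickle.Unpickler.find_class` that refuses such a stream. It uses only the standard-library `pickle` module rather than a framework checkpoint loader; the allowlist, the `CallsOnLoad` stand-in and both sample streams are constructed for illustration.

```python
import io
import pickle
from collections import OrderedDict

# Assumed allowlist for this example: globals a weights-only file may import.
ALLOWED_GLOBALS = {("collections", "OrderedDict")}


class AllowlistUnpickler(pickle.Unpickler):
    """Unpickler that refuses any import outside ALLOWED_GLOBALS."""

    def find_class(self, module, name):
        if (module, name) not in ALLOWED_GLOBALS:
            raise pickle.UnpicklingError(f"blocked import of {module}.{name}")
        return super().find_class(module, name)


def load_checked(data: bytes):
    """Return (object, None) on success or (None, reason) when the stream is refused."""
    try:
        return AllowlistUnpickler(io.BytesIO(data)).load(), None
    except pickle.UnpicklingError as exc:
        return None, str(exc)


class CallsOnLoad:
    """Constructed stand-in for a modified artifact: unpickling it calls len("abc")."""

    def __reduce__(self):
        return (len, ("abc",))


# Constructed samples: plain weights, and a stream that invokes a callable when loaded.
WEIGHTS = pickle.dumps(OrderedDict([("layer1.weight", [0.1, 0.2])]))
MODIFIED = pickle.dumps(CallsOnLoad())

if __name__ == "__main__":
    print(pickle.loads(MODIFIED))   # plain pickle.loads runs the embedded call
    print(load_checked(WEIGHTS))    # allowed: only collections.OrderedDict is imported
    print(load_checked(MODIFIED))   # refused before the callable is resolved
```
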
## Impact Assessment
- **Financial:** Not explicitly quantified in the summary, but potential costs related to forensic investigation and remediation.
- **Data Breach:** 38 TB of private data exposed, including:
  - Disk backups of two employee workstations.
  - Secrets, private keys, and passwords for Microsoft services.
  - Over 30,000 internal Microsoft Teams messages from 359 employees.
- **Operational:** Minimal direct operational disruption noted, as the scope was limited to the storage account, though the exposure presented a high internal risk.
- **Reputational:** Negative press regarding internal security posture concerning AI development data handling practices.
## Indicators of Compromise
As the exposure was via a misconfigured link rather than active intrusion, IoCs focus on the credential used:
- **Network/Authentication Indicators:** Presence of traffic or requests against the Azure Storage Account that use the **misconfigured Account SAS token**.
- **File Indicators:** Access/download activity related to files within the storage container named `robustnessws4285631339` (or similar) that were not intended for public distribution.
- **Behavioral Indicators:** Execution of downloaded `.ckpt` (TensorFlow checkpoint) files from the repository by external users if the files had been maliciously altered prior to detection.
## Response Actions
- **Containment:** Invalidating/revoking the overly permissive SAS token immediately upon discovery (June 24, 2023).
- **Eradication:** Replacing the compromised SAS token with a securely configured one (July 7, 2023).
- **Recovery:** Finalizing the internal investigation into the extent of potential compromise/access by internal users (August 16, 2023).
## Lessons Learned
- **SAS Token Risk:** Account SAS tokens configured with "full control" permissions across an entire storage account, especially when set to never expire, are functionally equivalent to an account key and are unsafe for external sharing.
- **AI Data Handling:** The rapid pace of AI development requires additional security scrutiny, especially when handling large volumes of training data, as data scientists may use insecure sharing methods for convenience.
- **Visibility Gap:** Client-side generation of highly permissive, non-expiring SAS tokens creates a critical blind spot where administrators have no visibility or easy revocation mechanism outside of rotating the underlying account key.
- **Supply Chain Risk:** Sharing AI model training elements necessitates security vetting to prevent the inadvertent inclusion of malicious code execution vectors (e.g., using `pickle` format).
## Recommendations
1. **Restrict SAS Usage:** Strongly discourage or prohibit the use of Account SAS tokens for external sharing. Where necessary, use them sparingly with the narrowest possible scope (specific container/file) and strict expiry times.
2. **Mandate Least Privilege:** Ensure all links created for data sharing enforce Read-Only access unless absolutely essential for the task.
3. **Enhance DLP/Monitoring for AI Teams:** Provide security teams with greater visibility and auditing capabilities over the data sharing and storage access mechanisms used by Data Science and AI research teams.
4. **Secure ML Artifacts:** Phase out insecure serialization formats like Python's `pickle` for sharing model artifacts publicly, favoring safer alternatives when possible, especially when sharing outside trusted environments.
5. **Centralized Credential Management:** Implement centralized control mechanisms to govern and log the creation of high-privilege access tokens, ensuring they are stored securely, not directly in source code repositories.
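
One way to check SAS URLs found in source code against recommendations 1, 2 and 5, as an added example (not tooling from the report): it reads the SAS query parameters and flags account-wide scope, permissions beyond read/list, and long lifetimes. The 7-day limit, the function names and the sample URLs with their placeholder account and signature are assumptions.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlsplit

MAX_LIFETIME = timedelta(days=7)   # assumed policy limit, not a value from the report
READ_ONLY = {"r", "l"}             # read and list


def _parse_expiry(value: str) -> datetime:
    """Parse the ISO 8601 UTC time in `se` (date-only values are accepted too)."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def audit_sas_url(url: str, now: datetime) -> list:
    """Return policy findings for one SAS URL; an empty list means none."""
    query = {k: v[0] for k, v in parse_qs(urlsplit(url).query).items()}
    if "sig" not in query:
        return ["no sig parameter: not a SAS URL"]
    findings = []
    # An account SAS carries ss (services) and srt (resource types);
    # a service SAS carries sr (one container or blob) instead.
    if "ss" in query and "srt" in query:
        findings.append("account SAS: scope is the whole storage account")
    extra = set(query.get("sp", "")) - READ_ONLY
    if extra:
        findings.append("permissions beyond read/list: " + "".join(sorted(extra)))
    if "se" in query and _parse_expiry(query["se"]) - now > MAX_LIFETIME:
        findings.append("expiry " + query["se"] + " exceeds the lifetime limit")
    return findings


# Constructed sample URLs (placeholder account name and signature).
BROAD_URL = ("https://exampleaccount.blob.core.windows.net/models/model.ckpt"
             "?sv=2020-08-04&ss=b&srt=sco&sp=rwdlacx&se=2051-10-06T00:00:00Z&sig=REDACTED")
NARROW_URL = ("https://exampleaccount.blob.core.windows.net/models/model.ckpt"
              "?sv=2020-08-04&sr=b&sp=r&se=2023-06-23T00:00:00Z&sig=REDACTED")

if __name__ == "__main__":
    now = datetime(2023, 6, 22, tzinfo=timezone.utc)
    for url in (BROAD_URL, NARROW_URL):
        print(audit_sas_url(url, now))
```
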