Full Report
Data exposed even briefly can live on in generative AI chatbots long after the data is made private.
Analysis Summary
# Vulnerability: Data Persistence in Microsoft Copilot from Temporarily Public GitHub Repositories
## CVE Details
- CVE ID: N/A (This describes a configuration/indexing issue, not a traditional software vulnerability, hence no formal CVE is cited in the source.)
- CVSS Score: Unknown
- CWE: N/A
## Affected Systems
- Products: Microsoft Copilot (via indexing of publicly accessible GitHub content)
- Versions: Any version of Copilot dependent on cached and indexed public data.
- Configurations: GitHub repositories that were public, even briefly (e.g., by mistake), at a time when Microsoft's Bing search engine indexed them.
## Vulnerability Description
Data, including source code and sensitive information, from thousands of GitHub repositories that were momentarily public (before being set back to private or deleted) remains accessible via prompts to Microsoft Copilot. The cause is that Microsoft's Bing search engine indexed and cached the data while it was publicly accessible. When users query Copilot, the model can surface this cached, stale data, effectively leaking information from repositories their owners believed were secured or deleted. Researchers at Lasso identified more than 20,000 such repositories, affecting more than 16,000 organizations.
## Exploitation
- Status: Informational/research finding (no reports of active malicious exploitation in the wild, but the researchers' findings imply a working proof of concept)
- Complexity: Low (Requires knowing the right query to elicit the cached data)
- Attack Vector: Network
## Impact
- Confidentiality: High (Exposure of potentially sensitive source code or internal data from organizations like Google, IBM, PayPal, and Microsoft itself)
- Integrity: Low (Primarily data leakage, not modification)
- Availability: Negligible
## Remediation
### Patches
- Specific official patches for this indexing flaw are not detailed in the provided article. Remediation requires action from Microsoft regarding their indexing/caching mechanisms for code hosting services.
### Workarounds
- **For Data Owners:** Never make sensitive repositories public, even briefly; a short window of exposure is enough for the content to be indexed. Check that third-party tools that consume public data (such as Copilot) purge stale data from their training sets or indices.
- **For Users:** Avoid prompting Copilot with specific queries about the contents of repositories that are believed to be private.
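
As one way to act on the data-owner workaround above, the following minimal sketch flags repositories that are public without being on an approved list. It assumes repository metadata shaped like the objects GitHub's REST API returns, using only the `full_name` and `private` fields; the function name, the allowlist, and the sample data are hypothetical, and fetching the metadata is left out.

```python
from __future__ import annotations

from typing import Iterable


def unexpected_public_repos(repos: Iterable[dict], allowed_public: set[str]) -> list[str]:
    """Return the full names of repositories that are public but not allowlisted."""
    flagged = []
    for repo in repos:
        name = repo["full_name"]
        # A missing "private" field is treated as private, so incomplete
        # metadata does not raise a false alarm (a choice made for this sketch).
        if not repo.get("private", True) and name not in allowed_public:
            flagged.append(name)
    return sorted(flagged)


# Illustrative data only; real input would come from the hosting provider.
sample = [
    {"full_name": "example-org/website", "private": False},
    {"full_name": "example-org/billing-service", "private": False},
    {"full_name": "example-org/infra", "private": True},
]

print(unexpected_public_repos(sample, allowed_public={"example-org/website"}))
# prints ['example-org/billing-service']
```

Running a check like this on a schedule only shortens the exposure window; it does not remove data that has already been indexed.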
## Detection
- **Indicators of Compromise:** Discovery of organizational code snippets or proprietary information in Copilot outputs that should not be there.
- **Detection Methods and Tools:** Proactively querying Copilot with known filenames or unique string literals from internal, recently secured repositories. Lasso's researchers used this method to map exposed data.
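
The probing approach described above can be sketched as follows. This is an illustration under stated assumptions, not the researchers' tooling: `ask` is a hypothetical callable standing in for whatever interface is used to send a prompt to the assistant and get text back (no specific Copilot API is assumed), and the prompt wording, repository name, filenames, and marker strings are all made up.

```python
from __future__ import annotations

from typing import Callable, Iterable


def build_probe_prompts(repo_name: str, filenames: Iterable[str]) -> list[str]:
    """Build one prompt per known filename (the wording is illustrative)."""
    return [
        f"Show the contents of {filename} from the GitHub repository {repo_name}"
        for filename in filenames
    ]


def leaked_markers(response_text: str, markers: Iterable[str]) -> list[str]:
    """Return the unique internal strings that appear verbatim in a response."""
    return [marker for marker in markers if marker in response_text]


def probe(
    ask: Callable[[str], str],
    repo_name: str,
    filenames: Iterable[str],
    markers: Iterable[str],
) -> dict[str, list[str]]:
    """Send each prompt through `ask` and record which markers came back."""
    marker_list = list(markers)
    findings: dict[str, list[str]] = {}
    for prompt in build_probe_prompts(repo_name, filenames):
        hits = leaked_markers(ask(prompt), marker_list)
        if hits:
            findings[prompt] = hits
    return findings


def fake_ask(prompt: str) -> str:
    """Stub for demonstration: 'leaks' a marker only when asked about settings.py."""
    if "settings.py" in prompt:
        return "INTERNAL_BUILD_TAG = 'zx-canary-7781'"
    return "I could not find that file."


results = probe(
    fake_ask,
    "example-org/billing-service",
    ["settings.py", "README.md"],
    ["zx-canary-7781", "acme-internal-only"],
)
print(results)
```

Markers should be strings that are unlikely to occur anywhere else, so that a verbatim match is meaningful evidence rather than coincidence.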
## References
- [TechCrunch Article - Thousands of exposed GitHub repositories, now private, can still be accessed through Copilot](https://techcrunch.com/2025/02/26/thousands-of-exposed-github-repositories-now-private-can-still-be-accessed-through-copilot/)