# Research: Hugging Face Cross-Tenant Access and Infrastructure Takeover Risks
## Metadata
- Authors: Wiz Research Team (Implied via Wiz publication)
- Institution: Wiz Research
- Publication: Wiz Blog/Security Advisory
- Date: April 4, 2024
## Abstract
This research details the discovery and remediation of two critical security risks within Hugging Face's cloud environment, identified through security analysis by Wiz. These vulnerabilities could have allowed an attacker to achieve cross-tenant access, leading to the compromise of other customers' data and enabling potential supply chain attacks targeting users of the Hugging Face platform. The primary attack vectors exploited weaknesses in model deserialization (pickle files), user-submitted container definitions (Dockerfiles), and subsequent misconfigurations in instance metadata services (IMDS) and container registry access control.
## Research Objective
The primary objective was to assess the security posture of Hugging Face’s inference and development infrastructure, specifically targeting risks associated with executing untrusted user-provided code or models, and to demonstrate the potential for lateral movement and cross-tenant data exposure.
## Methodology
### Approach
The research involved active threat modeling and exploitation attempts based on known cloud weaknesses and specific Hugging Face features that allow user-submitted code execution (e.g., model loading, Spaces/Inference API usage).
### Dataset/Environment
The scope was the production environment used by Hugging Face for hosting AI models and running user-defined applications via features like the Inference API and Hugging Face Spaces.
### Tools & Technologies
Standard penetration testing techniques were employed alongside methods tailored for ML environments, including specially crafted AI model files (pickle) and custom Dockerfiles.
## Key Findings
### Primary Results
1. **Shared Inference Infrastructure Takeover Risk:** An attacker running a specially crafted, malicious AI model could escalate privileges to gain unauthorized, cross-tenant access to other customers' model data hosted on shared inference infrastructure.
2. **Shared CI/CD Takeover Risk:** By compromising the build environment via a malicious AI application, an attacker could have executed a supply chain attack against Hugging Face customers consuming artifacts from their shared Continuous Integration/Continuous Deployment (CI/CD) system.
3. **Remote Code Execution (RCE) via Pickle File:** RCE was demonstrated by uploading a model serialized in Python's `pickle` format and crafted to run attacker-controlled code on deserialization, which triggered a reverse shell when the model was loaded via the Inference API.
4. **RCE via Malicious Dockerfile:** RCE was achieved through user-provided `Dockerfile` inputs in Hugging Face Spaces, exploiting the `CMD` instruction to execute payloads upon container startup and potentially the `RUN` instruction during the build process.
5. **Write Access to Centralized Container Registry:** Following initial RCE, the researchers discovered a scope misconfiguration on an internal container registry, which allowed overwriting *any* image stored there, posing a significant supply chain threat.
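The build-time versus startup distinction in finding 4 can be made concrete with a small sketch. The parser below is a hypothetical helper (`executable_instructions` is not part of any Hugging Face or Docker tooling) that lists which Dockerfile instructions run commands and when. It handles only simple line continuations and is meant to illustrate the two execution points, not to serve as a security control, since `RUN` and `CMD` execute arbitrary commands by design.

```python
def executable_instructions(dockerfile_text):
    """Return (phase, instruction, arguments) for instructions that run commands.

    Simplified sketch: joins backslash continuations and skips comments,
    but does not implement the full Dockerfile grammar.
    """
    phases = {"RUN": "build", "CMD": "startup", "ENTRYPOINT": "startup"}
    results = []
    logical = ""
    for raw in dockerfile_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.endswith("\\"):
            # Continuation: keep accumulating the same logical instruction.
            logical += line[:-1].rstrip() + " "
            continue
        logical += line
        keyword, _, rest = logical.partition(" ")
        phase = phases.get(keyword.upper())
        if phase:
            results.append((phase, keyword.upper(), rest.strip()))
        logical = ""
    return results


SAMPLE = """\
FROM python:3.11-slim
# Runs while the image is being built.
RUN pip install \\
    flask
# Runs when the container starts.
CMD ["python", "app.py"]
"""

if __name__ == "__main__":
    for phase, keyword, args in executable_instructions(SAMPLE):
        print(f"{phase:8} {keyword:10} {args}")
```

Because both phases execute whatever the user supplies, the practical control is isolating the build and runtime environments rather than inspecting the Dockerfile.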
### Supporting Evidence
The execution pathways were validated by successfully deploying a reverse shell and subsequently retrieving credentials from the Instance Metadata Service (IMDS) of the host environment. Lateral movement within the AWS EKS environment was confirmed by abusing known misconfigurations.
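One commonly cited hardening step against credential retrieval from IMDS is requiring IMDSv2 session tokens and keeping the response hop limit at 1, so that containers one network hop away from the node cannot obtain a token. The sketch below audits a dictionary shaped like the `MetadataOptions` structure returned by the EC2 `DescribeInstances` API. The function name and the findings text are illustrative assumptions, and nothing here describes the configuration Hugging Face actually ran.

```python
def audit_metadata_options(options):
    """Return a list of findings for one instance's MetadataOptions dict.

    Expects the EC2 field names HttpEndpoint, HttpTokens and
    HttpPutResponseHopLimit; missing fields are treated conservatively.
    """
    findings = []
    if options.get("HttpEndpoint", "enabled") != "enabled":
        # IMDS is switched off entirely, so there is nothing to reach.
        return findings
    if options.get("HttpTokens") != "required":
        findings.append("IMDSv1 is allowed: set HttpTokens to 'required'")
    if options.get("HttpPutResponseHopLimit", 1) > 1:
        findings.append("hop limit above 1: containers may be able to reach IMDS")
    return findings


if __name__ == "__main__":
    # Hypothetical node configuration used only for illustration.
    node = {
        "HttpEndpoint": "enabled",
        "HttpTokens": "optional",
        "HttpPutResponseHopLimit": 2,
    }
    for finding in audit_metadata_options(node):
        print(finding)
```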
### Novel Contributions
The research maps common cloud weaknesses (IMDS abuse, insecure container registry policies) onto the execution contexts that ML platforms expose (model loading, Spaces), demonstrating a path from untrusted model or application code to cross-tenant data compromise.
## Technical Details
**RCE via Pickle:** The researchers cloned a legitimate model, which includes the supporting files the platform needs to run it (such as `config.json`), edited the model so that loading it would execute their code (a reverse shell), and uploaded the result. Interacting with the model via the Inference API triggered the malicious code execution.
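The reason loading a pickle can execute code is that the format lets an object name any importable callable to be invoked during deserialization. The sketch below shows the mechanism with a harmless stand-in: `len` takes the place of whatever callable an attacker would choose. The class and function names are illustrative and do not reproduce the payload used in the research.

```python
import pickle

PAYLOAD_ARG = "code ran during load"


class CraftedObject:
    """Object whose pickle stream calls a function instead of rebuilding itself."""

    def __reduce__(self):
        # Tells pickle: "to rebuild me, call len(PAYLOAD_ARG)".
        # len is a benign stand-in for an attacker-chosen callable.
        return (len, (PAYLOAD_ARG,))


def load_crafted():
    blob = pickle.dumps(CraftedObject())
    # The call happens inside pickle.loads, before the caller can
    # inspect or validate anything about the result.
    return pickle.loads(blob)


if __name__ == "__main__":
    value = load_crafted()
    # The loaded value is the callable's return value, not a CraftedObject.
    print(type(value).__name__, value)
```

The receiving side never needs the `CraftedObject` class: the stream only references the callable and its arguments, which is why a service that simply loads an uploaded model ends up running the uploader's choice of function.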
**Lateral Movement:** Once RCE was achieved, the researchers found themselves inside the Amazon EKS environment and leveraged common misconfigurations to pivot from the model execution environment to sensitive resources, including cross-tenant customer data and the central container registry.
**Container Registry Abuse:** The core vulnerability here was a **scoping misconfiguration**. The internal container registry, used by various users and internal systems, did not properly segment access, allowing the compromised host to push and potentially overwrite base images or dependencies used by other tenants, enabling software supply chain attacks.
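A scoping problem of this kind can be expressed as a simple check: does any credential allow pushing to a repository outside its owner's namespace? The sketch below uses a hypothetical grant model (`Grant`, glob-style repository patterns, and a `tenant/name` repository layout are all assumptions for illustration, not the registry's real policy format).

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Grant:
    tenant: str              # who holds the credential
    actions: frozenset       # e.g. frozenset({"pull", "push"})
    repository_pattern: str  # glob over "tenant/name" repository paths


def allows(grant, action, repository):
    """True if the grant permits the action on the given repository."""
    return action in grant.actions and fnmatchcase(repository, grant.repository_pattern)


def overbroad_push_grants(grants, repositories):
    """Return (tenant, repository) pairs where a tenant can push outside its namespace."""
    violations = []
    for grant in grants:
        for repository in repositories:
            owner = repository.split("/", 1)[0]
            if owner != grant.tenant and allows(grant, "push", repository):
                violations.append((grant.tenant, repository))
    return violations


if __name__ == "__main__":
    grants = [
        Grant("tenant-a", frozenset({"pull", "push"}), "*"),           # unscoped
        Grant("tenant-b", frozenset({"pull", "push"}), "tenant-b/*"),  # scoped
    ]
    repositories = ["tenant-a/app", "tenant-b/app", "shared/base"]
    for tenant, repository in overbroad_push_grants(grants, repositories):
        print(f"{tenant} can overwrite {repository}")
```

In this toy model the unscoped grant is flagged for every repository it does not own, which mirrors the supply chain concern: one compromised build could replace images that other tenants pull.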
## Practical Implications
### For Security Practitioners
This highlights the persistent danger of deserialization attacks, even in modern environments, and the need to treat all user-uploaded content, especially AI models, as inherently untrusted. It also underscores the importance of hardening the underlying cloud infrastructure that supports multi-tenant ML platforms.
### For Defenders
1. **Input Validation:** Treat Python-specific serialization formats such as pickle as executable code rather than data: validate and scan uploaded model files, and move toward formats that do not execute code on load (e.g., ONNX or safetensors).
2. **Container Hardening:** Review and strictly enforce least privilege on container build systems (CI/CD) and runtime environments, ensuring containers cannot access metadata services or internal registries without explicit, scoped IAM roles.
3. **Registry Segmentation:** Immediately audit internal container registries to ensure proper tenant segmentation, preventing one compromised build from affecting others.
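Where pickle input cannot be avoided, one mitigation described in the Python `pickle` documentation is to subclass `pickle.Unpickler` and override `find_class` so that only an explicit allowlist of globals can be resolved. The sketch below follows that pattern. The allowlist contents and the `restricted_loads` name are illustrative, and a real model format would need a much larger, carefully reviewed allowlist.

```python
import builtins
import io
import pickle

# Illustrative allowlist: only these builtins may be referenced by a stream.
SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the stream asks to import a global.
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data):
    """Like pickle.loads, but refuses globals outside the allowlist."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


if __name__ == "__main__":
    print(restricted_loads(pickle.dumps([1, 2, range(3)])))
    try:
        # pickle.dumps(len) stores a reference to builtins.len, which is not allowed.
        restricted_loads(pickle.dumps(len))
    except pickle.UnpicklingError as exc:
        print("blocked:", exc)
```

This narrows what a stream can reach but does not replace isolation: the runtime that loads untrusted models should still be unable to reach metadata services or shared registries.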
### For Researchers
This provides a template for investigating security pitfalls unique to ML platforms, specifically focusing on the convergence of cloud misconfigurations (IMDS, IAM) and ML artifact handling (model loading, training environments).
## Limitations
The summary is based on Wiz's public disclosure. Specific details regarding the exact EKS misconfigurations used for lateral movement are likely omitted for security reasons.
## Comparison to Prior Work
Code execution via pickle deserialization and via user-supplied Dockerfiles are established attack vectors. This research demonstrates them against a large-scale, multi-tenant AI model hosting platform (Hugging Face) and shows how an initial foothold can extend to cross-tenant access in cloud-native ML infrastructure.
## Future Work
Future work should focus on automated detection mechanisms for malicious model payloads before they are deployed to inference clusters and developing standardized, verifiable secure execution environments for untrusted AI artifacts.
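As a starting point for such detection, a pickle stream can be inspected without executing it by walking its opcodes with the standard library's `pickletools.genops`. The sketch below flags opcodes that import globals or invoke callables. The opcode set and the `scan_pickle` name are assumptions chosen for illustration, and many legitimate model files also use these opcodes, so a practical scanner would additionally need an allowlist of expected globals.

```python
import pickle
import pickletools

# Opcodes that import a global or call/instantiate something during load.
IMPORT_OR_CALL_OPCODES = {
    "GLOBAL", "STACK_GLOBAL", "INST", "OBJ", "REDUCE", "NEWOBJ", "NEWOBJ_EX",
}


def scan_pickle(data):
    """Return the names of flagged opcodes found in a pickle byte string.

    The stream is only parsed, never loaded, so no embedded call runs.
    """
    return [
        opcode.name
        for opcode, _arg, _pos in pickletools.genops(data)
        if opcode.name in IMPORT_OR_CALL_OPCODES
    ]


class CallOnLoad:
    """Benign example object that makes the stream call len() on load."""

    def __reduce__(self):
        return (len, ("example",))


if __name__ == "__main__":
    print("plain data:", scan_pickle(pickle.dumps({"weights": [1, 2, 3]})))
    print("crafted:   ", scan_pickle(pickle.dumps(CallOnLoad())))
```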
## References
- Wiz Blog Post: https://www.wiz.io/blog/wiz-and-hugging-face-address-risks-to-ai-infrastructure
- Hugging Face Response: https://huggingface.co/blog/hugging-face-wiz-security-blog
- Related Talk: https://www.wiz.io/crying-out-cloud/croc-talks-helping-secure-hugging-face-hub-special-guest-shir-tamari