Full Report
Uncover real-world indirect prompt injection attacks and learn how adversaries weaponize hidden web content to exploit LLMs for high-impact fraud. The post Fooling AI Agents: Web-Based Indirect Prompt Injection Observed in the Wild appeared first on Unit 42.
Analysis Summary
# Tool/Technique: Indirect Prompt Injection (Web-Based)
## Overview
Indirect Prompt Injection is an adversarial technique where an attacker places malicious instructions within a data source (such as a website, document, or email) that an AI Agent is likely to process. When the Large Language Model (LLM) retrieves this content, the hidden instructions hijack the model’s control flow, forcing it to perform unauthorized actions—such as data exfiltration or fraud—while appearing to follow the user's original request.
## Technical Details
- **Type**: Adversarial Technique / Prompt Injection
- **Platform**: LLM-integrated Applications, AI Agents (e.g., ChatGPT Browsing, Microsoft Copilot, Gemini)
- **Capabilities**: Instruction hijacking, cross-site scripting (XSS) in AI context, data exfiltration, automated fraud.
- **First Seen**: Academic research (early 2023); Unit 42 reported "in the wild" observations in late 2024.
## MITRE ATT&CK Mapping
*Note: MITRE is currently developing the ATLAS framework specifically for AI, but these map to standard ATT&CK concepts:*
- **TA0001 - Initial Access**
- T1566.002 - Phishing: Spearphishing Link (via malicious web content)
- **TA0002 - Execution**
- T1204.003 - User Execution: Malicious File/Content (AI Agent processes the site)
- **TA0010 - Exfiltration**
- T1048 - Exfiltration Over Alternative Protocol (Exfiltration via URL parameters/Markdown images)
- **TA0003 - Persistence**
- T1133 - External Remote Services (Hijacking session logic)
## Functionality
### Core Capabilities
- **Command Overriding**: Uses "jailbreak" style phrases to tell the LLM to ignore previous instructions and follow new ones found on the page.
- **Context Hijacking**: Occupies the "context window" of the AI to ensure the malicious instructions are prioritized.
- **Hidden Text Injection**: Employs zero-font-size text, white-on-white text, or HTML comments to hide instructions from human eyes while keeping them visible to the AI scraper.
### Advanced Features
- **Automated Data Exfiltration**: Instructs the AI agent to summarize sensitive user data and append it to a URL as a query parameter (e.g., sending data to `hxxps[://]attacker[.]com/log?data=[SENSITIVE_INFO]`).
- **Markdown Rendering Exploitation**: Uses Markdown image syntax `` to force the AI to make an automated GET request to an attacker-controlled server without explicit user consent.
- **Multi-Stage Injection**: A "lure" site redirects the AI agent to a second, more complex payload site to evade simple crawlers.
## Indicators of Compromise
- **File Names**: N/A (Web-based)
- **Network Indicators**:
- `hxxps[://]exploit-as-a-service[.]com/` (Defanged example)
- Unusual outbound traffic from AI backend services to known malicious domains.
- Queries containing strings like: `[IGNORE PREVIOUS INSTRUCTIONS]`, `[SYSTEM_UPDATE_PROMPT]`, or `[SECRET_MODE_ENABLED]`.
- **Behavioral Indicators**:
- AI Agent attempts to access local environment variables or personal user history after visiting a specific URL.
- AI Agent generates hidden Markdown images or hyperlinked buttons that point to third-party domains with long, encoded strings in the URL.
## Associated Threat Actors
- **Independent Fraudsters**: Observed using this for automated gift card theft and credential harvesting.
- **Research/Red Teams**: Initially popularized by security researchers (e.g., Greshake et al.).
## Detection Methods
- **Input/Output Filtering**: Use secondary LLMs to "sanitize" or check the output of the primary LLM for commands or data exfiltration attempts.
- **Differential Analysis**: Compare the AI's behavior when processing a "clean" text-only version of a site vs. the full HTML content.
- **LLM Guardrails**: Implement specific software layers (e.g., NeMo Guardrails) designed to detect adversarial shifts in prompt intent.
## Mitigation Strategies
- **Human-in-the-Loop (HITL)**: Require explicit user approval before an AI agent performs an "action" (sending an email, making a purchase, or clicking a link).
- **Content Segregation**: Treat retrieved web content as "Untrusted Data" and use "System Prompts" that explicitly instruct the model to never treat web content as instructions.
- **Resource Constraints**: Limit the AI agent's ability to render Markdown images or perform automated HTTP GET/POST requests to external domains.
## Related Tools/Techniques
- **Direct Prompt Injection** (Jailbreaking)
- **Prompt Leaking** (Extracting system instructions)
- **Adversarial Machine Learning** (Evasion attacks)