Full Report
Posted by the Agentic AI Security Team at Google DeepMindModern AI systems, like Gemini, are more capable than ever, helping retrieve data and perform actions on behalf of users. However, data from external sources present new security challenges if untrusted sources are available to execute instructions on AI systems. Attackers can take advantage of this by hiding malicious instructions in data that are likely to be retrieved by the AI system, to manipulate its behavior. This type of attack is commonly referred to as an "indirect prompt injection," a term first coined by Kai Greshake and the NVIDIA team.To mitigate the risk posed by this class of attacks, we are actively deploying defenses within our AI systems along with measurement and monitoring tools. One of these tools is a robust evaluation framework we have developed to automatically red-team an AI system’s vulnerability to indirect prompt injection attacks. We will take you through our threat model, before describing three attack techniques we have implemented in our evaluation framework.Threat model and evaluation frameworkOur threat model concentrates on an attacker using indirect prompt injection to exfiltrate sensitive information, as illustrated above. The evaluation framework tests this by creating a hypothetical scenario, in which an AI agent can send and retrieve emails on behalf of the user. The agent is presented with a fictitious conversation history in which the user references private information such as their passport or social security number. Each conversation ends with a request by the user to summarize their last email, and the retrieved email in context.The contents of this email are controlled by the attacker, who tries to manipulate the agent into sending the sensitive information in the conversation history to an attacker-controlled email address. The attack is successful if the agent executes the malicious prompt contained in the email, resulting in the unauthorized disclosure of sensitive information. The attack fails if the agent only follows user instructions and provides a simple summary of the email. Automated red-teamingCrafting successful indirect prompt injections requires an iterative process of refinement based on observed responses. To automate this process, we have developed a red-team framework consisting of several optimization-based attacks that generate prompt injections (in the example above this would be different versions of the malicious email). These optimization-based attacks are designed to be as strong as possible; weak attacks do little to inform us of the susceptibility of an AI system to indirect prompt injections.Once these prompt injections have been constructed, we measure the resulting attack success rate on a diverse set of conversation histories. Because the attacker has no prior knowledge of the conversation history, to achieve a high attack success rate the prompt injection must be capable of extracting sensitive user information contained in any potential conversation contained in the prompt, making this a harder task than eliciting generic unaligned responses from the AI system. The attacks in our framework include:Actor Critic: This attack uses an attacker-controlled model to generate suggestions for prompt injections. These are passed to the AI system under attack, which returns a probability score of a successful attack. Based on this probability, the attack model refines the prompt injection. This process repeats until the attack model converges to a successful prompt injection. Beam Search: This attack starts with a naive prompt injection directly requesting that the AI system send an email to the attacker containing the sensitive user information. If the AI system recognizes the request as suspicious and does not comply, the attack adds random tokens to the end of the prompt injection and measures the new probability of the attack succeeding. If the probability increases, these random tokens are kept, otherwise they are removed, and this process repeats until the combination of the prompt injection and random appended tokens result in a successful attack.Tree of Attacks w/ Pruning (TAP): Mehrotra et al. (2024) [3] designed an attack to generate prompts that cause an AI system to violate safety policies (such as generating hate speech). We adapt this attack, making several adjustments to target security violations. Like Actor Critic, this attack searches in the natural language space; however, we assume the attacker cannot access probability scores from the AI system under attack, only the text samples that are generated.We are actively leveraging insights gleaned from these attacks within our automated red-team framework to protect current and future versions of AI systems we develop against indirect prompt injection, providing a measurable way to track security improvements. A single silver bullet defense is not expected to solve this problem entirely. We believe the most promising path to defend against these attacks involves a combination of robust evaluation frameworks leveraging automated red-teaming methods, alongside monitoring, heuristic defenses, and standard security engineering solutions. We would like to thank Vijay Bolina, Sravanti Addepalli, Lihao Liang, and Alex Kaskasoli for their prior contributions to this work.Posted on behalf of the entire Google DeepMind Agentic AI Security team (listed in alphabetical order):Aneesh Pappu, Andreas Terzis, Chongyang Shi, Gena Gibson, Ilia Shumailov, Itay Yona, Jamie Hayes, John "Four" Flynn, Juliette Pluto, Sharon Lin, Shuang Song
Analysis Summary
# Tool/Technique: Prompt Injection Attacks on AI Systems
## Overview
This entry summarizes information derived from a Google Online Security Blog post regarding the estimation of risks associated with "prompt injection attacks" targeting Artificial Intelligence (AI) systems. Prompt injection involves manipulating the input (prompt) to an AI model to cause it to execute unintended actions or reveal sensitive information, bypassing safety guardrails.
## Technical Details
- Type: Technique (Adversarial ML/AI Attack)
- Platform: AI/Large Language Models (LLMs) and related AI systems.
- Capabilities: Causing an AI model to ignore instructions, reveal confidential training data, execute harmful code instructions embedded in prompts, or behave contrary to its programming.
- First Seen: Context suggests this is a contemporary concern related to the evolution of LLMs, referenced in a blog post dated January 29, 2025.
## MITRE ATT&CK Mapping
Since this is an emergent AI-specific attack vector, direct, high-fidelity mappings may not exist in the current stable ATT&CK framework, but related concepts apply:
- **TA0001 - Initial Access** (If the injection leads to system exploitation/access via chained vulnerabilities)
- **T1566 - Phishing** (If the prompt is delivered via a social engineering context)
- **TA0005 - Defense Evasion**
- *Related Concept: Bypassing security controls enforced by the AI system itself.*
- **TA0011 - Command and Control** (If the compromised AI facilitates C2 interaction)
*Note: Organizations are often developing specific frameworks for AI security attacks (e.g., MITRE's ATLAS framework) which would map these concepts more precisely.*
## Functionality
### Core Capabilities
- **Instruction Overriding:** Forcefully overriding system-level instructions given to the LLM by injecting malicious user commands.
- **Information Disclosure:** Tricking models into outputting proprietary data or internal system prompts/instructions used to govern the AI's behavior.
### Advanced Features
- **Goal-Oriented Subversion:** Using injections to achieve specific malicious goals (e.g., generating phishing emails, creating malware code, or initiating actions if the LLM is connected to external tools/APIs).
- **Risk Quantification:** The primary focus of the referenced article is on introducing methodologies (estimation frameworks) to measure and score the potential harm of these injections.
## Indicators of Compromise
As prompt injection is a logical attack based on input manipulation rather than persistent malware execution, traditional IoCs are limited:
- File Hashes: N/A
- File Names: N/A
- Registry Keys: N/A
- Network Indicators: N/A (Unless the injection leads to subsequent C2 beaconing or data exfiltration attempts initiated by the compromised model response).
- Behavioral Indicators: Unintended or policy-violating outputs from the AI model, refusal to adhere to established guardrail prompts, or generation of potentially harmful content (e.g., explicit instructions for system access or exploit code).
## Associated Threat Actors
The article focuses on the *risk* and *estimation* of harm, rather than attributing specific actors. However, prompt injection attacks are generally associated with:
- General Hackers/Red Teams testing AI robustness.
- Nation-State actors looking to leverage powerful models for influence operations or cyber operations.
## Detection Methods
Detection focuses on analyzing the input prompt structure and the model's output sanity:
- **Signature-based detection:** Developing parsers and classifiers specifically designed to identify known adversarial prompt structures, injection keywords, or unusual command sequencing within user input.
- **Behavioral detection:** Monitoring the AI model's response for deviations from expected safety thresholds, sudden changes in output verbosity, or attempts to "jailbreak" the system's initial setup instructions.
- **YARA rules:** Not typically applicable to prompt text, but could potentially be used if injected payloads are saved/logged to disk.
## Mitigation Strategies
Mitigation centers on improving the model's robustness against adversarial input:
- **Prevention measures:** Implementing robust prompt sanitization, using input validation layers (or "prefixing" user input with defensive system instructions), and employing separation of duties for functions processed by the LLM.
- **Hardening recommendations:** Utilizing techniques like instruction tuning, reinforcement learning from human feedback (RLHF) that specifically penalizes responses generated from injected prompts, and ensuring the AI system operates with minimal necessary privileges (Principle of Least Privilege).
## Related Tools/Techniques
- **Adversarial Suffixes/Jailbreaking Prompts:** Specific pre-engineered text designed to bypass safeguards.
- **Indirect Prompt Injection:** Where the malicious instruction is sourced dynamically from external, untrusted data that the LLM processes (e.g., pulling instructions from a website summarized by the LLM).