Full Report
Cybersecurity researchers have shed light on a new jailbreak technique that could be used to get past a large language model's (LLM) safety guardrails and produce potentially harmful or malicious responses. The multi-turn (aka many-shot) attack strategy has been codenamed Bad Likert Judge by Palo Alto Networks Unit 42 researchers Yongzhe Huang, Yang Ji, Wenjun Hu, Jay Chen, Akshata Rao, and
Analysis Summary
# Tool/Technique: Bad Likert Judge
## Overview
"Bad Likert Judge" is a novel, multi-turn (many-shot) jailbreaking technique developed by Palo Alto Networks Unit 42 researchers to circumvent the safety guardrails of Large Language Models (LLMs) and elicit potentially harmful or malicious responses. It achieves this by leveraging the LLM's own functionality as a judge using the Likert scale.
## Technical Details
- Type: Technique (Jailbreak/Prompt Injection)
- Platform: Large Language Models (LLMs) / Generative AI Systems
- Capabilities: Increases attack success rates against LLM safety mechanisms by over 60%; uses a multi-turn conversational approach.
- First Seen: Information suggests research published around January 2025 (based on article date).
## MITRE ATT&CK Mapping
The nature of this technique falls under techniques related to circumventing security controls in AI systems, which currently lack precise, finalized mappings compared to traditional IT systems. It pertains most closely to Evasion and Prompt Injection concepts:
* **T1562 - Impair Defenses (Conceptual Alignment)**
- T1562.007 - **Impair Defenses: AI Defenses** (If directly mapped in future frameworks)
* **T1059 - Command and Scripting Interpreter (Conceptual Alignment for Input Manipulation)**
- T1059.013 - Prompt Injection (Specific to LLMs)
## Functionality
### Core Capabilities
- **Role Assignment:** Instructs the target LLM to assume the role of a judge.
- **Harmfulness Scoring:** Requires the LLM to score the harmfulness of a given response using a Likert scale (a psychometric rating scale measuring agreement/disagreement).
- **Iterative Generation:** Prompts the LLM to generate responses corresponding to different scores on the Likert scale.
### Advanced Features
- **Many-Shot Attack:** It is a multi-turn attack, leveraging the LLM's context window to gradually steer the model away from its intended safety protocols, similar to techniques like Crescendo and Deceptive Delight.
- **Exploitation of Judgment Function:** By framing the malicious request through the lens of a subjective rating task (judging harmfulness), it manipulates the model into producing content associated with higher "scores."
## Indicators of Compromise
- File Hashes: N/A (This is a prompt-based technique, not executable malware)
- File Names: N/A
- Registry Keys: N/A
- Network Indicators: N/A (Relies on network communication with the target LLM API/service)
- Behavioral Indicators: Sequences of prompts that explicitly reference Likert scales, request explicit harmful content scoring, and ask for outputs corresponding to the highest rating scores.
## Associated Threat Actors
- Researchers at Palo Alto Networks Unit 42 discovered and documented this technique. (No known threat actor groups were explicitly mentioned as using it, although it is a threat for future actors to deploy.)
## Detection Methods
- Signature-based detection: Not applicable in the traditional sense. Detection relies on analyzing prompt sequences.
- Behavioral detection: Monitoring for long, multi-turn conversational chains that heavily employ meta-analysis or role-playing focused on subjective rating scales applied to potentially prohibited topics.
- YARA rules: N/A
## Mitigation Strategies
- **Robust Input Validation:** Implementing strong filters on input prompts to detect patterns associated with Likert scale misuse or jailbreak templates.
- **Context Management:** Limiting the effectiveness of multi-turn attacks by carefully managing context history or resetting session context frequently.
- **Model Refinement:** Further fine-tuning LLMs specifically against adversarial prompting techniques like role-reversal and simulated evaluation tasks.
- **Defense-in-Depth for LLMs:** Employing adversarial training and reinforcement learning from human feedback (RLHF) tailored to resist these complex, iterative attacks, rather than relying solely on initial guardrails.
## Related Tools/Techniques
- Prompt Injection (General category)
- Many-Shot Jailbreaking
- Crescendo (Related many-shot technique)
- Deceptive Delight (Related many-shot technique)