Full Report
The jailbreak technique "Bad Likert Judge" manipulates LLMs to generate harmful content using Likert scales, exposing safety gaps in LLM guardrails. The post Bad Likert Judge: A Novel Multi-Turn Technique to Jailbreak LLMs by Misusing Their Evaluation Capability appeared first on Unit 42.
Analysis Summary
# Tool/Technique: Bad Likert Judge
## Overview
The "Bad Likert Judge" technique is a jailbreaking method designed to bypass the safety guardrails implemented in text-generation Large Language Models (LLMs). It prompts the target LLM to act as a judge, scoring the harmfulness of a proposed response using a Likert scale, and then instructing the LLM to generate responses that correspond to the highest (most harmful) scores on this scale, potentially eliciting malicious content.
## Technical Details
- Type: Technique (LLM Jailbreaking/Evasion)
- Platform: Text-generation Large Language Models (LLMs)
- Capabilities: Bypassing LLM safety guardrails to generate harmful content.
- First Seen: Not specified in context, but presented as a novel research finding.
## MITRE ATT&CK Mapping
*Note: As this is a technique targeting AI models, direct standard ATT&CK mappings might be limited. Mappings are inferred based on the goal of bypassing security controls.*
- **TA0001 - Initial Access** (In the context of gaining unauthorized/unintended execution)
- **T1566 - Phishing** (If used to craft highly convincing malicious text)
- **T1566.002 - Spearphishing Link** (If used to craft malicious links within text)
- **TA0005 - Defense Evasion**
- **T1027 - Obfuscated Files or Information** (Indirectly, by obfuscating malicious intent behind a seemingly analytical task)
## Functionality
### Core Capabilities
- **Role Assignment:** Instructing the LLM to adopt the persona of a judge scoring content harmfulness.
- **Scoring Solicitation:** Utilizing the Likert scale (a rating standard) as a mechanism to define harmfulness thresholds.
- **Harmful Content Generation:** Directing the LLM to generate content examples that maximize the assigned harmfulness score.
### Advanced Features
- **Adversarial Prompting:** Successfully increased the attack success rate (ASR) by over 60% compared to plain attack prompts against tested LLMs.
- **Targeted Evasion:** Exploits edge cases in model training/safety mechanisms to elicit restricted responses.
## Indicators of Compromise
- **File Hashes:** N/A (Technique, not static malware)
- **File Names:** N/A
- **Registry Keys:** N/A
- **Network Indicators:** N/A
- **Behavioral Indicators:** Repeated attempts to force an LLM into an adversarial judging/scoring loop to generate unsafe content.
## Associated Threat Actors
- Research finding; no specific threat actor mentioned, but applicable to actors seeking to utilize LLMs for malicious text generation or security testing.
## Detection Methods
- **Signature-based detection:** Difficult, as the input is natural language, but possible for specific phrase/pattern matching related to Likert scale instructions paired with harmful subject matter requests.
- **Behavioral detection:** Monitoring for unusual sequences of prompts where the LLM is asked to score or evaluate the toxicity/harmfulness of outputs preceding the actual generation request.
- **YARA rules:** N/A
## Mitigation Strategies
- **Prevention Measures:** Continuous improvement and fine-tuning of safety layers (guardrails) to specifically recognize and block prompts that instruct the model to adopt scoring/judging roles concerning toxicity.
- **Hardening Recommendations:** Implementing input validation that detects linguistic patterns associated with jailbreaking techniques, especially those involving comparative scoring or simulated evaluations.
## Related Tools/Techniques
- Other LLM jailbreaking techniques (e.g., role-playing, prefix injection, abstraction/encoding attacks).