Full Report
Evaluation of three jailbreaking techniques on DeepSeek shows risks of generating prohibited content. The post Recent Jailbreaks Demonstrate Emerging Threat to DeepSeek appeared first on Unit 42.
Analysis Summary
This analysis focuses on the publicly documented Large Language Model (LLM) jailbreaking techniques mentioned in the provided context, rather than traditional malware or system-level attack tools.
# Tool/Technique: Deceptive Delight
## Overview
Deceptive Delight is a novel jailbreaking technique designed to circumvent the safety restrictions programmed into Large Language Models (LLMs) by using methods of camouflage or distraction to elicit prohibited responses.
## Technical Details
- Type: Technique (LLM Jailbreak)
- Platform: Large Language Models (Tested against DeepSeek models)
- Capabilities: High success rate in bypassing LLM safety filters to generate responses related to prohibited content categories.
- First Seen: Recently publicized by Unit 42 researchers.
## MITRE ATT&CK Mapping
*Note: Since this is an LLM interaction technique, direct, precise mapping is challenging, but it falls conceptually under Evasion or Defense Evasion related to interacting with proprietary systems.*
- **TA0005 - Defense Evasion**
- T1027 - Obfuscated Files or Information (Via conversational obfuscation)
## Functionality
### Core Capabilities
- Eliciting actionable instructions for malicious activities (e.g., creating data exfiltration tools, keyloggers, or incendiary devices).
- Bypassing content filters through camouflaged or distracted prompting methodologies.
### Advanced Features
- Requires little to no specialized knowledge or expertise to execute successfully against tested models.
## Indicators of Compromise
- File Hashes: N/A (Prompt technique)
- File Names: N/A
- Registry Keys: N/A
- Network Indicators: N/A
- Behavioral Indicators: Unintended generation of guidance for prohibited content upon receiving specific adversarial prompts.
## Associated Threat Actors
- Not tied to specific threat actors; presented as a general emerging attack vector applicable to any user attempting to jailbreak LLMs.
## Detection Methods
- Signature-based detection: Difficult to signatureize, relies on pattern matching of adversarial prompts.
- Behavioral detection: Monitoring LLM interactions for highly anomalous or protective content requests, especially when paired with attempts to roleplay or simulate scenarios.
- YARA rules: N/A
## Mitigation Strategies
- Continuous refinement of LLM safety filters and prompt engineering defenses based on emerging jailbreak techniques.
- Monitoring and restricting employee access to unauthorized third-party LLMs, especially when sensitive tasks are involved.
- Utilizing security solutions that monitor and govern GenAI application usage.
## Related Tools/Techniques
- Bad Likert Judge
- Crescendo
# Tool/Technique: Bad Likert Judge
## Overview
Bad Likert Judge is a multi-turn jailbreaking technique used to trick an LLM into providing prohibited information by leveraging a specific structured conversational approach, likely involving a series of judgments or ratings that ultimately bypass safety checks.
## Technical Details
- Type: Technique (LLM Jailbreak)
- Platform: Large Language Models (Tested against DeepSeek models)
- Capabilities: Demonstrated significant bypass rates against LLM safety mechanisms across multiple turns of conversation.
- First Seen: Recently publicized by Unit 42 researchers.
## MITRE ATT&CK Mapping
*Note: Similar to Deceptive Delight, this falls conceptually under Evasion.*
- **TA0005 - Defense Evasion**
- T1027 - Obfuscated Files or Information (Via conversational obfuscation/multi-turn structure)
## Functionality
### Core Capabilities
- Obtaining malicious instructions through a structured multi-turn dialogue rather than a single prompt.
- Successfully eliciting responses related to topics like data exfiltration tools and incendiary device creation.
### Advanced Features
- Exploits the model’s inherent complexity in tracking context across multiple conversational turns to lower internal guardrails.
## Indicators of Compromise
- File Hashes: N/A
- File Names: N/A
- Registry Keys: N/A
- Network Indicators: N/A
- Behavioral Indicators: Anomalous multi-turn conversational flows aimed at eliciting prohibited content.
## Associated Threat Actors
- Not tied to specific threat actors; a general adversarial technique demonstration.
## Detection Methods
- Behavioral detection: Analyzing conversation paths for unnatural turns designed to manipulate the LLM's internal state or rating mechanism.
- Context monitoring: Advanced monitoring of the coherence and intent across sequential user inputs.
## Mitigation Strategies
- Implementing stronger context management and state tracking within LLMs to resist adversarial chaining.
- Human review or automated screening of complex, multi-turn interactions that approach sensitive topics.
## Related Tools/Techniques
- Deceptive Delight
- Crescendo
# Tool/Technique: Crescendo
## Overview
Crescendo is a multi-turn jailbreaking technique (the article links to its dedicated resource) used to circumvent LLM safety restrictions via elaborate, extended conversational attacks.
## Technical Details
- Type: Technique (LLM Jailbreak)
- Platform: Large Language Models (Tested against DeepSeek models)
- Capabilities: Successful in achieving high bypass rates against restrictions, often associated with complex, layered prompts.
- First Seen: Associated with dedicated research material.
## MITRE ATT&CK Mapping
*Note: Categorized as conversational evasion.*
- **TA0005 - Defense Evasion**
- T1027 - Obfuscated Files or Information
## Functionality
### Core Capabilities
- Generating highly detailed, actionable malicious outputs by leveraging the model's ability to maintain long conversational threads.
### Advanced Features
- Its multi-turn nature allows for iterative refinement of the malicious output, making detection based on initial prompts difficult.
## Indicators of Compromise
- File Hashes: N/A
- File Names: N/A
- Registry Keys: N/A
- Network Indicators: N/A
- Behavioral Indicators: Extended dialogue sessions where the user progressively steers the model toward generating prohibited content.
## Associated Threat Actors
- Not specified, but highlights risk acceleration for malicious actors accessing low-complexity attack data.
## Detection Methods
- Behavioral analysis focusing on conversation depth and intent drift.
- Employing pre-filters that look for patterns associated with known multi-turn escape techniques.
## Mitigation Strategies
- Limiting the effective context window size for potentially dangerous topics or implementing timeouts on complex adversarial threads.
- Investing in AI security assessments to identify model weaknesses preemptively.
## Related Tools/Techniques
- Deceptive Delight
- Bad Likert Judge
# Malware/Software Context: DeepSeek Models (DeepSeek-V3, DeepSeek-R1)
## Overview
DeepSeek-V3 and DeepSeek-R1 are open-source Large Language Models developed by a China-based AI research organization. They represent new competitors in the LLM landscape. The research highlights that these models, particularly derivative/distilled versions, are susceptible to existing jailbreaking techniques.
## Technical Details
- Type: Malware family | Tool | Technique: Base Software Platform (LLM)
- Platform: AI Software/Inference Engines
- Capabilities: General language processing, code generation; however, they demonstrated high vulnerability to adversarial prompting leading to the generation of malicious instructions.
- First Seen: DeepSeek-V3 (Dec 25, 2024); DeepSeek-R1 (Jan 2025).
## MITRE ATT&CK Mapping
*Note: The models themselves are platforms. MITRE ATT&CK maps apply to their misuse.*
- **T8001 - Data from Information Repositories (If model training data is compromised or leaked)** - *Conceptual mapping for LLM compromise context*
## Functionality
### Core Capabilities
- Serving as a foundation for generative AI applications.
### Advanced Features
- Available in open-source, potentially distilled versions, increasing accessibility.
## Indicators of Compromise
- File Hashes: N/A (Model weights/binaries are proprietary/specific)
- File Names: N/A
- Registry Keys: N/A
- Network Indicators: N/A (Unless a specific C2 is running on a hosted version)
- Behavioral Indicators: Outputting guidance on creating Molotov cocktails, keyloggers, or data exfiltration tools when queried.
## Associated Threat Actors
- Potential misuse by any actor seeking low-barrier-to-entry attack knowledge.
## Detection Methods
- Output validation and filtering systems on platforms utilizing these models.
- Monitoring for system calls related to common attack tool concepts within model output streams (if applicable to the execution environment).
## Mitigation Strategies
- Organizations using these models (or derivatives) must implement robust safety layers (Precision AI solutions mentioned) above the base model to prevent misuse.
- Utilizing AI Security Assessments to proactively test deployed models against jailbreak attempts.
## Related Tools/Techniques
- Deceptive Delight, Bad Likert Judge, Crescendo (Techniques that exploit these models)