Full Report
Attackers don't need AI to crack passwords; they build targeted wordlists from an organization's own public language. This article explains how tools like CeWL turn websites into high-success password guesses and why complexity rules alone fall short. [...]
Analysis Summary
# Tool/Technique: CeWL (Custom Word List Generator)
## Overview
CeWL is an open-source web crawler used by attackers to harvest unique words and terminology from an organization's public-facing websites. The extracted language is then compiled into highly targeted password wordlists, which significantly increase the success rate of credential guessing attacks compared to generic dictionaries.
## Technical Details
- Type: Tool
- Platform: General purpose (Used via command line on attacking systems, targeting web content)
- Capabilities: Web crawling, data extraction, wordlist generation, configurable crawl depth/minimum word length.
- First Seen: Not specified in the source; CeWL is included by default in common penetration testing distributions.
## MITRE ATT&CK Mapping
Using CeWL is primarily reconnaissance; the wordlists it produces then support credential access.
- **TA0043 - Reconnaissance**
  - T1594 - Search Victim-Owned Websites (crawling the organization's own public sites)
  - T1589 - Gather Victim Identity Information
    - T1589.002 - Email Addresses (if addresses or email formats are harvested)
- **TA0006 - Credential Access** (indirectly, by preparing the lists used for guessing)
  - T1110 - Brute Force (the resulting wordlist is used with this technique)
    - T1110.001 - Password Guessing
    - T1110.002 - Password Cracking (if the wordlist is run against captured hashes)
## Functionality
### Core Capabilities
- **Web Crawling:** Scans specified websites (the target organization's public presence).
- **Word Extraction:** Collects textual content encountered during the crawl.
- **List Compilation:** Organizes extracted terms into a structured wordlist format suitable for password cracking tools.
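
The extraction and compilation steps above can be illustrated with a minimal sketch. This is not CeWL's implementation; it only shows the general idea of pulling visible text out of a page and keeping unique words above a length threshold. The names `TextCollector`, `extract_words`, and `sample_page` are hypothetical, and the page content is invented for the example.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def extract_words(html, min_length=5):
    """Return unique alphabetic words of at least min_length, in first-seen order."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    seen = {}  # dict keys preserve insertion order and de-duplicate
    for word in re.findall(r"[A-Za-z]+", text):
        if len(word) >= min_length:
            seen.setdefault(word, None)
    return list(seen)


# Invented sample page; a real crawl would fetch and follow links instead.
sample_page = """
<html><head><style>body { color: red; }</style></head>
<body><h1>Acme Logistics</h1>
<p>Track your FreightHub shipment with Acme support.</p>
<script>var tracking = "ignored";</script></body></html>
"""
print(extract_words(sample_page))
# → ['Logistics', 'Track', 'FreightHub', 'shipment', 'support']
```

Running the same function over an organization's own public pages gives defenders a quick view of which terms an attacker would harvest first.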
### Advanced Features
- **Targeted Relevance:** Generates lists based specifically on the target's vocabulary (e.g., service names, internal phrasing surfaced publicly, industry-specific terms).
- **Filtering:** Allows attackers to configure crawl depth and set minimum word length thresholds to filter out low-value or noise data.
- **Predictable Transformations:** The harvested terms serve as the base for "predictable transformations" (e.g., leetspeak, appending numbers) to create final password candidates.
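
To make "predictable transformations" concrete, the sketch below expands one base term into a small set of variants. Defenders can use the same expansion to see what a password blocklist needs to cover. The substitution map, the suffix list, and the function name `candidate_variants` are illustrative assumptions, not rules taken from the article or from any specific tool.

```python
# Assumed leetspeak substitutions and suffixes; real rule sets are much larger.
LEET_MAP = str.maketrans({"a": "@", "e": "3", "i": "1", "o": "0", "s": "$"})
SUFFIXES = ["", "1", "123", "!", "2024"]


def candidate_variants(base_word):
    """Return predictable variants of a base word, de-duplicated in order."""
    forms = [base_word.lower(), base_word.capitalize()]
    # Add a leetspeak version of each case form.
    forms += [form.translate(LEET_MAP) for form in list(forms)]
    variants = {}
    for form in forms:
        for suffix in SUFFIXES:
            variants.setdefault(form + suffix, None)
    return list(variants)


variants = candidate_variants("Acme")
print(len(variants))
print(variants[:5])
```

Even this tiny rule set turns a single harvested word into 20 candidates, all of which would satisfy a typical "uppercase, digit, symbol" complexity rule once a suffix is attached.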
## Indicators of Compromise
Because CeWL is a reconnaissance tool rather than malware, it leaves few direct IOCs. The main observable is the *crawling activity* itself, and only if network monitoring captures it.
- File Hashes: N/A (Tool dependent, not malware)
- File Names: N/A (Tool dependent)
- Registry Keys: N/A
- Network Indicators:
  - High volume, rapid HTTP/HTTPS requests directed at a target organization’s web infrastructure from an external source (indicative of active crawling). (Defanged example: `suspicious_crawl_ip:port`)
- Behavioral Indicators:
  - Repeated requests scraping text content rather than standard browsing patterns.
  - Execution of web scraping scripts or tools like `cewl` on attacker-controlled infrastructure.
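
The "high volume, rapid requests" indicator can be approximated with a sliding-window count per client. The sketch below is a minimal illustration over already-parsed log records; the function name `find_burst_clients`, the window size, and the threshold are assumptions to be tuned per site, and the IP addresses come from documentation ranges.

```python
from collections import defaultdict, deque


def find_burst_clients(requests, window_seconds=10, threshold=50):
    """Return client IPs with more than `threshold` requests in any sliding window.

    `requests` is an iterable of (timestamp_seconds, client_ip) tuples,
    sorted by timestamp.
    """
    recent = defaultdict(deque)
    flagged = set()
    for timestamp, client_ip in requests:
        window = recent[client_ip]
        window.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while window and timestamp - window[0] > window_seconds:
            window.popleft()
        if len(window) > threshold:
            flagged.add(client_ip)
    return flagged


# Synthetic data: one client sends 100 requests in 5 seconds, another browses slowly.
crawler = [(i * 0.05, "198.51.100.7") for i in range(100)]
visitor = [(i * 30.0, "203.0.113.20") for i in range(5)]
events = sorted(crawler + visitor)
print(find_burst_clients(events))
# → {'198.51.100.7'}
```

A patient crawler can stay under any fixed threshold, so a rate check like this is best treated as one signal among several rather than a reliable detector.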
## Associated Threat Actors
The article implies this technique is widely used by attackers due to CeWL's inclusion in default penetration testing distributions (Kali Linux, Parrot OS).
- General Threat Actors leveraging low-complexity, high-relevance attack vectors.
## Detection Methods
Detection focuses on identifying the reconnaissance phase before a password guessing attempt occurs.
- Signature-based detection: Rules that flag the `cewl` executable on an endpoint. Finding it on a corporate asset indicates internal use, which is highly suspicious.
- Behavioral detection: Monitoring inbound traffic to public web servers for user agents or request patterns that match known web scrapers, or for excessive, non-human interaction.
- YARA rules: Not applicable for a common Linux utility, but rules could be written if a unique wrapper/script utilizing CeWL were identified.
## Mitigation Strategies
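Mitigation focuses on reducing the public text that can fuel wordlist creation and on rejecting passwords built from it. The latter aligns with NIST SP 800-63B, which recommends screening new passwords against a blocklist that includes context-specific words such as the name of the service.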
- **Content Scrubbing:** Audit and restrict the publication of internal terminology, service names, project names, and non-publicly relevant technical jargon on public-facing websites.
- **Password Policy Enforcement:** Implement strong password policies that actively block terms derived from organizational context, corporate news, or common dictionary words mixed with predictable variants.
- **Web Traffic Monitoring:** Rate-limit and monitor public website access for patterns indicative of automated scraping or scanning activity.
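
The password policy item above can be sketched as a check that rejects passwords containing an organizational term, even after common substitutions. This is a simplified illustration, not a complete policy engine; `is_context_derived`, the reverse substitution map, and the sample terms are hypothetical.

```python
# Assumed reverse mapping for common leetspeak substitutions.
LEET_REVERSE = str.maketrans({"@": "a", "3": "e", "1": "i", "0": "o", "$": "s"})


def normalize(password):
    """Lowercase the password and undo common leetspeak substitutions."""
    return password.lower().translate(LEET_REVERSE)


def is_context_derived(password, org_terms, min_term_length=4):
    """Return True if the normalized password contains any organizational term."""
    core = normalize(password)
    return any(
        len(term) >= min_term_length and term.lower() in core
        for term in org_terms
    )


# Invented terms; in practice these would come from the organization's own sites.
org_terms = ["FreightHub", "Acme"]
for candidate in ["Fr3ightHub2024!", "@cm3123", "violet-tundra-kettle"]:
    print(candidate, is_context_derived(candidate, org_terms))
```

Skipping very short terms (`min_term_length`) is a design choice to limit false rejections; the right cutoff depends on the organization's vocabulary.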
## Related Tools/Techniques
- Generic Password Cracking Tools (e.g., Hashcat, John the Ripper) which consume the wordlists generated by CeWL.
- Other Web Crawlers/Scrapers used for initial information gathering.
- T1110.001 - Password Guessing and T1110.002 - Password Cracking (the techniques ultimately enabled by this preparation).