Full Report
Perplexity is repeatedly modifying their user agent and changing IPs and ASNs to hide their crawling activity, in direct conflict with explicit no-crawl preferences expressed by websites.
Analysis Summary
The report describes scraping and crawling behavior indicative of automated data extraction. It does **not** fit the typical profile of a malicious threat actor engaged in espionage, financial crime, or cyber warfare, which is what threat intelligence analysts usually track.
The description concerns **Perplexity AI's web crawling practices**, not a malicious Advanced Persistent Threat (APT) or a financially motivated cybercriminal group. The traditional fields of threat actor analysis (attribution, malware, and TTPs such as vulnerability exploitation) may therefore not apply directly, or may be absent from the provided context.
The summary below is based *strictly* on the provided context and follows the requested threat intelligence structure:
# Threat Actor: Automated Web Scraper (Perplexity AI)
## Attribution & Identity
The entity observed is **Perplexity AI**. This is not a known malicious threat actor, but a legitimate service engaged in automated web crawling/scraping activity.
## Activity Summary
The entity continues to crawl websites despite explicit no-crawl preferences (e.g., `Disallow` rules in `robots.txt`). It complicates detection by frequently modifying its User Agent and cycling through different IP addresses and Autonomous System Numbers (ASNs).
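To make the no-crawl preference concrete, the following is a minimal sketch of how a `robots.txt` rule is evaluated for a given User Agent, using Python's standard `urllib.robotparser`. The crawler token `ExampleBot`, the rule set, and the URL are hypothetical and are not taken from the report.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is excluded, everyone else is allowed.
# "ExampleBot" is an illustrative token, not a real crawler name from the report.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/article"
    # The declared crawler identity is matched by the Disallow rule.
    print(is_allowed(ROBOTS_TXT, "ExampleBot/1.0", url))
    # A generic browser-style User Agent is not matched by that rule.
    print(is_allowed(ROBOTS_TXT, "Mozilla/5.0 (X11; Linux x86_64)", url))
```

Under these assumptions the rule applies only to a client that identifies itself with the named token. `robots.txt` is advisory, so a changed User Agent string falls outside a crawler-specific rule, which is why the behavior described above is hard to address with `robots.txt` alone.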
## Tactics, Techniques & Procedures
- **Circumvention of Access Controls:** Ignoring explicit directives that restrict automated access.
- **Evasion/Obscurity:** Frequent modification of User Agents.
- **Infrastructure Hopping (Basic Form):** Changing associated IP addresses and ASNs rapidly to evade rate limiting or easy blocking.
- *No specific MITRE ATT&CK IDs are applicable for standard web scraping evasion techniques unless they intersect with recognized malicious operations.*
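
The rotation pattern listed above can be illustrated with a short sketch that groups request log entries by a stable client signature and flags signatures seen with many distinct User Agents and ASNs. The `fingerprint` field, the `Request` record, and the thresholds are assumptions made for illustration; the report does not say how such a signature would be obtained.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Request:
    fingerprint: str  # hypothetical stable client signature (assumption)
    user_agent: str
    asn: int


def find_rotating_clients(
    requests: Iterable[Request],
    min_user_agents: int = 3,
    min_asns: int = 3,
) -> List[str]:
    """Return fingerprints seen with many distinct User Agents and ASNs."""
    user_agents = defaultdict(set)
    asns = defaultdict(set)
    for req in requests:
        user_agents[req.fingerprint].add(req.user_agent)
        asns[req.fingerprint].add(req.asn)
    # Both thresholds must be met; a single changing attribute is not flagged.
    return sorted(
        fp
        for fp in user_agents
        if len(user_agents[fp]) >= min_user_agents and len(asns[fp]) >= min_asns
    )


if __name__ == "__main__":
    # ASNs 64496-64511 are reserved for documentation use.
    log = [
        Request("fp-a", "UA-1", 64496),
        Request("fp-a", "UA-2", 64497),
        Request("fp-a", "UA-3", 64498),
        Request("fp-b", "UA-1", 64496),
        Request("fp-b", "UA-1", 64496),
    ]
    print(find_rotating_clients(log))
```

Requiring both thresholds is a design choice in this sketch: it avoids flagging ordinary clients that change networks while keeping one User Agent.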
## Targeting
- Sectors: Any organization publishing publicly accessible web content.
- Geography: Not specified; any location from which the targeted web servers are reachable.
- Victims: Any website expressing exclusionary preferences regarding automated scraping.
## Tools & Infrastructure
- Malware families used: N/A (Standard crawling/scraping infrastructure).
- Infrastructure (C2, domains, IPs): Unknown, but involves the use of numerous, frequently changing IPs and ASNs.
## Implications
This activity conflicts with the crawling preferences that the affected websites have expressed and may violate their terms of service. Although it is not a destructive cyber threat, persistent, stealthy scraping can place significant load on target servers and result in unauthorized data acquisition (for example, for model training). It can also add noise to conventional security monitoring that is tuned for malicious intrusion indicators.
## Mitigations
- Enforce rate limits based on anomalous traffic patterns rather than IP reputation alone.
- Present CAPTCHA challenges to traffic that exhibits scraper-like behavior.
- Monitor for User Agent variations associated with known legitimate scrapers or known malicious entities.
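
The first mitigation can be sketched as a sliding-window rate limiter whose key is chosen by the caller, so that limits can follow a behavioral signature instead of a source IP. This is an illustrative sketch, not a production design: the class name `SlidingWindowLimiter`, the limits, and the idea of passing a behavioral key are all assumptions.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most max_requests per key within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # key -> timestamps of accepted requests

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        """Record a request for key and report whether it is within the limit."""
        now = time.monotonic() if now is None else now
        hits = self._hits[key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_requests=2, window_seconds=10.0)
    # The key is a hypothetical behavioral signature, not an IP address,
    # so rotating IPs or ASNs does not reset the counter.
    key = "signature-123"
    for t in (0.0, 1.0, 2.0, 10.0):
        print(t, limiter.allow(key, now=t))
```

Keying on something other than the IP address is what ties this sketch to the mitigation: a client that rotates addresses still accumulates hits under the same key, provided the chosen signature stays stable across rotations.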