Phishing For Patterns: What Happens When Agents Explore Domain Data // AREA41 2026

SUMMARY

Over five years, Beni Urech and Patrick Schläpfer built a large dataset for tracking phishing campaigns, with a particular focus on threat actors like TA505 and TA551. Their pipeline pulls certificate transparency logs and zone files into Kafka, transforms the data with Vector, and stores it in a ClickHouse database. The system holds roughly 3.1 billion domains and 50 terabytes of data. They had already developed a custom query language that human analysts use to build detections, but they also needed a way to automate hunting without rewriting rigid Python scripts for every new threat actor.

To automate these hunts, they turned to AI agents. Because language models are stateless and exhaust their context windows quickly, they mapped their existing REST API into a Model Context Protocol (MCP) server, allowing models to query the database via remote tools. During iterative development, they learned that tool descriptions act as the primary user interface for language models. Vague descriptions led to persistent syntax errors in their custom query language, so they had to write explicit, example-heavy instructions. They also switched database outputs from verbose JSON to a dense CSV format, drastically reducing token consumption when handling tabular data.
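The JSON-to-CSV point can be illustrated with a minimal Python sketch. It is not the speakers' implementation: the helper name `rows_to_csv`, the field names, and the sample records are all hypothetical, and character count is used only as a rough stand-in for token count.

```python
import csv
import io
import json


def rows_to_csv(rows):
    """Serialize a list of flat dicts as CSV with a single header row."""
    if not rows:
        return ""
    # Collect column names in first-seen order so the header is stable.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical records; the field names are illustrative only.
rows = [
    {"domain": "example-login.test", "first_seen": "2026-01-03", "registrar": "ExampleReg"},
    {"domain": "example-pay.test", "first_seen": "2026-01-04", "registrar": "ExampleReg"},
]

as_json = json.dumps(rows, indent=2)
as_csv = rows_to_csv(rows)

# Character length is only a rough proxy for how many tokens a model would see.
print(len(as_json), len(as_csv))
```

JSON repeats every key for every record, while CSV states the column names once in the header, so the saving grows with the number of rows.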

The speakers discovered that leaving sequential, rigid workflows up to a model's reasoning engine caused inefficient loops and execution mistakes. To fix this, they introduced hardcoded compound tools that package multi-step API sequences into deterministic actions. This constrained the models, forcing them to focus their reasoning purely on data analysis rather than the mechanical order of operations.
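As a rough sketch of what a compound tool might look like, the Python example below wraps three single-step lookups into one function with a fixed call order. Everything here is an assumption made for illustration: the function names, the `DomainReport` fields, and the in-memory stand-ins for the API are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DomainReport:
    domain: str
    name_servers: List[str]
    content_hash: str
    related_domains: List[str]


def make_investigate_domain(
    lookup_name_servers: Callable[[str], List[str]],
    fetch_content_hash: Callable[[str], str],
    find_by_content_hash: Callable[[str], List[str]],
) -> Callable[[str], DomainReport]:
    """Build one compound tool from three single-step calls (all hypothetical)."""

    def investigate_domain(domain: str) -> DomainReport:
        # The order is fixed in code, so the model never has to plan it.
        name_servers = sorted(lookup_name_servers(domain))
        content_hash = fetch_content_hash(domain)
        related = [d for d in find_by_content_hash(content_hash) if d != domain]
        return DomainReport(domain, name_servers, content_hash, sorted(related))

    return investigate_domain


# In-memory stand-ins for a real API, used only to make the sketch runnable.
NS = {"a.test": ["ns2.example", "ns1.example"], "b.test": ["ns1.example", "ns2.example"]}
HASHES = {"a.test": "h1", "b.test": "h1", "c.test": "h2"}

investigate_domain = make_investigate_domain(
    lookup_name_servers=lambda d: NS.get(d, []),
    fetch_content_hash=lambda d: HASHES[d],
    find_by_content_hash=lambda h: [d for d, v in HASHES.items() if v == h],
)

report = investigate_domain("a.test")
print(report)
```

Exposed as a single tool, a function like this would leave the model with one call to make and one structured result to interpret, rather than three calls to sequence correctly.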

Testing these agents on active phishing campaigns, such as impersonations of Swiss financial apps like Twint and PostFinance, revealed stark differences in model capabilities. Frontier models performed best, though with different strengths: Opus excelled at discovering the broad infrastructure footprint of a campaign cluster, while GPT effectively analyzed page source code to extract developer artifacts like Telegram bot tokens. Local dense models, specifically Qwen, proved surprisingly capable for initial triage and identifying unique clusters. Weaker models struggled with noise discipline, frequently generating false positives by incorrectly assuming every domain sharing a Cloudflare IP address was malicious. To counter this, the engineers implemented pivoting strategies based on specific name server pairings and content hashes.
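To make the noise problem concrete, the following Python sketch contrasts grouping domains by shared IP address with grouping by name server pair plus content hash. It is an illustration under stated assumptions, not the engineers' actual pivoting logic: the records, field names, and the `cluster_by` helper are hypothetical, and the IP address comes from a documentation range.

```python
from collections import defaultdict

# Hypothetical observations; all three domains sit behind the same shared IP.
observations = [
    {"domain": "pay-a.test", "ip": "203.0.113.10", "ns": ("ns1.x.test", "ns2.x.test"), "hash": "aaa"},
    {"domain": "pay-b.test", "ip": "203.0.113.10", "ns": ("ns2.x.test", "ns1.x.test"), "hash": "aaa"},
    {"domain": "blog.test", "ip": "203.0.113.10", "ns": ("ns1.y.test", "ns2.y.test"), "hash": "bbb"},
]


def cluster_by(records, key_func):
    """Group domain names by whatever key the caller chooses to pivot on."""
    clusters = defaultdict(list)
    for record in records:
        clusters[key_func(record)].append(record["domain"])
    return dict(clusters)


# Pivoting on the shared IP pulls the unrelated blog into the cluster.
by_ip = cluster_by(observations, lambda r: r["ip"])

# Pivoting on the name server pair (order-insensitive) plus the content hash
# keeps the two look-alike pages together and leaves the blog on its own.
by_pivot = cluster_by(observations, lambda r: (frozenset(r["ns"]), r["hash"]))

print(by_ip)
print(by_pivot)
```

The name server pair is stored as a `frozenset` so that the same two servers listed in a different order still produce the same key.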

KEY TAKEAWAYS

Converting tool response payloads from JSON to CSV format drastically reduces token consumption when passing tabular database records to a language model.
Custom query languages require detailed, example-heavy tool descriptions to prevent models from generating persistent database syntax errors.
Deterministic tasks and sequential API calls should be packaged into hardcoded compound tools rather than leaving the execution order to the model's reasoning engine.
Frontier models exhibit distinct investigative strengths, with Opus mapping broad campaign infrastructure footprints and GPT excelling at deep analysis of page source code for artifacts like Telegram bot tokens.
Weaker models lack noise discipline and frequently generate false positives by incorrectly labeling all domains on shared hosting providers, such as Cloudflare, as malicious.
Local dense models can function as capable, cost-efficient triage tools for identifying phishing clusters, though human validation remains necessary before initiating takedowns.