. * .  .  *
  *    .    *
     .-~~~-.
  __|_______|__
 (  .  .  .  . )
  '~~~~~~~~~~~~~'
      |   |
   *  .   .  *
  .  *  .    .
0x0B // AREA41::2026
Phishing For Patterns: What Happens When Agents Explore Domain Data
SPEAKER: Beni Urech & Patrick Schläpfer
DURATION: 30:39

Over five years, Beni Urech and Patrick Schläpfer built a large dataset for tracking phishing campaigns, with a focus on threat actors such as TA505 and TA551. Their pipeline feeds certificate transparency logs and zone files into Kafka, transforms the data with Vector, and stores it in a ClickHouse database. The system holds roughly 3.1 billion domains and 50 terabytes of data. They had already developed a custom query language that human analysts use to build detections, but they wanted to automate the hunting process without rewriting rigid Python scripts for each new threat actor.

To automate these hunts, they turned to AI agents. Because language models are stateless and exhaust their context windows quickly, they exposed their existing REST API through a Model Context Protocol (MCP) server, so that models can query the database via remote tools. During iterative development, they learned that tool descriptions act as the primary user interface for language models. Vague prompts led to persistent syntax errors in the custom query language, so they wrote explicit, example-heavy instructions. They also switched database output from verbose JSON to a dense CSV format, which drastically reduced token consumption for tabular data.
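The JSON-to-CSV saving can be illustrated with a minimal Python sketch. The field names, sample rows, and the `rows_to_csv` helper are invented for illustration and are not taken from the talk. Character count is used here only as a rough proxy for token count.

```python
import csv
import io
import json


def rows_to_csv(rows):
    """Serialize a list of flat dicts as CSV with a single header line."""
    if not rows:
        return ""
    fieldnames = list(rows[0].keys())
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical query result; field names and values are made up.
rows = [
    {"domain": "example-login.test", "first_seen": "2026-01-03", "registrar": "ExampleReg"},
    {"domain": "example-pay.test", "first_seen": "2026-01-04", "registrar": "ExampleReg"},
    {"domain": "example-app.test", "first_seen": "2026-01-05", "registrar": "OtherReg"},
]

as_json = json.dumps(rows, indent=2)
as_csv = rows_to_csv(rows)

# JSON repeats every key in every row; CSV states the keys once in the header.
print(len(as_json), len(as_csv))
```

The gap widens with row count, since the per-row key overhead of JSON grows linearly while the CSV header is paid once.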

The speakers discovered that leaving sequential, rigid workflows up to a model's reasoning engine caused inefficient loops and execution mistakes. To fix this, they introduced hardcoded compound tools that package multi-step API sequences into deterministic actions. This constrained the models, forcing them to focus their reasoning purely on data analysis rather than the mechanical order of operations.
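One way to picture a compound tool is the following Python sketch. It is not the speakers' implementation: the three lookup steps, the `DomainReport` type, and the stub data are all assumptions chosen to show the idea of fixing the order of operations in code and returning one combined result to the model.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional


@dataclass
class DomainReport:
    domain: str
    name_servers: List[str] = field(default_factory=list)
    content_hash: Optional[str] = None
    related_domains: List[str] = field(default_factory=list)


def make_investigate_domain(
    lookup_name_servers: Callable[[str], Iterable[str]],
    fetch_content_hash: Callable[[str], Optional[str]],
    find_domains_by_hash: Callable[[str], Iterable[str]],
) -> Callable[[str], DomainReport]:
    """Bundle three single-step lookups into one fixed-order compound tool."""

    def investigate_domain(domain: str) -> DomainReport:
        # Step 1: always resolve name servers first.
        name_servers = sorted(lookup_name_servers(domain))
        # Step 2: then fetch the content hash, which may be missing.
        content_hash = fetch_content_hash(domain)
        # Step 3: pivot on the hash only when step 2 produced one.
        related: List[str] = []
        if content_hash is not None:
            related = sorted(
                d for d in find_domains_by_hash(content_hash) if d != domain
            )
        return DomainReport(domain, name_servers, content_hash, related)

    return investigate_domain


# Stub backends standing in for real API calls; all data is invented.
NS = {"a.test": ["ns2.example", "ns1.example"], "b.test": ["ns1.example"]}
HASHES = {"a.test": "h1", "b.test": "h1"}

investigate_domain = make_investigate_domain(
    lookup_name_servers=lambda d: NS.get(d, []),
    fetch_content_hash=lambda d: HASHES.get(d),
    find_domains_by_hash=lambda h: [d for d, v in HASHES.items() if v == h],
)

report = investigate_domain("a.test")
print(report)
```

With a tool shaped like this, the model makes one call and receives one report, so it has no opportunity to run the steps out of order or loop between them.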

Testing these agents on active phishing campaigns, such as impersonations of Swiss financial apps like Twint and PostFinance, revealed stark differences in model capabilities. Frontier models performed best, though with different strengths: Opus excelled at discovering the broad infrastructure footprint of a campaign cluster, while GPT was effective at analyzing page source code to extract developer artifacts such as Telegram bot tokens. Local dense models, specifically Qwen, proved surprisingly capable at initial triage and at identifying unique clusters. Weaker models struggled with noise discipline and frequently generated false positives by assuming that every domain sharing a Cloudflare IP address was malicious. To counter this, the engineers implemented pivoting strategies based on specific name server pairings and content hashes.
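The difference between pivoting on a shared IP and pivoting on name server pairs plus content hashes can be sketched in Python as follows. The record layout, the sample data, and the choice to require both signals to match are assumptions made for illustration; the talk summary does not specify how the two pivots are combined.

```python
from collections import defaultdict


def cluster_by_pivot(records):
    """Group domains by (sorted name server pair, content hash)."""
    clusters = defaultdict(set)
    for record in records:
        # Sort so that the order in which name servers are listed is irrelevant.
        ns_pair = tuple(sorted(record["name_servers"]))
        clusters[(ns_pair, record["content_hash"])].add(record["domain"])
    return dict(clusters)


def expand_from_seed(records, seed_domain):
    """Return the cluster containing the seed, or an empty set if unknown."""
    for domains in cluster_by_pivot(records).values():
        if seed_domain in domains:
            return domains
    return set()


# Invented records: all three domains sit behind the same shared IP.
records = [
    {"domain": "pay-login.test", "ip": "203.0.113.10",
     "name_servers": ["ns1.host.test", "ns2.host.test"], "content_hash": "aaa"},
    {"domain": "pay-verify.test", "ip": "203.0.113.10",
     "name_servers": ["ns2.host.test", "ns1.host.test"], "content_hash": "aaa"},
    {"domain": "bakery.test", "ip": "203.0.113.10",
     "name_servers": ["ns3.other.test", "ns4.other.test"], "content_hash": "bbb"},
]

# Naive pivot: everything on the shared IP, including the unrelated site.
same_ip = {r["domain"] for r in records if r["ip"] == "203.0.113.10"}
# Stricter pivot: same name server pair and same content hash as the seed.
pivoted = expand_from_seed(records, "pay-login.test")

print(sorted(same_ip))
print(sorted(pivoted))
```

In this toy data the IP pivot sweeps in the unrelated `bakery.test`, while the stricter pivot keeps only the two domains that share both infrastructure and page content with the seed.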

// This summary was generated by AI. AI can make mistakes. If in doubt, watch the original conference recording.