Breaking Every Guardrail Everywhere All At Once // AREA41 2026

SUMMARY

Penetration testing for large language model (LLM) applications is often bogged down by a mismatch between available tooling and real-world implementations. Security engineers are frequently faced with features like "summarize my inbox" embedded deep within complex application workflows. Standard red-teaming tools expect direct API access and focus on model-level safety rather than application logic. This disconnect leaves security teams manually crafting custom payloads to test specific environments, such as file uploads to an S3 bucket that trigger asynchronous Lambda functions.

To address this gap, Donato Capitella and their team at ReverSec developed Spiky (Simple Prompt Injection Kit for Evaluation and Exploitation). Spiky is an open-source framework designed to decouple attack payloads from target applications via a thin Python abstraction layer. Instead of firing generic prompts at an inference endpoint, Spiky uses composable datasets to mix base user inputs, jailbreak patterns, and specific malicious instructions like markdown image exfiltration or cross-site scripting. In a demonstration against an inbox summarization tool, Spiky quickly established baseline exploitability. When faced with active defenses like Azure Prompt Shield, the tool launched iterative attacks, using a "Best of N" technique to run up to 10 variations per payload, effectively doubling the attack success rate.

The danger of relying on LLMs for application logic was highlighted in a case study involving a SQL-based banking agent. Developers attempted to enforce data access controls by instructing the LLM to append a user ID WHERE clause to generated database queries. When Capitella bypassed this restriction using an anti-spotlighting technique, the developers responded with a naive output filter checking for the exact string. In response, they deployed an uncensored 35-billion parameter Qwen model running locally on an AMD Strix Halo workstation. Operating as an autonomous attacker, the local model discovered a working jailbreak that bypassed the new filter in just three attempts. Ultimately, the exercise proved that security controls must exist strictly outside the generative loop, relying on deterministic programmatic mechanisms like PostgreSQL row-level security.

KEY TAKEAWAYS

Prompt injection vulnerabilities cannot be mitigated by instructing the language model to behave securely; data access must be enforced programmatically through deterministic mechanisms like PostgreSQL row-level security.
Application-level rate limiting must be implemented on a strict per-user basis, as relying on upstream API provider limits allowed a single user to consume $10,000 worth of tokens in a few hours.
Effective penetration testing of language model integrations requires decoupling the attack payload from the target interface, allowing security teams to test complex, asynchronous event chains involving systems like S3 buckets and Lambda functions.
Automated defensive guardrails like Azure Prompt Shield can be systematically bypassed by automating iterative attacks, such as generating up to 10 noise-injected variations of a payload to locate filter blind spots.
Small, uncensored local models running on standard consumer hardware can operate autonomously to discover and exploit novel logic flaws in enterprise language model applications within minutes.