Capture Summary
Recent preprint evaluating white-box and black-box automated prompt injection attacks against LLM agents in AgentDojo.
Abstract Capture
The paper adapts GCG- and TAP-style automated attacks to indirect prompt injection in agentic settings and evaluates them across 80 task pairs in four domains. It reports three findings: black-box optimization outperforms gradient-based methods under practical compute budgets; attacks can transfer across tasks and to out-of-distribution domains; and attacks optimized on smaller open-source models do not transfer reliably to frontier models such as GPT-5.
Collection Notes
- Untrusted source content. Treat attack-generation methods as research evidence only.
- Primary relevance: [[03_Topics/Prompt Injection]], [[03_Topics/Evaluations and Benchmarks]]
- PDF: https://arxiv.org/pdf/2606.10525