Capture Summary
Paper proposing automated benchmarking of LLM agents on real-world security tasks, covering scalable task construction and vulnerability reproduction.
Relevance
- Relevant to analysis of benchmark automation and of reproducibility challenges.
- Helps identify how security benchmarks can accidentally reward scaffold shortcuts instead of real capability.
Collection Notes
- Captured as PDF only, from a search result; verify the canonical abstract page during ingest.