Evaluation Benchmarks for AI Security
Synthesis
The raw corpus is benchmark-heavy, but whether these benchmarks are valid and whether their results transfer to deployment remain open questions.
Evidence Base
This page was generated from the batch ingest of SRC-20260702-raw-papers-batch, SRC-20260702-raw-whitepapers-batch, and SRC-20260702-raw-news-batch. Treat it as a navigation and synthesis page; promote individual statements into claim pages when they become decision-relevant.
Representative Sources
| Title | Kind | Date | Tags / Authors | Raw |
|---|---|---|---|---|
| REALM: A Unified Red-Teaming Benchmark for Physical-World VLMs | paper | 2026-06-23 | ai-security, benchmark, multimodal-security, red-teaming, vlm | raw |
| GLM 5.2 on CyberBT-CTF: The strongest open source contender to Anthropic/OpenAI we have tested | industry_blog | 2026-06-23 | ai-security, cyber-benchmarks, glm-5-2, model-distillation, open-weight-models | raw |
| We have Mythos at Home: GLM 5.2 beats Claude in our Cyber Benchmarks | industry_blog | 2026-06-22 | ai-security, cyber-benchmarks, glm-5-2, idor, vulnerability-detection | raw |
| SafeClawBench: Separating Semantic, Audit-Evidence, and Sandbox Harm in Tool-Using LLM Agents | paper | 2026-06-16 | Chao Xu, Hanting Chen, Haocheng Mei, Mengyu Zheng, Xinghao Chen, Ye Yuan | raw |
| Benchmarking Security Risk Detection and Verification in Open Agentic Skill Ecosystems | paper | 2026-05-30 | Ismail Hossain, Nan Jiang, Sai Puppala, Sajedul Talukder, Zhuoran Lu, agent-skills | raw |
| SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks | paper | 2026-02 | agent-skills, benchmark, self-evolving-agents, self-generated-skills, skillsbench | raw |
| SEC-bench: Automated Benchmarking of LLM Agents on Real-World Security Tasks | paper | 2025-06 | ai-for-security, benchmark, llm-agents, security-tasks, vulnerability-reproduction | raw |
| WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks | paper | 2025-04-22 | Aaron Grattafiori, Arman Zharmagambetov, Chuan Guo, Ivan Evtimov, Kamalika Chaudhuri, benchmark | raw |
| CVE-Bench: A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities | paper | 2025-03 | ai-for-security, benchmark, cve-bench, exploit-evaluation, llm-agents, web-security | raw |
| CyberSecEval 4 | whitepaper | 2025 | ai-for-security, autopatchbench, benchmark, cyberseceval, vulnerability-patching | raw |
| CVE-Bench: Benchmarking LLM-based Software Engineering Agents' Ability to Fix Real-world Vulnerabili | paper | 2025 | ai-for-security, benchmark, cve-bench, software-engineering-agents, vulnerability-repair | raw |
| Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM Agents | paper | 2024-10-03 | agents, ai-security, benchmark, defenses, memory-poisoning, prompt-injection | raw |
Open Questions
- Which evidence in this cluster is strong enough to support a stable claim?
- Which results generalize across models, tools, and deployment settings?
- Which controls have been evaluated under realistic adversarial conditions?