AI Security Research Portal

Evaluation Benchmarks for AI Security

Synthesis

The raw corpus is benchmark-heavy, but two questions remain open: whether the benchmarks measure what they claim to measure (validity), and whether benchmark results transfer to deployed systems.

Evidence Base

This page was generated from the batch ingest of SRC-20260702-raw-papers-batch, SRC-20260702-raw-whitepapers-batch, and SRC-20260702-raw-news-batch. Treat it as a navigation and synthesis page; promote individual statements into claim pages when they become decision-relevant.

Representative Sources

Title | Kind | Date | Tags | Raw
REALM: A Unified Red-Teaming Benchmark for Physical-World VLMs | paper | 2026-06-23 | ai-security, benchmark, multimodal-security, red-teaming, vlm | raw
GLM 5.2 on CyberBT-CTF: The strongest open source contender to Anthropic/OpenAI we have tested | industry_blog | 2026-06-23 | ai-security, cyber-benchmarks, glm-5-2, model-distillation, open-weight-models | raw
We have Mythos at Home: GLM 5.2 beats Claude in our Cyber Benchmarks | industry_blog | 2026-06-22 | ai-security, cyber-benchmarks, glm-5-2, idor, vulnerability-detection | raw
SafeClawBench: Separating Semantic, Audit-Evidence, and Sandbox Harm in Tool-Using LLM Agents | paper | 2026-06-16 | Chao Xu, Hanting Chen, Haocheng Mei, Mengyu Zheng, Xinghao Chen, Ye Yuan | raw
Benchmarking Security Risk Detection and Verification in Open Agentic Skill Ecosystems | paper | 2026-05-30 | Ismail Hossain, Nan Jiang, Sai Puppala, Sajedul Talukder, Zhuoran Lu, agent-skills | raw
SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks | paper | 2026-02 | agent-skills, benchmark, self-evolving-agents, self-generated-skills, skillsbench | raw
SEC-bench: Automated Benchmarking of LLM Agents on Real-World Security Tasks | paper | 2025-06 | ai-for-security, benchmark, llm-agents, security-tasks, vulnerability-reproduction | raw
WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks | paper | 2025-04-22 | Aaron Grattafiori, Arman Zharmagambetov, Chuan Guo, Ivan Evtimov, Kamalika Chaudhuri, benchmark | raw
CVE-Bench: A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities | paper | 2025-03 | ai-for-security, benchmark, cve-bench, exploit-evaluation, llm-agents, web-security | raw
CyberSecEval 4 | whitepaper | 2025 | ai-for-security, autopatchbench, benchmark, cyberseceval, vulnerability-patching | raw
CVE-Bench: Benchmarking LLM-based Software Engineering Agents' Ability to Fix Real-world Vulnerabili | paper | 2025 | ai-for-security, benchmark, cve-bench, software-engineering-agents, vulnerability-repair | raw
Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM Agents | paper | 2024-10-03 | agents, ai-security, benchmark, defenses, memory-poisoning, prompt-injection | raw

Open Questions

Related