Evaluation Benchmarks for AI Security

Synthesis

The raw corpus is benchmark-heavy, but whether these benchmarks are valid and whether their results transfer to deployment remain open questions.

Evidence Base

This page was generated from the batch ingest of SRC-20260702-raw-papers-batch, SRC-20260702-raw-whitepapers-batch, and SRC-20260702-raw-news-batch. Treat it as a navigation and synthesis page; promote individual statements into claim pages when they become decision-relevant.

Representative Sources

Title	Kind	Date	Tags / Authors	Raw
REALM: A Unified Red-Teaming Benchmark for Physical-World VLMs	paper	2026-06-23	ai-security, benchmark, multimodal-security, red-teaming, vlm	raw
GLM 5.2 on CyberBT-CTF: The strongest open source contender to Anthropic/OpenAI we have tested	industry_blog	2026-06-23	ai-security, cyber-benchmarks, glm-5-2, model-distillation, open-weight-models	raw
We have Mythos at Home: GLM 5.2 beats Claude in our Cyber Benchmarks	industry_blog	2026-06-22	ai-security, cyber-benchmarks, glm-5-2, idor, vulnerability-detection	raw
SafeClawBench: Separating Semantic, Audit-Evidence, and Sandbox Harm in Tool-Using LLM Agents	paper	2026-06-16	Chao Xu, Hanting Chen, Haocheng Mei, Mengyu Zheng, Xinghao Chen, Ye Yuan	raw
Benchmarking Security Risk Detection and Verification in Open Agentic Skill Ecosystems	paper	2026-05-30	Ismail Hossain, Nan Jiang, Sai Puppala, Sajedul Talukder, Zhuoran Lu, agent-skills	raw
SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks	paper	2026-02	agent-skills, benchmark, self-evolving-agents, self-generated-skills, skillsbench	raw
SEC-bench: Automated Benchmarking of LLM Agents on Real-World Security Tasks	paper	2025-06	ai-for-security, benchmark, llm-agents, security-tasks, vulnerability-reproduction	raw
WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks	paper	2025-04-22	Aaron Grattafiori, Arman Zharmagambetov, Chuan Guo, Ivan Evtimov, Kamalika Chaudhuri, benchmark	raw
CVE-Bench: A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities	paper	2025-03	ai-for-security, benchmark, cve-bench, exploit-evaluation, llm-agents, web-security	raw
CyberSecEval 4	whitepaper	2025	ai-for-security, autopatchbench, benchmark, cyberseceval, vulnerability-patching	raw
CVE-Bench: Benchmarking LLM-based Software Engineering Agents' Ability to Fix Real-world Vulnerabili	paper	2025	ai-for-security, benchmark, cve-bench, software-engineering-agents, vulnerability-repair	raw
Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM Agents	paper	2024-10-03	agents, ai-security, benchmark, defenses, memory-poisoning, prompt-injection	raw

Open Questions

Which evidence in this cluster is strong enough to support a stable claim?
Which results generalize across models, tools, and deployment settings?
Which controls have been evaluated under realistic adversarial conditions?
