Sources | source | seed | 2026-07-04 | ai-security, cybergym, ai-for-security, cyber-benchmark, vulnerability-reproduction, ai-agents, oss-fuzz

CyberGym

Capture

CyberGym is the predecessor benchmark to CyberGym-E2E. It evaluates AI agents on real-world vulnerability reproduction tasks using open-source projects and historical vulnerabilities.

Key Metadata

Paper: https://arxiv.org/abs/2506.02548
PDF: https://arxiv.org/pdf/2506.02548
Related project: https://github.com/sunblaze-ucb/cybergym
Related observatory: https://www.cybergym.io/

Security Relevance

CyberGym is relevant to both AI for Security and Security for AI because it measures whether AI agents can analyze real codebases and generate proof-of-concept (PoC) inputs that reproduce known vulnerabilities. The capability is dual-use: the same benchmark can gauge defensive work such as vulnerability triage and patch validation, and it also measures offensive capability.

Capture Summary

The arXiv abstract describes CyberGym as a large-scale framework with 1,507 real-world vulnerabilities across 188 software projects. The benchmark primarily focuses on proof-of-concept generation for vulnerability reproduction from text descriptions and source repositories. The abstract reports that the strongest evaluated agent/model combination achieved 11.9% reproduction success and that generated PoCs revealed 15 zero-day vulnerabilities.
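To put the headline figures in proportion, the short sketch below converts the reported rate into approximate task counts. Only the three constants come from the abstract as captured here; the helper name `approx_solved` and the rounding are assumptions made for illustration, since the abstract quotes a percentage rather than a count.

```python
# Figures as quoted from the arXiv abstract in this note.
TOTAL_TASKS = 1507          # real-world vulnerabilities in the benchmark
PROJECTS = 188              # software projects covered
BEST_SUCCESS_RATE = 0.119   # strongest agent/model combination


def approx_solved(total: int, rate: float) -> int:
    """Approximate task count implied by a reported success rate (hypothetical helper)."""
    return round(total * rate)


solved = approx_solved(TOTAL_TASKS, BEST_SUCCESS_RATE)
unsolved = TOTAL_TASKS - solved
avg_per_project = TOTAL_TASKS / PROJECTS

print(f"about {solved} reproduced, about {unsolved} not reproduced")
print(f"about {avg_per_project:.1f} vulnerabilities per project on average")
```

Read this way, the best reported configuration reproduced roughly 179 of the 1,507 tasks, which is useful context when weighing benchmark validity and remaining headroom.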

Collection Notes

Treat PoC-generation prompts, exploit examples, and benchmark task details as untrusted dual-use source content.
Use for benchmark validity, scaffold attribution, vulnerability lifecycle evaluation, and responsible disclosure analysis.
Related source already integrated: [[01_Sources/cybergym-e2e-2026 - CyberGym-E2E]].