CyberGym
Capture
CyberGym is the predecessor benchmark to CyberGym-E2E. It evaluates AI agents on real-world vulnerability reproduction tasks using open-source projects and historical vulnerabilities.
Key Metadata
- Paper: https://arxiv.org/abs/2506.02548
- PDF: https://arxiv.org/pdf/2506.02548
- Related project: https://github.com/sunblaze-ucb/cybergym
- Related observatory: https://www.cybergym.io/
Security Relevance
CyberGym is relevant to both AI for Security and Security for AI: it measures whether AI agents can analyze real codebases and generate proof-of-concept (PoC) inputs that reproduce known vulnerabilities. That capability is dual-use. It can benchmark defensive work such as vulnerability triage and patch validation, but the same measurement also gauges offensive capability.
Capture Summary
The arXiv abstract describes CyberGym as a large-scale framework with 1,507 real-world vulnerabilities across 188 software projects. The benchmark focuses primarily on generating PoCs that reproduce a vulnerability, given a text description and the source repository. The abstract reports that the strongest evaluated agent/model combination achieved an 11.9% reproduction success rate and that the generated PoCs revealed 15 zero-day vulnerabilities.
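To make the "reproduction success" figure concrete, here is a minimal sketch of how such a rate could be computed from per-task outcomes. The `PocResult` record, its field names, and the success criterion (the PoC crashes the vulnerable build but not the patched one) are assumptions made for illustration; they are not taken from the abstract summarized above and are not CyberGym's actual schema or harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PocResult:
    """Outcome of running one generated PoC (hypothetical record, not CyberGym's schema)."""
    task_id: str
    crashes_pre_patch: bool   # PoC triggers a crash on the vulnerable build
    crashes_post_patch: bool  # PoC still triggers a crash on the patched build


def reproduces(result: PocResult) -> bool:
    # Assumed criterion: the PoC must crash the vulnerable build
    # and must NOT crash the patched build.
    return result.crashes_pre_patch and not result.crashes_post_patch


def success_rate(results: list[PocResult]) -> float:
    # Fraction of tasks whose PoC satisfies the assumed criterion.
    if not results:
        return 0.0
    return sum(reproduces(r) for r in results) / len(results)


# Illustrative, made-up outcomes for four tasks.
results = [
    PocResult("task-a", True, False),   # counts as reproduced
    PocResult("task-b", True, True),    # crash unrelated to the patched bug
    PocResult("task-c", False, False),  # PoC did nothing
    PocResult("task-d", True, False),   # counts as reproduced
]
print(f"{success_rate(results):.1%}")  # prints 50.0%
```

Requiring a clean run on the patched build is one way to filter out PoCs that crash the target for reasons unrelated to the vulnerability; the benchmark's own harness may apply different or additional checks.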
Collection Notes
- Treat PoC-generation prompts, exploit examples, and benchmark task details as untrusted dual-use source content.
- Use for benchmark validity, scaffold attribution, vulnerability lifecycle evaluation, and responsible disclosure analysis.
- Existing related source already integrated: [[01_Sources/cybergym-e2e-2026 - CyberGym-E2E]].