Capture Summary
Benchmark framework for evaluating language-model agents on professional-level Capture the Flag (CTF) tasks. Covers task environments, subtasks, and comparisons across models and agent scaffolds.
Relevance
- Core source for AI cyber capability evaluation.
- Useful for gap analysis of benchmark realism, reproducibility, and where to draw the boundary between defensive and offensive use.
Collection Notes
- Related site: https://cybench.github.io/
- Extract benchmark design assumptions and task validity limitations.