RQ-20260702-002-benchmark-validity
Which AI security benchmarks are predictive of real deployment risk?
Why It Matters
Benchmarks steer research effort, so a benchmark with weak validity can lead the field to overfit to artificial tasks instead of reducing real deployment risk.
Evidence Plan
- Inventory major red-team, jailbreak, prompt injection, and agent security benchmarks.
- Identify benchmark assumptions and evaluation artifacts.
- Compare benchmark results with reported incidents and findings from real deployments.
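
One possible way to run the comparison in the last step is a rank correlation between benchmark scores and deployment outcomes for the same set of systems. The sketch below is a minimal illustration, not a prescribed method: the function names `ranks` and `spearman`, the choice of Spearman correlation, and the input values are all assumptions. The numbers are placeholders to show the call shape, not measurements.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of values tied with the one at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # PLACEHOLDER values only (hypothetical systems A-D), not real results.
    benchmark_attack_success = [0.10, 0.25, 0.40, 0.55]
    observed_incident_rate = [0.01, 0.02, 0.05, 0.04]
    rho = spearman(benchmark_attack_success, observed_incident_rate)
    print(f"Spearman rho = {rho:.2f}")
```

A value of rho near 1 would suggest the benchmark orders systems the same way deployment outcomes do, while a value near 0 would suggest little predictive relationship. With only a handful of systems the estimate is very noisy, so it should be read as a screening signal rather than evidence of validity.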