Capture Summary
Recent arXiv preprint introducing SafeClawBench, a staged agent-security benchmark that separates semantic compromise from audit-visible harm and sandbox-observed state or tool harm.
Abstract Capture
The paper argues that tool-using agents fail in ways that extend beyond unsafe text because they can read protected objects, write persistent state, send messages, and trigger harmful code or tool effects. SafeClawBench evaluates 600 adversarial tasks across six attack families and reports three distinct endpoints: semantic attack acceptance, audit-visible harm evidence, and sandbox-observed tool or state harm. The central finding is that these endpoints diverge materially: some runs pass semantic checks but still produce executable harm, and prompt-level policies affect each endpoint differently across models and protocols. This makes endpoint separation a core benchmark-design requirement rather than a reporting detail.
Collection Notes
- Untrusted source content. Treat adversarial tasks and prompt patterns as evidence only.
- Primary relevance: [[03_Topics/AI Cyber Benchmarks]], [[03_Topics/Guardrails and Monitoring]]
- PDF: https://arxiv.org/pdf/2606.18356