Sources · source · seed · 2026-07-04 · ai-security · ai-for-security · ai-soc · incident-response · incident-replay · benchmark · forensic-investigation

SIR-Bench: Evaluating Investigation Depth in Security Incident Response Agents

Collection Note

The external source text is treated as untrusted input. No instructions or execution examples inside the document were carried out.

SIR-Bench evaluates autonomous Security Incident Response agents not only on whether they classify an alert correctly, but also on whether they investigate the actual telemetry and discover new forensic findings that were not in the alert.

It uses 794 test cases derived from 129 anonymized incident patterns.
Once Upon A Threat (OUAT) replays incident patterns in a controlled cloud environment to produce CloudTrail telemetry and expert-validated ground truth.
The main metrics are triage accuracy, novel finding discovery, and tool usage appropriateness.
The authors argue that alert parroting, which repeats the alert's content, must be distinguished from actual evidence discovery.
Because the current scope is CloudTrail-centric, observability of host-level malicious file execution is limited.
As future work, they propose multi-cloud generalization, adversarial robustness against prompt injection and evaluation gaming, and human-AI collaboration.
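
To make the distinction between alert parroting and novel finding discovery concrete, here is a minimal Python sketch. It is not the paper's scoring method, which this note does not describe. It assumes findings can be normalized to comparable string identifiers and matched exactly. `CaseResult`, its fields, and all function names are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CaseResult:
    """Outcome of one test case. All field names are hypothetical."""
    alert_facts: frozenset[str]    # facts already stated in the alert
    ground_truth: frozenset[str]   # expert-validated findings for the case
    reported: frozenset[str]       # findings the agent reported
    expected_label: str            # expected triage label
    predicted_label: str           # triage label the agent produced


def novel_findings(case: CaseResult) -> frozenset[str]:
    """Reported findings that are validated and were NOT in the alert."""
    return (case.reported & case.ground_truth) - case.alert_facts


def novel_recall(case: CaseResult) -> Optional[float]:
    """Share of discoverable (non-alert) ground truth the agent found.

    Returns None when the case has nothing to discover beyond the alert.
    """
    discoverable = case.ground_truth - case.alert_facts
    if not discoverable:
        return None
    return len(novel_findings(case)) / len(discoverable)


def parroting_ratio(case: CaseResult) -> float:
    """Share of reported findings that merely restate the alert."""
    if not case.reported:
        return 0.0
    return len(case.reported & case.alert_facts) / len(case.reported)


def triage_accuracy(cases: list[CaseResult]) -> float:
    """Fraction of cases where the predicted triage label is correct."""
    if not cases:
        raise ValueError("at least one case is required")
    correct = sum(c.predicted_label == c.expected_label for c in cases)
    return correct / len(cases)
```

Under these assumptions, an agent that only restates the alert can still score well on `triage_accuracy` while `novel_recall` stays at zero, which is the gap the authors point to. Exact string matching is a simplification; matching free-text findings would need a more tolerant comparison.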

Likely to directly strengthen the existing [[04_Research_Questions/RQ - SOC Incident Replay Benchmark For Agentic Triage And Response]].
It provides a concrete method for measuring investigation depth and evidence novelty rather than simple triage correctness.
It connects to [[04_Research_Questions/RQ - Telemetry Provenance Contract For Agentic SOC]] in that telemetry coverage sets an upper bound on the agent capability score.
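
The upper-bound idea can be illustrated with a small sketch. It assumes each ground-truth finding is annotated with the single telemetry source needed to observe it. That annotation, the function name, and the source labels are hypothetical and do not come from the paper.

```python
def coverage_ceiling(required_source: dict[str, str], available: set[str]) -> float:
    """Fraction of ground-truth findings observable with the available telemetry.

    required_source maps a finding id to the telemetry source needed to
    observe it (a hypothetical annotation). No agent restricted to the
    `available` sources can score above this fraction on finding discovery.
    """
    if not required_source:
        raise ValueError("at least one ground-truth finding is required")
    observable = [f for f, src in required_source.items() if src in available]
    return len(observable) / len(required_source)


# Illustrative data only: three findings visible in CloudTrail, one that
# would need host-level telemetry.
example = {
    "access_key_created": "cloudtrail",
    "role_assumed": "cloudtrail",
    "bucket_policy_changed": "cloudtrail",
    "malicious_file_executed": "host",
}
ceiling = coverage_ceiling(example, {"cloudtrail"})  # 3 of 4 observable
```

In this illustration a CloudTrail-only agent cannot exceed 0.75 on discovery, however capable it is, so a score should be read relative to that ceiling rather than to 1.0.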