I built AgentSafety to test whether autonomous coding agents make the right safety call: allow, ask, or refuse.
v0.1 includes 50 benchmark cases and focuses on practical failure modes like prompt injection, secret access, destructive commands, out-of-workspace writes, dependency installs, and ambiguous intent. It also includes a policy baseline plus reproducible run artifacts and comparison reports.
I’d really value feedback on case quality, labeling/scoring, and what’s missing for real-world agent evaluation.