We published LABE, a public benchmark that evaluates legal AI at the exact point where a system is about to take a real high-impact action.
Current result:
Baseline executed 18 unjustified high-impact action points.
With VerifiedX, that dropped to 0.
False blocks in the current suite: 0.
Surviving-goal completion improved from 41.7% to 100%.
Same harness, same prompts, same playbooks, baseline vs VerifiedX.
Legal is the first public instance. The same method applies to support, healthcare RCM, procurement, and finance too.
Repo, methodology, and raw artifacts are public: https://github.com/bigkan8/legal-action-boundary-eval