Author here. We built this because one AI reviewing code and checking its own work just agrees with itself — the research confirms LLMs can't reliably self-correct on hard reasoning.
The fix: split it into two agents with opposed goals. A reviewer agent finds problems.
A dev agent tries to disprove each finding with specific counter-evidence from the codebase.
Three verdicts: VALID, INVALID, AMBIGUOUS. Only what survives reaches your team.
Agents are auto-generated per service — point the skill at your repo, it scans your stack and produces the full set. ~30 min of human tuning per service to add tribal knowledge. Repo includes the shared preamble (quality constitution), reviewer/dev templates, and the auto-generation skill. Happy to answer questions.