Agent-evals: Metacognitive scoring and boundary testing for LLM coding agents (thinkwright.ai)
2 points by oceanwaves 4 months ago