Benchmark for measuring how well AI agents perform at ML engineering (github.com/openai)
1 point | zerojames | 2 years ago