For how lofty Anthropic’s Mythos claims are, the harness is confusingly stupid.
According to the report, it ranks every file by “how sus it sounds,” loops over each with a curt instruction to “find a bug,” and hands candidates to a judge plus an ASan checker. Zero-days simply pop out.
That should not work.
But it does.
On miniupnp with a $20 plan, Opus 4.6 reliably rediscovered known CVEs in older versions and even surfaced a new remote global buffer overflow (in a non-default config).
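For anyone who wants the shape of that loop spelled out, here is a minimal Python sketch of the rank, prompt, judge, and ASan steps as I read them. It is not the actual harness from the report. Every name in it (`Candidate`, `run_harness`, the `toy_*` stubs, the example file names) is hypothetical, and the model call, the judge, and the sanitizer run are all passed in as plain functions, so nothing here talks to a real API or compiler.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Candidate:
    """One suspected bug (hypothetical structure, not from the report)."""
    path: str          # file the model was pointed at
    report: str        # the model's description of the suspected bug
    reproducer: bytes  # input that is supposed to trigger it


def run_harness(
    files: Iterable[str],
    suspicion: Callable[[str], float],
    find_bug: Callable[[str], Optional[Candidate]],
    judge: Callable[[Candidate], bool],
    asan_confirms: Callable[[Candidate], bool],
    budget: int = 10,
) -> List[Candidate]:
    """Rank files, ask for a bug in each, keep only double-checked candidates."""
    # Step 1: most suspicious-sounding files first.
    ranked = sorted(files, key=suspicion, reverse=True)
    confirmed: List[Candidate] = []
    # Step 2: loop over the top `budget` files with a curt "find a bug" call.
    for path in ranked[:budget]:
        candidate = find_bug(path)
        if candidate is None:
            continue
        # Step 3: a candidate survives only if the judge accepts it AND
        # the sanitizer run confirms it.
        if judge(candidate) and asan_confirms(candidate):
            confirmed.append(candidate)
    return confirmed


# Toy stand-ins so the sketch runs offline. All of them are assumptions.
def toy_suspicion(path: str) -> float:
    # Assumed heuristic: count risky-sounding keywords in the file name.
    keywords = ("parse", "packet", "http", "xml")
    return sum(k in path.lower() for k in keywords)


def toy_find_bug(path: str) -> Optional[Candidate]:
    # Stand-in for the model call: only "finds" something in parser files.
    if "parse" in path:
        return Candidate(path, "suspected out-of-bounds write", b"A" * 64)
    return None


def toy_judge(candidate: Candidate) -> bool:
    # Stand-in for a second model pass that sanity-checks the report.
    return bool(candidate.report.strip())


def toy_asan_confirms(candidate: Candidate) -> bool:
    # Stub. A real check would build the target with -fsanitize=address,
    # feed it the reproducer, and look for an ASan report on stderr.
    return len(candidate.reproducer) > 0
```

Calling `run_harness(["util.c", "parse_packet.c", "http_request.c"], toy_suspicion, toy_find_bug, toy_judge, toy_asan_confirms)` with these stubs returns a single candidate for `parse_packet.c`. Keeping the four steps as injected functions is a choice of this sketch: it makes it easy to swap in better tooling for any one stage without touching the loop.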
So what happens if the harness is actually good, i.e. equipped with proper security tooling?
I’m a student, not a security engineer, so I’d love ideas or critiques on my planned tool roadmap. (If you have a $200 plan with extra usage lying around, try it out and see if it churns out a zero-day in your own C code.)