Show HN: Llmfao – Human-Ranked LLM Leaderboard with Sixty Models

dustalov.github.io

2 points

3 years ago

In September 2023, I noticed a tweet [1] about the difficulties of LLM evaluation, which resonated with me a lot. A bit later, I spotted the nice LLMonitor Benchmarks dataset [2], which has a small set of prompts and a large set of model completions. I decided to try my own evaluation without running a comprehensive suite of hundreds of benchmarks: https://dustalov.github.io/llmfao/

I also wrote a detailed post describing the methodology and analysis: https://evalovernite.substack.com/p/llmfao-human-ranking

[1]: https://twitter.com/_jasonwei/status/1707104739346043143

[2]: https://benchmarks.llmonitor.com/

Unfortunately, I did my analysis before the Mistral AI model was released but published it afterwards. I’d be happy to add it to the comparison if I had its completions.

2 comments
