I built this because I needed to benchmark LLM inference endpoints and the existing tools required Python environments. I wanted a single binary I could grab quickly on any server.
I've also become interested in performance metrics like time to first token, inter-token latency, and throughput, and wanted a tool focused on just those.
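For anyone unfamiliar with those three metrics, here is a rough sketch of how they can be derived from per-token arrival times in a streamed response. It is not llmnop's actual code: the `StreamMetrics` type and `stream_metrics` function are hypothetical names for illustration, and it assumes throughput means total tokens divided by total request time (definitions vary between tools).

```rust
use std::time::Duration;

/// Latency metrics for one streamed response (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
pub struct StreamMetrics {
    /// Time from sending the request until the first token arrived.
    pub ttft: Duration,
    /// Mean gap between consecutive tokens; `None` if fewer than two tokens arrived.
    pub mean_inter_token_latency: Option<Duration>,
    /// Tokens received divided by total elapsed time (one possible definition).
    pub tokens_per_second: f64,
}

/// `arrivals[i]` is the time elapsed between sending the request and receiving
/// token `i`. The slice is assumed to be in non-decreasing order.
/// Returns `None` when no tokens arrived.
pub fn stream_metrics(arrivals: &[Duration]) -> Option<StreamMetrics> {
    let first = *arrivals.first()?;
    let last = *arrivals.last()?;
    let n = arrivals.len();

    // Inter-token latency: the span after the first token, averaged over the gaps.
    let mean_inter_token_latency = if n > 1 {
        Some(last.saturating_sub(first) / (n as u32 - 1))
    } else {
        None
    };

    // Throughput over the whole request, including the wait for the first token.
    let total_secs = last.as_secs_f64();
    let tokens_per_second = if total_secs > 0.0 {
        n as f64 / total_secs
    } else {
        0.0
    };

    Some(StreamMetrics {
        ttft: first,
        mean_inter_token_latency,
        tokens_per_second,
    })
}
```

Feeding it four tokens arriving at 100, 150, 200, and 250 ms gives a time to first token of 100 ms, a mean inter-token latency of 50 ms, and 16 tokens per second.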
llmnop is written in Rust and was initially modeled after LLMPerf, which was archived last month. LLMPerf predates reasoning models and doesn't handle them correctly.
This release adds support for reasoning models like DeepSeek-R1, Qwen3, and gpt-oss. It now separates reasoning tokens from output tokens so your metrics actually mean something.
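To show why the split matters, here is a minimal sketch of counting reasoning and output tokens separately. Again, this is an illustration and not llmnop's implementation: `TokenKind`, `SplitCounts`, and `split_tokens` are hypothetical names, and it assumes each streamed token has already been labeled as reasoning or output by whatever parses the response.

```rust
use std::time::Duration;

/// Which part of the response a token belongs to (hypothetical labeling).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TokenKind {
    Reasoning,
    Output,
}

/// Per-request tallies with reasoning and output kept apart (hypothetical type).
#[derive(Debug, Default, PartialEq, Eq)]
pub struct SplitCounts {
    pub reasoning_tokens: usize,
    pub output_tokens: usize,
    /// Arrival time of the first token of any kind.
    pub time_to_first_token: Option<Duration>,
    /// Arrival time of the first non-reasoning token.
    pub time_to_first_output_token: Option<Duration>,
}

/// `events` holds (time since request start, token kind) in arrival order.
pub fn split_tokens(events: &[(Duration, TokenKind)]) -> SplitCounts {
    let mut counts = SplitCounts::default();
    for &(at, kind) in events {
        // The first token of any kind sets the overall time to first token.
        if counts.time_to_first_token.is_none() {
            counts.time_to_first_token = Some(at);
        }
        match kind {
            TokenKind::Reasoning => counts.reasoning_tokens += 1,
            TokenKind::Output => {
                counts.output_tokens += 1;
                // Only the first output token sets this timestamp.
                if counts.time_to_first_output_token.is_none() {
                    counts.time_to_first_output_token = Some(at);
                }
            }
        }
    }
    counts
}
```

With two reasoning tokens at 100 and 150 ms followed by three output tokens starting at 400 ms, a tool that ignores the distinction would report 5 output tokens and a 100 ms time to first token, while the split view shows 3 output tokens with the first one arriving at 400 ms.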
Previous discussion: https://news.ycombinator.com/item?id=44565477