I built LLMuxer because I kept defaulting to GPT-4o for everything, even simple tasks where a smaller, cheaper model would have done just fine.
It runs your prompts or dataset across multiple models (classification tasks only for now), compares performance against cost, and recommends the best-value model so you’re not wasting tokens or budget.
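To make "best value" concrete, here is a rough sketch of one way such a recommendation could work: pick the cheapest model whose accuracy clears a threshold you set. This is not LLMuxer's actual code or API. The names `ModelResult` and `pick_best_value`, the model names, and all the numbers are made-up placeholders for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ModelResult:
    name: str                # hypothetical model identifier
    accuracy: float          # fraction of classification examples labeled correctly
    cost_per_million: float  # USD per million tokens (placeholder value)


def pick_best_value(results: List[ModelResult], min_accuracy: float) -> Optional[ModelResult]:
    """Return the cheapest model that meets the accuracy threshold, or None."""
    eligible = [r for r in results if r.accuracy >= min_accuracy]
    if not eligible:
        return None
    # Cheapest first; if two models cost the same, prefer the more accurate one.
    return min(eligible, key=lambda r: (r.cost_per_million, -r.accuracy))


# Illustrative numbers only, not real benchmark results.
results = [
    ModelResult("model-large", accuracy=0.95, cost_per_million=10.0),
    ModelResult("model-medium", accuracy=0.92, cost_per_million=1.0),
    ModelResult("model-small", accuracy=0.81, cost_per_million=0.2),
]

best = pick_best_value(results, min_accuracy=0.90)
print(best.name if best else "no model met the threshold")
```

With these placeholder numbers, the mid-sized model is picked: the small one misses the 0.90 bar, and the large one costs ten times more for three extra points of accuracy. A threshold is only one possible rule; a weighted accuracy-per-dollar score would be another.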
Would love feedback and ideas for features you’d want before using it in your own workflow!