Why do LLM outputs get worse even when metrics stay stable? [pdf] (huggingface.co) | 4 points | scaledsystems | 2 months ago