A new CUDA kernel for quantized LLMs achieves up to 2.6x latency improvementsgithub.com/HanGuo972 pointsradichoml2 years ago