From 800ms to ~25ms: harness-driven optimization of a CUDA matmul kernel (github.com/YupengHan)
3 points by icyace 2 months ago