FP8 runs ~100 TFLOPS faster when the kernel name has "cutlass" in it (github.com/triton-lang)
338 points by mmastrac 9 months ago