FP8 runs ~100 TFLOPS faster when the kernel name has "cutlass" in it
github.com/triton-lang | 4 points | mmastrac | a year ago