Custom FP4 CUDA Kernel – 129 Tflops on DGX Spark with Pre-Quantized Weight Cache (forums.developer.nvidia.com)
2 points by vkaufmann, 4 months ago