skidrow

Born on July 02, 2024•388 Karma

on July 11, 2024

•on: Beating NumPy matrix multiplication in 150 lines o...

SIMD intrinsics and manually unrolled loops are definitely needed. That's why optimized BLAS libraries vectorize and unroll their kernels by hand. Even modern compilers can't reliably auto-vectorize and unroll every loop.

skidrow•

on July 11, 2024

•on: Beating NumPy matrix multiplication in 150 lines o...

Are OpenBLAS and MKL not well optimized, lol? They literally compared against OpenBLAS/MKL and posted the results in the article. As someone already mentioned, this implementation is faster than MKL even on an Intel Xeon with 96 cores. Maybe you missed the point, but the purpose of the article was to show HOW to implement matmul without FORTRAN/ASSEMBLY code with NumPy-like performance, NOT how to write a BLIS-competitive library. So the article and the code LGTM.

skidrow•

on July 11, 2024

•on: Beating NumPy matrix multiplication in 150 lines o...

Look, it's indeed a reasonable comparison. They use matrix sizes up to M=N=K=5000, so the FFI overhead is negligible. What's the point of comparing NumPy with BLAS if NumPy already uses BLAS under the hood?
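
To put rough numbers on the overhead argument, here is a back-of-the-envelope sketch. The throughput (1 TFLOP/s) and the per-call overhead (1 microsecond) are assumptions picked for illustration, not measurements from the article, and the function names are made up for this example.

```python
def matmul_flops(m: int, n: int, k: int) -> int:
    """FLOPs for an (m x k) @ (k x n) product: one multiply and one add per inner step."""
    return 2 * m * n * k


def overhead_fraction(m: int, n: int, k: int,
                      flops_per_sec: float, call_overhead_sec: float) -> float:
    """Share of total wall time spent on a fixed per-call overhead."""
    compute_sec = matmul_flops(m, n, k) / flops_per_sec
    return call_overhead_sec / (compute_sec + call_overhead_sec)


# Assumed values, for illustration only (not taken from the article).
ASSUMED_FLOPS_PER_SEC = 1e12      # hypothetical sustained throughput
ASSUMED_CALL_OVERHEAD_SEC = 1e-6  # hypothetical fixed cost of one FFI call

for size in (10, 100, 1000, 5000):
    frac = overhead_fraction(size, size, size,
                             ASSUMED_FLOPS_PER_SEC, ASSUMED_CALL_OVERHEAD_SEC)
    print(f"M=N=K={size}: call overhead is about {frac:.1e} of total time")
```

Under these assumptions the fixed call cost dominates for tiny matrices and is a few parts per million at M=N=K=5000, where the product takes 2 * 5000^3 = 2.5e11 floating-point operations.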

skidrow•

on July 11, 2024

•on: Beating NumPy matrix multiplication in 150 lines o...

Their implementation outperforms not only the recent version of OpenBLAS but also MKL on their machine (these are the DEFAULT BLAS libraries shipped with NumPy). What's the point of comparing against BLIS if NumPy doesn't use it by default? The authors explicitly say: "We compare against NumPy". They use matrix sizes up to M=N=K=5000, so the FFI overhead is, in fact, negligible.