are OpenBLAS and MKL not well optimized lol? They literally compared against OpenBLAS/MKL and posted the results in the article. As someone already mentioned, this implementation is faster than MKL even on a Intel Xeon with 96 cores. Maybe you missed the point, but the purpose of the arcticle was to show HOW to implement matmul without FORTRAN/ASSEMBLY code with NumPy-like performance. NOT how to write a BLIS-competitive library. So the article and the code seem to be LGTM.