Implementing Matrix Multiplication on CPU from Scratch
The goal of this project isn't to write a competitive BLAS implementation, but to learn about performance optimization. Starting from a naive implementation, I applied various techniques step by step to significantly improve performance. An AI coding assistant helped me along the way.
The complete code can be found at the link below:
Matmul implementation from scratch
References
I’ve curated a list of excellent articles for learning these concepts; more details can be found in the links below:
- Fast Multidimensional Matrix Multiplication on CPU from Scratch (Simon Boehm, 2022)
- Simon is a performance engineer at Anthropic. His article covers most of the optimizations discussed here. Highly recommended for its concise style and excellent illustrations.
- Optimizing Matrix Multiplication: Discovering Optimizations One at a Time (Michal Pitr, 2025)
- Matrix Multiplication Deep Dive: Cache Blocking, SIMD & Parallelization (Aliaksei Sala, CppCon 2025)
- MIT 6.172: Performance Engineering of Software Systems
- A fantastic introduction to performance engineering. I audited this course while writing this post.
- how-to-optimize-gemm
- OpenBLAS - Highly optimized open-source kernels.
- BLISlab: A Sandbox for Optimizing GEMM (Jianyu Huang, Meta)
- GEMM: From Pure C to SSE Optimized Micro Kernels