Implementing Matrix Multiplication on CPU from Scratch
The goal of this project isn't to write a competitive BLAS implementation, but to learn about performance optimization. Starting from a naive implementation, I applied various techniques step by step to significantly improve performance. An AI coding assistant helped me along the way.
The complete code can be found at the link below:
Matmul implementation from scratch
References
I’ve curated a list of excellent articles for learning these concepts; more details can be found in the links below:
- Fast Multidimensional Matrix Multiplication on CPU from Scratch (Simon Boehm, 2022)
- Simon is a performance engineer at Anthropic. His article covers most of the optimizations discussed here. Highly recommended for its concise style and excellent illustrations.
- Optimizing Matrix Multiplication: Discovering Optimizations One at a Time (Michal Pitr, 2025)
- Matrix Multiplication Deep Dive: Cache Blocking, SIMD & Parallelization (Aliaksei Sala, CppCon 2025)
- MIT 6.172: Performance Engineering of Software Systems
- A fantastic introduction to performance engineering. I audited this course while writing this post.
- how-to-optimize-gemm
- OpenBLAS - Highly optimized open-source kernels.
- BLISlab: A Sandbox for Optimizing GEMM (Jianyu Huang, Meta)
- GEMM: From Pure C to SSE Optimized Micro Kernels