Implementing Matrix Multiplication on CPU from Scratch

The goal of this project isn’t to write a competitive BLAS implementation but to learn about performance optimization. Starting from a naive implementation, I applied optimization techniques one step at a time to significantly improve performance. An AI coding assistant helped me along the way.

The complete code can be found at the link below:

Matmul implementation from scratch

References

I’ve curated a list of excellent articles that cover these concepts in more detail: