Efficient Deep Learning Computing
Parallel Computing Techniques
Techniques to enhance computing speed and reduce memory usage:
- Loop optimization: optimize locality and reduce branching overhead.
  - Loop reordering: improves locality by reordering the sequence of loops.
  - Loop tiling: reduces cache misses by partitioning a loop's iteration space into smaller tiles.
  - Loop unrolling: reduces branching overhead at the expense of binary size.
- SIMD (single instruction, multiple data) programming: performs the same operation on multiple data points simultaneously.
- Multithreading: concurrent execution of multiple threads within a single process.
- CUDA programming: uses GPUs to accelerate computation.
Loop optimization
Loop reordering
Improves data locality. In the matrix multiplication below, the original i-j-k order strides down a column of B in the innermost loop; reordering to i-k-j makes the innermost accesses to B and C walk along rows, so consecutive iterations touch adjacent memory (row-major order).
- Before reordering:

```python
for i in range(0, N):
    for j in range(0, N):
        for k in range(0, N):
            C[i][j] += A[i][k] * B[k][j]
```
- After reordering:

```python
for i in range(0, N):
    for k in range(0, N):
        for j in range(0, N):
            C[i][j] += A[i][k] * B[k][j]
```
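The two orderings perform the same accumulations, only in a different order. A minimal sketch to sanity-check this (the speedup itself is most visible in compiled code, where the interpreter overhead does not dominate):

```python
import random

N = 64
A = [[random.random() for _ in range(N)] for _ in range(N)]
B = [[random.random() for _ in range(N)] for _ in range(N)]

def matmul(order):
    C = [[0.0] * N for _ in range(N)]
    if order == "ijk":
        for i in range(N):
            for j in range(N):
                for k in range(N):
                    C[i][j] += A[i][k] * B[k][j]
    else:  # "ikj"
        for i in range(N):
            for k in range(N):
                for j in range(N):
                    C[i][j] += A[i][k] * B[k][j]
    return C

# Same result either way; only the memory access pattern differs.
assert all(
    abs(x - y) < 1e-9
    for rx, ry in zip(matmul("ijk"), matmul("ikj"))
    for x, y in zip(rx, ry)
)
```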
Loop tiling
Reduces cache misses.
- Before tiling:

```python
for i in range(0, N):
    for k in range(0, N):
        for j in range(0, N):
            C[i][j] += A[i][k] * B[k][j]
```
- After tiling:

```python
Tj = Tk = Ti = TILE_SIZE
for i_t in range(0, N, Ti):
    for k_t in range(0, N, Tk):
        for j_t in range(0, N, Tj):
            for i in range(i_t, i_t + Ti):
                for k in range(k_t, k_t + Tk):
                    for j in range(j_t, j_t + Tj):
                        C[i][j] += A[i][k] * B[k][j]
```
The tile size can be determined according to the cache size:
- Fitting the accessed data into the cache -> data is reused from the cache -> cache misses ↓

Extending this further, there are also tiling techniques that account for multilevel caches (one tile size per cache level).
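The tiled loop above assumes N is a multiple of TILE_SIZE. A minimal sketch of the usual boundary handling, clipping each tile with min so it also works for arbitrary N (TILE_SIZE here is an assumed tunable):

```python
TILE_SIZE = 32  # assumed tunable: chosen so the tiles' working set fits in cache

for i_t in range(0, N, TILE_SIZE):
    for k_t in range(0, N, TILE_SIZE):
        for j_t in range(0, N, TILE_SIZE):
            # min() clips the last tile in each dimension when N % TILE_SIZE != 0.
            for i in range(i_t, min(i_t + TILE_SIZE, N)):
                for k in range(k_t, min(k_t + TILE_SIZE, N)):
                    for j in range(j_t, min(j_t + TILE_SIZE, N)):
                        C[i][j] += A[i][k] * B[k][j]
```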
Loop unrolling
Reduces branching overhead.
- Overheads of loop control:
  - Arithmetic operations on loop counters (e.g., i, j, k)
  - End-of-loop test (e.g., k < N)
  - Branch prediction
- Loop unrolling reduces these overheads:
  - Replicate the loop body several times per iteration.
  - Tradeoff: larger binary size in exchange for less loop overhead.
- Before unrolling:

```python
for i in range(0, N):
    for j in range(0, N):
        for k in range(0, N):
            C[i][j] += A[i][k] * B[k][j]
```
- After unrolling by a factor of 4:

```python
for i in range(0, N):
    for j in range(0, N):
        for k in range(0, N, 4):  # step 1 -> 4
            C[i][j] += A[i][k]   * B[k][j]
            C[i][j] += A[i][k+1] * B[k+1][j]
            C[i][j] += A[i][k+2] * B[k+2][j]
            C[i][j] += A[i][k+3] * B[k+3][j]
```
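The unrolled loop above assumes N is a multiple of 4. A minimal sketch of the standard remainder handling, with an epilogue loop for the last N % 4 iterations:

```python
for i in range(0, N):
    for j in range(0, N):
        k = 0
        # Main unrolled loop: four accumulations per loop test.
        while k + 4 <= N:
            C[i][j] += A[i][k]   * B[k][j]
            C[i][j] += A[i][k+1] * B[k+1][j]
            C[i][j] += A[i][k+2] * B[k+2][j]
            C[i][j] += A[i][k+3] * B[k+3][j]
            k += 4
        # Epilogue: handle the remaining N % 4 iterations one at a time.
        while k < N:
            C[i][j] += A[i][k] * B[k][j]
            k += 1
```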
SIMD (Single Instruction Multiple Data)
A parallel processing paradigm that applies a single instruction to multiple data elements simultaneously. Commonly used in modern processors to exploit data-level parallelism.
- Key features:
  - Vector registers: specialized registers that can hold and process multiple data elements.
  - Vector operations: arithmetic and logical operations that work on entire vectors.
- With SIMD programming, we can:
  - Increase computational throughput and speed.
  - Improve energy efficiency.
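In Python, SIMD is typically reached through libraries: NumPy's vectorized operations dispatch to natively compiled kernels that use SIMD instructions where the hardware supports them. A minimal sketch (assuming NumPy is available) contrasting a scalar loop with the vectorized equivalent:

```python
import numpy as np

a = np.random.rand(100_000).astype(np.float32)
b = np.random.rand(100_000).astype(np.float32)

# Scalar version: one multiply-add per interpreted iteration.
def scalar_dot(a, b):
    acc = 0.0
    for x, y in zip(a, b):
        acc += float(x) * float(y)
    return acc

# Vectorized version: a single call into a SIMD-optimized kernel
# that processes several float32 elements per instruction.
def vector_dot(a, b):
    return float(np.dot(a, b))

# Results agree up to float32 accumulation-order differences.
assert abs(scalar_dot(a, b) - vector_dot(a, b)) < 0.1
```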
Multithreading
A technique that runs multiple threads concurrently within a single process to parallelize work.
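A minimal multithreading sketch. In CPython the GIL serializes pure-Python bytecode, so threads pay off mainly when the heavy work happens in native code that releases the GIL (e.g., NumPy kernels); here the output rows of a matrix product are split across a thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

A = np.random.rand(1024, 1024).astype(np.float32)
B = np.random.rand(1024, 1024).astype(np.float32)

def matmul_rows(lo_hi):
    # NumPy releases the GIL inside the matmul kernel, so these
    # calls can actually run in parallel across worker threads.
    lo, hi = lo_hi
    return A[lo:hi] @ B

# Split the 1024 output rows into four 256-row chunks, one per thread.
chunks = [(i, i + 256) for i in range(0, 1024, 256)]
with ThreadPoolExecutor(max_workers=4) as pool:
    C = np.vstack(list(pool.map(matmul_rows, chunks)))

assert np.allclose(C, A @ B, atol=1e-2)
```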
CUDA programming
A technique that uses GPUs to accelerate computation.
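A minimal sketch of a CUDA kernel written in Python with Numba (assumes the numba package and a CUDA-capable GPU; CUDA C++ follows the same thread/block structure):

```python
import numpy as np
from numba import cuda

@cuda.jit
def vec_add(a, b, out):
    # Each GPU thread computes one element of the output.
    i = cuda.grid(1)
    if i < out.size:
        out[i] = a[i] + b[i]

n = 1 << 20
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
out = np.empty_like(a)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
# Numba copies the host arrays to the GPU, launches the grid, and copies back.
vec_add[blocks, threads_per_block](a, b, out)

assert np.allclose(out, a + b)
```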
Inference Optimizations
- Image to Column (Im2col) convolution: rearranges input data to directly utilize matrix multiplication kernels.
- In-place depth-wise convolution: reuses the input buffer to write the output data, reducing peak SRAM usage.
- NHWC for point-wise convolution, NCHW for depth-wise convolution: exploits the appropriate data layout for each type of convolution.
- Winograd convolution: reduces the number of multiplications to enhance computing speed for convolution.
Image to Column
Im2col is a technique that unrolls the input image patches into the rows (or columns) of a matrix, so that convolution can be computed with a single Generalized Matrix Multiplication (GEMM) call of dot products.
- Pro:
  - Utilizes GEMM for convolution.
- Con:
  - Requires additional memory space, since overlapping patches are duplicated in the unrolled matrix.
  - Implicit GEMM can avoid this extra memory: it is a variant of direct convolution that operates directly on the input weight and activation tensors.
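A minimal im2col sketch for a single-channel input and a K×K kernel with stride 1 (function and variable names are illustrative):

```python
import numpy as np

def im2col(x, K):
    """Unroll each K x K patch of a 2-D input into one row of a matrix."""
    H, W = x.shape
    out_h, out_w = H - K + 1, W - K + 1
    cols = np.empty((out_h * out_w, K * K), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = x[i:i + K, j:j + K].ravel()
    return cols

x = np.random.rand(8, 8).astype(np.float32)
w = np.random.rand(3, 3).astype(np.float32)

# Convolution (as cross-correlation) becomes a single GEMM call.
y = (im2col(x, 3) @ w.ravel()).reshape(6, 6)

# Check against the direct sliding-window computation.
ref = np.array([[np.sum(x[i:i+3, j:j+3] * w) for j in range(6)] for i in range(6)])
assert np.allclose(y, ref, atol=1e-5)
```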
In-place depth-wise convolution
Inverted Residual Block
- Many popular neural network models, such as MobileNetV2, have “inverted residual blocks” with depth-wise convolutions which reduce model size and FLOPs, but significantly increase peak memory (3-6x).
Reduce peak memory: because each output channel of a depth-wise convolution depends only on the corresponding input channel, the output can be computed into a small per-channel buffer and written back over that input channel, so full-size input and output buffers never need to coexist.
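A minimal sketch of the in-place idea for a 3×3 depth-wise convolution in CHW layout (stride 1, zero padding; buffer names are illustrative):

```python
import numpy as np

def depthwise3x3_inplace(x, w):
    """In-place 3x3 depth-wise convolution.

    x: activations of shape (C, H, W), overwritten with the output.
    w: per-channel kernels of shape (C, 3, 3).
    """
    C, H, W = x.shape
    tmp = np.empty((H, W), dtype=x.dtype)  # one-channel scratch buffer
    padded = np.zeros((H + 2, W + 2), dtype=x.dtype)
    for c in range(C):
        padded[1:-1, 1:-1] = x[c]
        for i in range(H):
            for j in range(W):
                tmp[i, j] = np.sum(padded[i:i+3, j:j+3] * w[c])
        # Output channel c depends only on input channel c, so it is
        # safe to overwrite the input channel once tmp is complete.
        x[c] = tmp
    return x
```

Peak memory is then the input buffer plus one channel of scratch space, instead of separate full-size input and output buffers.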
NHWC for point-wise convolution, and NCHW for depth-wise convolution
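Point-wise (1×1) convolution is a matrix multiplication over the channel dimension, so the channel-last NHWC layout keeps each pixel's channels contiguous and the GEMM reads sequential memory; depth-wise convolution processes one channel at a time over a spatial neighborhood, so the channel-first NCHW layout keeps each channel's spatial data contiguous. A minimal sketch of a point-wise convolution as a single GEMM in NHWC (shapes are illustrative):

```python
import numpy as np

N, H, W, C_in, C_out = 1, 16, 16, 32, 64
x = np.random.rand(N, H, W, C_in).astype(np.float32)  # NHWC activations
w = np.random.rand(C_in, C_out).astype(np.float32)    # 1x1 convolution kernel

# In NHWC the channels of each pixel are contiguous, so a point-wise
# convolution is one GEMM over a (N*H*W, C_in) matrix.
y = (x.reshape(-1, C_in) @ w).reshape(N, H, W, C_out)
```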
Winograd convolution
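Winograd convolution trades multiplications for additions by transforming input tiles and kernels into a domain where convolution becomes an element-wise product; for a 3×3 kernel, F(2×2, 3×3) computes each 2×2 output tile with 16 multiplications instead of 36. A minimal sketch of the 1-D case F(2, 3), which produces two outputs with 4 multiplications instead of 6 (transform matrices from Lavin & Gray, 2016):

```python
import numpy as np

# Winograd F(2, 3) transform matrices.
Bt = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float32)
G  = np.array([[1.0,  0.0, 0.0],
               [0.5,  0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0,  0.0, 1.0]], dtype=np.float32)
At = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float32)

d = np.random.rand(4).astype(np.float32)  # input tile
g = np.random.rand(3).astype(np.float32)  # 3-tap kernel

# Element-wise product in the transformed domain: 4 multiplications.
y = At @ ((G @ g) * (Bt @ d))

# Reference: direct sliding-window dot products (6 multiplications).
ref = np.array([d[0:3] @ g, d[1:4] @ g])
assert np.allclose(y, ref, atol=1e-5)
```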
Reference
- TinyML and Efficient Deep Learning Computing (MIT 6.5940), Song Han, Fall 2023
- Full-Stack, GPU-based Acceleration of Deep Learning
- Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better