Efficient Deep Learning Computing


Parallel Computing Techniques

Parallel Computing Techniques to enhance computing speed and reduce memory usage

  • Loop optimization: Optimize locality and reduce branching overhead
    • Loop reordering: Optimizes locality by reordering the sequence of loops.
    • Loop tiling: Reduces memory access by partitioning a loop’s iteration space.
    • Loop unrolling: Reduces branching overhead at the expense of increased binary size.
  • SIMD (single instruction, multiple data) programming:
    • Performs the same operation on multiple data points simultaneously.
  • Multithreading:
    • Concurrent execution of multiple threads within a single process.
  • CUDA programming:
    • Use GPUs to accelerate computation.

Loop optimization

Loop reordering

Improve data locality: with the i-k-j order, the innermost loop walks C[i][j] and B[k][j] along contiguous rows (row-major layout), so consecutive accesses stay within the same cache lines instead of striding down a column of B.

e.g.,

  • Before reordering.
for i in range(0, N):
  for j in range(0, N):
    for k in range(0, N):
       C[i][j] += A[i][k] * B[k][j]
  • After reordering.
for i in range(0, N):
  for k in range(0, N):
    for j in range(0, N):
       C[i][j] += A[i][k] * B[k][j]

Loop tiling

Reduce cache misses

  • Before tiling
for i in range(0, N):
  for k in range(0, N):
    for j in range(0, N):
      C[i][j] += A[i][k] * B[k][j]
  • After tiling
Ti = Tj = Tk = TILE_SIZE
for i_t in range(0, N, Ti):
  for k_t in range(0, N, Tk):
    for j_t in range(0, N, Tj):
      for i in range(i_t, i_t + Ti):
        for k in range(k_t, k_t + Tk):
          for j in range(j_t, j_t + Tj):
            C[i][j] += A[i][k] * B[k][j]

The tile size can be determined according to the cache size.

  • Fitting the accessed data into the cache -> the data gets reused from the cache -> cache misses ↓
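As a rough sizing sketch (the 32 KB L1 cache and float32 element size below are assumptions, not values from the source), the tile can be chosen so that one TILE_SIZE x TILE_SIZE block each of A, B, and C fits in the cache:

import math

CACHE_BYTES = 32 * 1024   # assumed L1 data cache size
ELEM_BYTES = 4            # float32
# three TILE_SIZE x TILE_SIZE tiles (for A, B, and C) should fit in the cache
TILE_SIZE = int(math.sqrt(CACHE_BYTES / (3 * ELEM_BYTES)))   # about 52 here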

Extending this further, there are also tiling schemes that take multi-level caches into account.

Loop unrolling

Reduce branching overheads

  • Overheads of loop control
    • Arithmetic operations for loop indices (e.g., i, j, k)
    • End of loop test (e.g., k < N)
    • Branch prediction
  • Reducing the overheads by loop unrolling
    • Replicate the loop body a number of times
    • Tradeoff between binary size and reduced overheads
  • Before unrolling
for i in range(0, N):
  for j in range(0, N):
    for k in range(0, N):
      C[i][j] += A[i][k] * B[k][j]

e.g., unroll by 4

  • After unrolling
for i in range(0, N):
  for j in range(0, N):
    for k in range(0, N, 4): # step 1->4
      C[i][j] += A[i][k] * B[k][j]
      C[i][j] += A[i][k+1] * B[k+1][j]
      C[i][j] += A[i][k+2] * B[k+2][j]
      C[i][j] += A[i][k+3] * B[k+3][j]
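      # assumes N is a multiple of 4; otherwise a remainder loop handles the leftover iterations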

SIMD (Single Instruction Multiple Data)

A parallel processing paradigm that applies a single instruction to multiple data elements simultaneously. Commonly used in modern processors to exploit data-level parallelism.

  • Key Features:
    • Vector Registers:
      • Specialized registers that can hold and process multiple data elements.
    • Vector Operations:
      • Arithmetic and logical operations that work on entire vectors.
  • With SIMD programming, we can:
    • Increase computational throughput and speed
    • Improve energy efficiency
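As a rough Python illustration (NumPy is an assumption here; it applies SIMD internally rather than exposing vector registers), the scalar inner loop over j can be replaced by a single vector operation per (i, k) pair:

import numpy as np

N = 256
A = np.random.rand(N, N).astype(np.float32)
B = np.random.rand(N, N).astype(np.float32)
C = np.zeros((N, N), dtype=np.float32)

for i in range(N):
  for k in range(N):
    # one vectorized update over all j; NumPy dispatches this to SIMD-enabled kernels
    C[i, :] += A[i, k] * B[k, :]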

Multithreading

A technique that runs multiple threads concurrently within a single process to parallelize the work.
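A minimal sketch with Python's thread pool (the thread count and matrix size are arbitrary; NumPy releases the GIL inside the matrix multiply, so the row blocks can run in parallel):

import numpy as np
from concurrent.futures import ThreadPoolExecutor

N, NUM_THREADS = 512, 4
CHUNK = N // NUM_THREADS
A = np.random.rand(N, N).astype(np.float32)
B = np.random.rand(N, N).astype(np.float32)
C = np.zeros((N, N), dtype=np.float32)

def worker(row_start):
  # each thread fills an independent block of rows of C
  C[row_start:row_start + CHUNK] = A[row_start:row_start + CHUNK] @ B

with ThreadPoolExecutor(max_workers=NUM_THREADS) as pool:
  list(pool.map(worker, range(0, N, CHUNK)))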

CUDA programming

A technique that uses GPUs to accelerate computation.
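A minimal sketch of a matrix-multiply kernel written with Numba's CUDA JIT (assuming the numba package and a CUDA-capable GPU are available; the block and grid sizes are arbitrary):

from numba import cuda
import numpy as np

@cuda.jit
def matmul_kernel(A, B, C):
  i, j = cuda.grid(2)                      # global thread indices
  if i < C.shape[0] and j < C.shape[1]:
    acc = 0.0
    for k in range(A.shape[1]):
      acc += A[i, k] * B[k, j]
    C[i, j] = acc

N = 256
A = np.random.rand(N, N).astype(np.float32)
B = np.random.rand(N, N).astype(np.float32)
C = np.zeros((N, N), dtype=np.float32)

threads_per_block = (16, 16)
blocks_per_grid = ((N + 15) // 16, (N + 15) // 16)
matmul_kernel[blocks_per_grid, threads_per_block](A, B, C)   # arrays are copied to/from the GPU automatically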

Inference Optimizations

  • Image to Column (Im2col) convolution:
    • Rearranges input data to directly utilize matrix multiplication kernels.
  • In-place depth-wise convolution:
    • Reuse the input buffer to write the output data, so as to reduce peak SRAM memory.
  • NHWC for point-wise convolution, and NCHW for depth-wise convolution:
    • Exploit the appropriate data layout for different types of convolution.
  • Winograd convolution:
    • Reduce the number of multiplications to enhance computing speed for convolution.

Image to Column

Im2col is a technique that rearranges the input image into a matrix so that the convolution can be computed as a single General Matrix Multiplication (GEMM) call made up of dot products between filter rows and image columns. A minimal sketch is given after the list below.

  • Pro:
    • Utilize GEMM for convolution
  • Con:
    • Require additional memory space.
    • Implicit GEMM can solve the additional memory problem.
      • It is a variant of direct convolution that operates directly on the input weight and activation tensors. Link
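A minimal NumPy sketch of im2col-based convolution (stride 1, no padding; the function and variable names are my own):

import numpy as np

def im2col(x, kh, kw):
  # rearrange a (C, H, W) input into a (C*kh*kw, out_h*out_w) matrix;
  # each column holds one receptive field
  c, h, w = x.shape
  out_h, out_w = h - kh + 1, w - kw + 1
  cols = np.empty((c * kh * kw, out_h * out_w), dtype=x.dtype)
  for i in range(out_h):
    for j in range(out_w):
      cols[:, i * out_w + j] = x[:, i:i + kh, j:j + kw].reshape(-1)
  return cols

def conv2d_im2col(x, weight):
  # weight: (O, C, kh, kw), x: (C, H, W) -> output: (O, out_h, out_w)
  o, c, kh, kw = weight.shape
  out_h, out_w = x.shape[1] - kh + 1, x.shape[2] - kw + 1
  cols = im2col(x, kh, kw)                      # the extra memory mentioned above
  return (weight.reshape(o, -1) @ cols).reshape(o, out_h, out_w)   # a single GEMM call

x = np.random.rand(3, 8, 8).astype(np.float32)
w = np.random.rand(4, 3, 3, 3).astype(np.float32)
print(conv2d_im2col(x, w).shape)                # (4, 6, 6)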

In-place depth-wise convolution

Inverted Residual Block

  • Many popular neural network models, such as MobileNetV2, use “inverted residual blocks” with depth-wise convolutions, which reduce model size and FLOPs but significantly increase peak memory (by 3-6x).

Reduce peak memory
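Because a depth-wise convolution processes each channel independently, the output of one channel can overwrite that channel of the input buffer as soon as it is computed, so only a single channel-sized temporary buffer is needed instead of a full output tensor. A rough NumPy sketch of the idea (3x3 kernel, stride 1, per-channel zero padding; the names are my own):

import numpy as np

def depthwise_conv3x3_inplace(x, w):
  # x: (C, H, W) activations, w: (C, 3, 3) per-channel filters
  c, h, w_dim = x.shape
  tmp = np.empty((h, w_dim), dtype=x.dtype)     # single temporary channel buffer
  for ch in range(c):                           # channels are independent
    xp = np.pad(x[ch], 1)                       # pad one channel at a time
    tmp[:] = 0
    for di in range(3):
      for dj in range(3):
        tmp += w[ch, di, dj] * xp[di:di + h, dj:dj + w_dim]
    x[ch] = tmp                                 # the output overwrites the input channel
  return x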

NHWC for point-wise convolution, and NCHW for depth-wise convolution
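The intuition, with a small sketch (the tensor shapes are arbitrary): a point-wise (1x1) convolution takes a dot product over the channel dimension at every pixel, so NHWC keeps those channel values contiguous in memory; a depth-wise convolution filters each channel's H x W plane independently, so NCHW keeps that plane contiguous.

import numpy as np

x_nchw = np.random.rand(1, 32, 56, 56).astype(np.float32)      # (N, C, H, W)

# point-wise (1x1) conv: NHWC keeps the C values of one pixel contiguous
x_nhwc = np.ascontiguousarray(x_nchw.transpose(0, 2, 3, 1))    # (N, H, W, C)
pixel_channels = x_nhwc[0, 0, 0]     # contiguous vector of 32 channel values

# depth-wise conv: NCHW keeps one channel's spatial plane contiguous
channel_plane = x_nchw[0, 0]         # contiguous (56, 56) slice for channel 0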

Winograd convolution
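For example, the 1-D Winograd transform F(2, 3) produces two outputs of a 3-tap convolution with 4 multiplications instead of 6 (the factors of 1/2 on the filter side can be precomputed offline). A small sketch to check the identity:

import numpy as np

def winograd_f2_3(d, g):
  # d: 4 inputs, g: 3 filter taps -> 2 outputs, using only 4 multiplications
  m1 = (d[0] - d[2]) * g[0]
  m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
  m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
  m4 = (d[1] - d[3]) * g[2]
  return np.array([m1 + m2 + m3, m2 - m3 - m4])

def direct_conv(d, g):
  # naive 1-D convolution: 6 multiplications for the same 2 outputs
  return np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])

d, g = np.random.rand(4), np.random.rand(3)
print(np.allclose(winograd_f2_3(d, g), direct_conv(d, g)))   # True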

Reference

TinyML and Efficient Deep Learning Computing, MIT 6.5940, Song Han, Fall 2023

Full-Stack, GPU-based Acceleration of Deep Learning

Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better

https://jjeamin.github.io/posts/ModelCompression/
