Efficient Deep Learning Computing


Parallel Computing Techniques

Parallel Computing Techniques to enhance computing speed and reduce memory usage

  • Loop optimization: Optimize locality and reduce branching overhead
    • Loop reordering: Optimizes locality by reordering the sequence of loops.
    • Loop tiling: Reduces memory access by partitioning a loop’s iteration space.
    • Loop unrolling: Reduces branching overhead at the expense of increased binary size.
  • SIMD (single instruction, multiple data) programming:
    • Performs the same operation on multiple data points simultaneously.
  • Multithreading:
    • Concurrent execution of multiple threads within a single process.
  • CUDA programming:
    • Use GPUs to accelerate computation.

Loop optimization

Loop reordering

Improve data locality: with the i-k-j order, the innermost loop walks C[i][j] and B[k][j] along contiguous rows (row-major layout), so consecutive accesses stay within the same cache lines instead of striding down a column of B.

e.g.,

  • Before reordering.
for i in range(0, N):
  for j in range(0, N):
    for k in range(0, N):
       C[i][j] += A[i][k] * B[k][j]
  • After reordering.
for i in range(0, N):
  for k in range(0, N):
    for j in range(0, N):
       C[i][j] += A[i][k] * B[k][j]

Loop tiling

Reduce cache misses

  • Before tiling
for i in range(0, N):
  for k in range(0, N):
    for j in range(0, N):
      C[i][j] += A[i][k] * B[k][j]
  • After tiling
Ti = Tj = Tk = TILE_SIZE
for i_t in range(0, N, Ti):
  for k_t in range(0, N, Tk):
    for j_t in range(0, N, Tj):
      for i in range(i_t, i_t + Ti):
        for k in range(k_t, k_t + Tk):
          for j in range(j_t, j_t + Tj):
            C[i][j] += A[i][k] * B[k][j]

The tile size can be determined according to the cache size.

  • Fitting the accessed data into the cache -> the data gets reused from the cache -> cache misses ↓
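As a rough sizing sketch (the 32 KB L1 cache and float32 element size below are assumptions, not values from the source), the tile can be chosen so that one TILE_SIZE x TILE_SIZE block each of A, B, and C fits in the cache:

import math

CACHE_BYTES = 32 * 1024   # assumed L1 data cache size
ELEM_BYTES = 4            # float32
# three TILE_SIZE x TILE_SIZE tiles (for A, B, and C) should fit in the cache
TILE_SIZE = int(math.sqrt(CACHE_BYTES / (3 * ELEM_BYTES)))   # about 52 here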

Extending this further, there are also tiling schemes that take multi-level caches into account.

Loop unrolling

Reduce branching overheads

  • Overheads of loop control
    • Arithmetic operations for loop indices (e.g., i, j, k)
    • End of loop test (e.g., k < N)
    • Branch prediction
  • Reducing the overheads by loop unrolling
    • Replicate the loop body a number of times
    • Tradeoff between binary size and reduced overheads
  • Before unrolling
for i in range(0, N):
  for j in range(0, N):
    for k in range(0, N):
      C[i][j] += A[i][k] * B[k][j]

e.g., unroll by 4

  • After unrolling
for i in range(0, N):
  for j in range(0, N):
    for k in range(0, N, 4): # step 1->4
      C[i][j] += A[i][k] * B[k][j]
      C[i][j] += A[i][k+1] * B[k+1][j]
      C[i][j] += A[i][k+2] * B[k+2][j]
      C[i][j] += A[i][k+3] * B[k+3][j]
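      # assumes N is a multiple of 4; otherwise a remainder loop handles the leftover iterations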

SIMD (Single Instruction Multiple Data)

A parallel processing paradigm that applies a single instruction to multiple data elements simultaneously. Commonly used in modern processors to exploit data-level parallelism.

  • Key Features:
    • Vector Registers:
      • Specialized registers that can hold and process multiple data elements.
    • Vector Operations:
      • Arithmetic and logical operations that work on entire vectors.
  • With SIMD programming, we can:
    • Increase computational throughput and speed
    • Improve energy efficiency
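As a rough Python illustration (NumPy is an assumption here; it applies SIMD internally rather than exposing vector registers), the scalar inner loop over j can be replaced by a single vector operation per (i, k) pair:

import numpy as np

N = 256
A = np.random.rand(N, N).astype(np.float32)
B = np.random.rand(N, N).astype(np.float32)
C = np.zeros((N, N), dtype=np.float32)

for i in range(N):
  for k in range(N):
    # one vectorized update over all j; NumPy dispatches this to SIMD-enabled kernels
    C[i, :] += A[i, k] * B[k, :]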

Multithreading

A technique that runs multiple threads concurrently within a single process to parallelize the work.
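A minimal sketch with Python's thread pool (the thread count and matrix size are arbitrary; NumPy releases the GIL inside the matrix multiply, so the row blocks can run in parallel):

import numpy as np
from concurrent.futures import ThreadPoolExecutor

N, NUM_THREADS = 512, 4
CHUNK = N // NUM_THREADS
A = np.random.rand(N, N).astype(np.float32)
B = np.random.rand(N, N).astype(np.float32)
C = np.zeros((N, N), dtype=np.float32)

def worker(row_start):
  # each thread fills an independent block of rows of C
  C[row_start:row_start + CHUNK] = A[row_start:row_start + CHUNK] @ B

with ThreadPoolExecutor(max_workers=NUM_THREADS) as pool:
  list(pool.map(worker, range(0, N, CHUNK)))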

CUDA programming

A technique that uses GPUs to accelerate computation.
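A minimal sketch of a matrix-multiply kernel written with Numba's CUDA JIT (assuming the numba package and a CUDA-capable GPU are available; the block and grid sizes are arbitrary):

from numba import cuda
import numpy as np

@cuda.jit
def matmul_kernel(A, B, C):
  i, j = cuda.grid(2)                      # global thread indices
  if i < C.shape[0] and j < C.shape[1]:
    acc = 0.0
    for k in range(A.shape[1]):
      acc += A[i, k] * B[k, j]
    C[i, j] = acc

N = 256
A = np.random.rand(N, N).astype(np.float32)
B = np.random.rand(N, N).astype(np.float32)
C = np.zeros((N, N), dtype=np.float32)

threads_per_block = (16, 16)
blocks_per_grid = ((N + 15) // 16, (N + 15) // 16)
matmul_kernel[blocks_per_grid, threads_per_block](A, B, C)   # arrays are copied to/from the GPU automatically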

Inference Optimizations

  • Image to Column (Im2col) convolution:
    • Rearranges input data to directly utilize matrix multiplication kernels.
  • In-place depth-wise convolution:
    • Reuse the input buffer to write the output data, so as to reduce peak SRAM memory.
  • NHWC for point-wise convolution, and NCHW for depth-wise convolution:
    • Exploit the appropriate data layout for different types of convolution.
  • Winograd convolution:
    • Reduce the number of multiplications to enhance computing speed for convolution.

Image to Column

Im2col is a technique that rearranges the input image into a matrix so that the convolution can be computed as a single General Matrix Multiplication (GEMM) call made up of dot products between filter rows and image columns. A minimal sketch is given after the list below.

  • Pro:
    • Utilize GEMM for convolution
  • Con:
    • Require additional memory space.
    • Implicit GEMM can solve the additional memory problem.
      • It is a variant of direct convolution that operates directly on the input weight and activation tensors. Link
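A minimal NumPy sketch of im2col-based convolution (stride 1, no padding; the function and variable names are my own):

import numpy as np

def im2col(x, kh, kw):
  # rearrange a (C, H, W) input into a (C*kh*kw, out_h*out_w) matrix;
  # each column holds one receptive field
  c, h, w = x.shape
  out_h, out_w = h - kh + 1, w - kw + 1
  cols = np.empty((c * kh * kw, out_h * out_w), dtype=x.dtype)
  for i in range(out_h):
    for j in range(out_w):
      cols[:, i * out_w + j] = x[:, i:i + kh, j:j + kw].reshape(-1)
  return cols

def conv2d_im2col(x, weight):
  # weight: (O, C, kh, kw), x: (C, H, W) -> output: (O, out_h, out_w)
  o, c, kh, kw = weight.shape
  out_h, out_w = x.shape[1] - kh + 1, x.shape[2] - kw + 1
  cols = im2col(x, kh, kw)                      # the extra memory mentioned above
  return (weight.reshape(o, -1) @ cols).reshape(o, out_h, out_w)   # a single GEMM call

x = np.random.rand(3, 8, 8).astype(np.float32)
w = np.random.rand(4, 3, 3, 3).astype(np.float32)
print(conv2d_im2col(x, w).shape)                # (4, 6, 6)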

In-place depth-wise convolution

Inverted Residual Block

  • Many popular neural network models, such as MobileNetV2, use “inverted residual blocks” with depth-wise convolutions, which reduce model size and FLOPs but significantly increase peak memory (by 3-6x).

Reduce peak memory
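Because a depth-wise convolution processes each channel independently, the output of one channel can overwrite that channel of the input buffer as soon as it is computed, so only a single channel-sized temporary buffer is needed instead of a full output tensor. A rough NumPy sketch of the idea (3x3 kernel, stride 1, per-channel zero padding; the names are my own):

import numpy as np

def depthwise_conv3x3_inplace(x, w):
  # x: (C, H, W) activations, w: (C, 3, 3) per-channel filters
  c, h, w_dim = x.shape
  tmp = np.empty((h, w_dim), dtype=x.dtype)     # single temporary channel buffer
  for ch in range(c):                           # channels are independent
    xp = np.pad(x[ch], 1)                       # pad one channel at a time
    tmp[:] = 0
    for di in range(3):
      for dj in range(3):
        tmp += w[ch, di, dj] * xp[di:di + h, dj:dj + w_dim]
    x[ch] = tmp                                 # the output overwrites the input channel
  return x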

NHWC for point-wise convolution, and NCHW for depth-wise convolution
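The intuition, with a small sketch (the tensor shapes are arbitrary): a point-wise (1x1) convolution takes a dot product over the channel dimension at every pixel, so NHWC keeps those channel values contiguous in memory; a depth-wise convolution filters each channel's H x W plane independently, so NCHW keeps that plane contiguous.

import numpy as np

x_nchw = np.random.rand(1, 32, 56, 56).astype(np.float32)      # (N, C, H, W)

# point-wise (1x1) conv: NHWC keeps the C values of one pixel contiguous
x_nhwc = np.ascontiguousarray(x_nchw.transpose(0, 2, 3, 1))    # (N, H, W, C)
pixel_channels = x_nhwc[0, 0, 0]     # contiguous vector of 32 channel values

# depth-wise conv: NCHW keeps one channel's spatial plane contiguous
channel_plane = x_nchw[0, 0]         # contiguous (56, 56) slice for channel 0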

Winograd convolution
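For example, the 1-D Winograd transform F(2, 3) produces two outputs of a 3-tap convolution with 4 multiplications instead of 6 (the factors of 1/2 on the filter side can be precomputed offline). A small sketch to check the identity:

import numpy as np

def winograd_f2_3(d, g):
  # d: 4 inputs, g: 3 filter taps -> 2 outputs, using only 4 multiplications
  m1 = (d[0] - d[2]) * g[0]
  m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
  m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
  m4 = (d[1] - d[3]) * g[2]
  return np.array([m1 + m2 + m3, m2 - m3 - m4])

def direct_conv(d, g):
  # naive 1-D convolution: 6 multiplications for the same 2 outputs
  return np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])

d, g = np.random.rand(4), np.random.rand(3)
print(np.allclose(winograd_f2_3(d, g), direct_conv(d, g)))   # True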

Reference

TinyML and Efficient Deep Learning Computing, MIT 6.5940, Song Han, Fall 2023

Full-Stack, GPU-based Acceleration of Deep Learning

Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better

https://jjeamin.github.io/posts/ModelCompression/
