Quantization

WHAT IS QUANTIZATION?

Quantization refers to techniques for performing both computations and memory accesses at lower precision than the usual 32-bit floating point, typically int8. This enables performance gains in several important areas:

  • 4x reduction in model size
  • 2-4x reduction in memory bandwidth
  • 2-4x faster inference, thanks to the memory-bandwidth savings and faster int8 arithmetic (the exact speedup varies with the hardware, the runtime, and the model)

[Figure: the numeric data types used in modern computing systems]

How is numeric data represented in modern computing systems?

Integer

  • Unsigned Integer
    • n-bit Range: [0, 2^n − 1]
  • Signed Integer
    • Sign-Magnitude Representation
      • n-bit Range: [−(2^(n−1) − 1), 2^(n−1) − 1]
      • Both 000…00 and 100…00 represent 0
    • Two’s Complement Representation
      • n-bit Range: [−2^(n−1), 2^(n−1) − 1]
      • 000…00 represents 0
      • 100…00 represents −2^(n−1)
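The three ranges above can be checked directly. A minimal sketch in plain Python (the helper name `int_ranges` is mine, not from any library):

```python
def int_ranges(n):
    """Return the representable range of an n-bit integer under each
    encoding: unsigned, sign-magnitude, and two's complement."""
    unsigned = (0, 2**n - 1)
    sign_magnitude = (-(2**(n - 1) - 1), 2**(n - 1) - 1)
    twos_complement = (-2**(n - 1), 2**(n - 1) - 1)
    return unsigned, sign_magnitude, twos_complement

# For int8 (n = 8): unsigned is (0, 255), sign-magnitude is
# (-127, 127), and two's complement is (-128, 127).
u, sm, tc = int_ranges(8)
```

Note that two's complement gains one extra negative value because it has a single encoding of zero.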

Fixed-Point Number

[Figure: fixed-point number representation]

Floating-Point Number

[Figure: floating-point number representation]

The basic concept of neural network quantization

[Figure: the basic concept of neural network quantization]

Three common types of neural network quantization

1. K-Means-based Quantization
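K-means-based quantization clusters the weight values into k centroids (a codebook) and stores each weight as a small integer index into that codebook. A minimal 1-D sketch in plain Python; the weight values below are made up for illustration:

```python
import random

def kmeans_quantize(weights, k, iters=20, seed=0):
    """Cluster float weights into k centroids (the codebook) and
    return (codebook, indices); each weight is stored as an index."""
    random.seed(seed)
    centroids = random.sample(list(weights), k)
    for _ in range(iters):
        # Assign each weight to its nearest centroid.
        assign = [min(range(k), key=lambda j: abs(w - centroids[j]))
                  for w in weights]
        # Move each centroid to the mean of its assigned weights.
        for j in range(k):
            members = [w for w, a in zip(weights, assign) if a == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return centroids, assign

weights = [2.09, 2.12, 1.92, 1.87, -0.98, -1.08, 0.0, 0.05]
codebook, indices = kmeans_quantize(weights, k=3)
decoded = [codebook[i] for i in indices]  # reconstructed weights
```

With a 256-entry codebook, each weight shrinks from 32 bits to an 8-bit index, at the cost of an extra codebook lookup at inference time.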

2. Linear Quantization

How should we get the optimal linear quantization parameters (S, Z)?
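One common choice is to map the observed float range [r_min, r_max] onto the integer range [q_min, q_max], which fixes both S and Z. A minimal sketch of this asymmetric (affine) scheme in plain Python, assuming the usual mapping q = round(r / S) + Z:

```python
def linear_quant_params(r_min, r_max, q_min=-128, q_max=127):
    """Derive scale S and zero point Z so the float range
    [r_min, r_max] maps onto the integer range [q_min, q_max]:
        q = round(r / S) + Z,   r ~= S * (q - Z)
    """
    S = (r_max - r_min) / (q_max - q_min)
    Z = round(q_min - r_min / S)
    return S, Z

def quantize(r, S, Z, q_min=-128, q_max=127):
    q = round(r / S) + Z
    return max(q_min, min(q_max, q))  # clamp to the int8 range

S, Z = linear_quant_params(-1.0, 1.0)
q = quantize(0.5, S, Z)
r_hat = S * (q - Z)  # dequantized value, close to 0.5
```

Using the observed min/max is the simplest option; the clipping and rounding topics below refine it, since a few outliers can stretch the range and waste precision on values that rarely occur.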

Post-Training Quantization (PTQ): quantizes an already-trained floating-point model without retraining. The main design questions are quantization granularity, dynamic range clipping, and rounding:

  • Topic I: Quantization Granularity
    • Per-Tensor Quantization
    • Per-Channel Quantization
    • Group Quantization
      • Per-Vector Quantization
      • Shared Micro-exponent (MX) data type
  • Topic II: Dynamic Range Clipping
  • Topic III: Rounding
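The granularity choice matters because a single shared scale must cover the largest-magnitude weight in the whole tensor. A small sketch contrasting per-tensor and per-channel symmetric scales; the weight values are hypothetical:

```python
def symmetric_scale(vals, q_max=127):
    """Symmetric scale: map the largest |w| to the largest int8 value."""
    return max(abs(v) for v in vals) / q_max

# Weight matrix, one row per output channel (made-up values).
W = [[0.9, -0.1, 0.05],     # channel 0: large weights
     [0.01, -0.02, 0.015]]  # channel 1: tiny weights

per_tensor = symmetric_scale([w for row in W for w in row])
per_channel = [symmetric_scale(row) for row in W]
# Channel 1's own scale is much finer than the shared per-tensor
# scale, so per-channel quantization loses far less precision there.
```

Group and per-vector quantization push the same idea further, sharing a scale across even smaller blocks of weights at the cost of storing more scale factors.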

Quantization-Aware Training (QAT): emulates inference-time quantization during training or fine-tuning, so the model learns to compensate for quantization error and recovers accuracy.

  • A full precision copy of the weights is maintained throughout the training.
  • The small gradients are accumulated without loss of precision.
  • Once the model is trained, only the quantized weights are used for inference.
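The bullets above can be sketched as a "fake quantize" forward pass plus a straight-through estimator (STE) update. This is a conceptual sketch in plain Python, not a real training loop; the values and the learning rate are made up:

```python
def fake_quantize(w, scale, q_min=-128, q_max=127):
    """QAT forward pass: quantize then immediately dequantize, so the
    network trains against the rounding error it will see at inference."""
    q = max(q_min, min(q_max, round(w / scale)))
    return q * scale

# Straight-through estimator: in the backward pass, round() is treated
# as the identity, so the gradient computed at the fake-quantized value
# is applied directly to the full-precision master copy of the weight.
w_fp32 = 0.4137               # full-precision master copy
scale = 2.0 / 255             # assumed fixed scale for this sketch
w_q = fake_quantize(w_fp32, scale)  # value used in the forward pass
grad = 0.01                   # hypothetical gradient w.r.t. w_q
w_fp32 -= 0.1 * grad          # STE: update the master copy
```

Keeping the master copy in full precision is what lets the small gradient updates accumulate; after training, only the quantized weights are shipped.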

3. Binary and Ternary Quantization

A related direction is automatic mixed-precision quantization, which assigns different bit widths to different layers rather than using one precision for the whole network.
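Binary and ternary quantization push precision to the extreme: each weight keeps only its sign (and possibly zero), scaled by a single per-tensor magnitude. A minimal sketch in plain Python, using a mean-absolute-value scale as in XNOR-style schemes; the weights and threshold are made up:

```python
def binarize(weights):
    """Binary quantization sketch: keep only the sign of each weight,
    scaled by the mean absolute value to preserve overall magnitude."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    return [alpha if w >= 0 else -alpha for w in weights]

def ternarize(weights, threshold):
    """Ternary quantization sketch: weights near zero become 0,
    the rest become +alpha or -alpha."""
    kept = [abs(w) for w in weights if abs(w) > threshold]
    alpha = sum(kept) / len(kept) if kept else 0.0
    return [0.0 if abs(w) <= threshold else (alpha if w > 0 else -alpha)
            for w in weights]

w = [0.7, -0.3, 0.05, -0.9]
b = binarize(w)                   # two values: +alpha and -alpha
t = ternarize(w, threshold=0.1)   # three values: +alpha, 0, -alpha
```

With binary weights, multiplications collapse into sign flips (or XNOR operations on binary activations), which is where the large compute savings come from.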

Ways to categorize quantization

  • Symmetric Quantization
  • Asymmetric Quantization
  • Granularity
    • e.g., convolutional neural networks
      • Layerwise Quantization
      • Groupwise Quantization
      • Channelwise Quantization
      • Sub-channelwise Quantization
  • Static and Dynamic Quantization
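On the last distinction: static quantization fixes activation scales ahead of time using calibration data, while dynamic quantization measures the activation range at runtime on each input. A minimal sketch of the dynamic case (the function name and batch values are made up):

```python
def dynamic_quant_params(activations, q_max=127):
    """Dynamic quantization sketch: measure the activation range at
    runtime for this input, so no calibration data is needed. Static
    quantization would instead fix the scale ahead of time."""
    a_max = max(abs(a) for a in activations)
    return a_max / q_max  # symmetric scale for this input only

batch1 = [0.2, -0.5, 0.1]
batch2 = [3.0, -0.4, 1.2]
s1 = dynamic_quant_params(batch1)  # scale adapts to each input
s2 = dynamic_quant_params(batch2)
```

Dynamic quantization trades a little runtime overhead (computing the range per input) for robustness to inputs whose activation ranges vary widely.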

References

  • TinyML and Efficient Deep Learning Computing, MIT 6.5940, Fall 2023
  • Machine Learning Systems with TinyML, Section 10.3.4: Quantization
  • Quantization for Neural Networks
  • Quantization Recipe, PyTorch
  • Introduction to Quantization on PyTorch
  • Quantization and Quantization-Aware Training in Deep Learning (in Korean)
  • A Summary of Deep Learning Quantization (in Korean)
