Quantization
Table of contents
- Numeric data types used in modern computing systems
- The basic concept of neural network quantization
- Three common types of neural network quantization
WHAT IS QUANTIZATION?
Quantization refers to techniques for performing both computations and memory accesses with lower-precision data, usually int8, instead of floating point. This enables performance gains in several important areas:
a 4x reduction in model size; a 2-4x reduction in memory bandwidth; and 2-4x faster inference, thanks to the memory-bandwidth savings and faster int8 arithmetic (the exact speedup varies with the hardware, the runtime, and the model).
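The 4x size reduction follows directly from the element widths; a quick numpy check (the array size is arbitrary):

```python
import numpy as np

# A layer with 1M parameters, stored in fp32 vs int8.
weights_fp32 = np.zeros(1_000_000, dtype=np.float32)
weights_int8 = np.zeros(1_000_000, dtype=np.int8)

print(weights_fp32.nbytes // weights_int8.nbytes)  # 4
```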
Numeric data types used in modern computing systems
How is numeric data represented in modern computing systems?
Integer
- Unsigned Integer
- n-bit Range: [0, 2^n − 1]
- Signed Integer
- Sign-Magnitude Representation
- n-bit Range: [−(2^(n−1) − 1), 2^(n−1) − 1]
- Both 000…00 and 100…00 represent 0
- Two’s Complement Representation
- n-bit Range: [−2^(n−1), 2^(n−1) − 1]
- 000…00 represents 0
- 100…00 represents −2^(n−1)
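The ranges above can be checked directly in plain Python (`n` here is an arbitrary bit width):

```python
n = 8  # bit width, e.g., int8

# Two's complement: a single zero, and one extra negative value.
twos_lo, twos_hi = -(2 ** (n - 1)), 2 ** (n - 1) - 1

# Sign-magnitude: symmetric range, but two encodings of zero.
sm_lo, sm_hi = -(2 ** (n - 1) - 1), 2 ** (n - 1) - 1

print(twos_lo, twos_hi)  # -128 127
print(sm_lo, sm_hi)      # -127 127
```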
Fixed-Point Number
Floating-Point Number
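An IEEE 754 float32 stores a sign bit, an 8-bit biased exponent, and a 23-bit mantissa. A small sketch that unpacks these fields (the helper name `fp32_fields` is mine):

```python
import struct

def fp32_fields(x: float):
    """Split an IEEE 754 float32 into its (sign, exponent, mantissa) bit fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # 8 bits, biased by 127
    mantissa = bits & 0x7FFFFF       # 23 bits
    return sign, exponent, mantissa

print(fp32_fields(1.0))   # (0, 127, 0): value = (-1)^0 * 1.0 * 2^(127-127)
```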
The basic concept of neural network quantization
Three common types of neural network quantization
1. K-Means-based Quantization
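K-means-based quantization (as in Deep Compression) clusters the weights into a small codebook and stores only a low-bit index per weight. A minimal 1-D Lloyd's-algorithm sketch (helper name and hyperparameters are illustrative, not a tuned implementation):

```python
import numpy as np

def kmeans_quantize(weights, n_clusters=4, iters=20, seed=0):
    """Cluster weights into a small codebook; each weight is replaced
    by an index into the codebook (stored in log2(n_clusters) bits)."""
    rng = np.random.default_rng(seed)
    flat = weights.ravel()
    centroids = rng.choice(flat, size=n_clusters, replace=False)
    for _ in range(iters):
        # Assign each weight to its nearest centroid.
        idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
        # Move each centroid to the mean of its assigned weights.
        for k in range(n_clusters):
            if np.any(idx == k):
                centroids[k] = flat[idx == k].mean()
    return idx.reshape(weights.shape), centroids

w = np.array([[2.1, -1.0, 0.0], [1.9, -0.9, 0.1]])
idx, codebook = kmeans_quantize(w, n_clusters=3)
w_hat = codebook[idx]  # dequantize: look up each index in the codebook
```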
2. Linear Quantization
How should we get the optimal linear quantization parameters (S, Z)?
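One standard answer for asymmetric linear quantization, r ≈ S·(q − Z): choose the scale S so the real range [r_min, r_max] covers the integer range [q_min, q_max], and the zero point Z so that r_min maps to q_min. A minimal sketch (helper names are mine):

```python
import numpy as np

def linear_quant_params(r_min, r_max, n_bits=8):
    """Asymmetric linear quantization: r ~= S * (q - Z).
    S (scale) stretches the integer grid over the real range;
    Z (zero point) is the integer that represents real 0."""
    q_min, q_max = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    S = (r_max - r_min) / (q_max - q_min)
    Z = int(round(q_min - r_min / S))
    return S, Z

def quantize(r, S, Z, n_bits=8):
    q = np.round(r / S + Z)
    return np.clip(q, -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1).astype(np.int8)

S, Z = linear_quant_params(-1.0, 1.0)
q = quantize(np.array([-1.0, 0.0, 1.0]), S, Z)
r_hat = S * (q.astype(np.float32) - Z)  # dequantize
```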
Post-Training Quantization (PTQ): quantizes a trained floating-point neural network model without any retraining. Key topics include quantization granularity (per-channel and group quantization), dynamic range clipping, and rounding.
- Topic I: Quantization Granularity
- Per-Tensor Quantization
- Per-Channel Quantization
- Group Quantization
- Per-Vector Quantization
- Shared Micro-exponent (MX) data type
- Topic II: Dynamic Range Clipping
- Topic III: Rounding
Quantization-Aware Training (QAT): emulates inference-time quantization during training/fine-tuning so the model can recover the lost accuracy; the model is trained with quantization taken into account.
- A full precision copy of the weights is maintained throughout the training.
- The small gradients are accumulated without loss of precision.
- Once the model is trained, only the quantized weights are used for inference.
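The three points above can be sketched with a fake-quantization forward pass and straight-through-estimator-style updates to an fp32 master copy (a toy numpy sketch with a fixed gradient, not a real training loop):

```python
import numpy as np

def fake_quant(w, n_bits=8):
    """Quantize-then-dequantize (symmetric, per-tensor): what QAT inserts
    into the forward pass so training 'sees' the quantization error."""
    q_max = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / q_max
    return np.round(w / scale).clip(-q_max - 1, q_max) * scale

# Master weights stay in full precision; small gradient updates accumulate
# there without loss (straight-through estimator: d fake_quant / d w ~= 1).
w_master = np.array([0.51, -0.24, 0.98])
lr, grad = 0.1, np.array([0.1, -0.2, 0.05])
for _ in range(3):
    w_q = fake_quant(w_master)   # forward pass uses quantized weights
    w_master -= lr * grad        # backward pass updates the fp32 copy
```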
3. Binary and Ternary Quantization
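For the binary case, one common scheme (XNOR-Net-style) replaces the weights with α·sign(w), where the scaling factor α = mean(|w|) minimizes the L2 quantization error; a minimal sketch:

```python
import numpy as np

def binarize(w):
    """XNOR-Net-style binarization: weights become alpha * sign(w),
    with alpha = mean(|w|) as the per-tensor scaling factor."""
    alpha = np.abs(w).mean()
    return alpha * np.where(w >= 0, 1.0, -1.0)

w = np.array([0.4, -0.2, 0.6, -0.8])
print(binarize(w))  # [ 0.5 -0.5  0.5 -0.5]
```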
Beyond these, automatic mixed-precision quantization assigns a different bit-width to each layer, typically found by search rather than by hand.
Quantization: a summary taxonomy
- Symmetric Quantization
- Asymmetric Quantization
- Granularity
- e.g., convolutional neural networks
- Layerwise Quantization
- Groupwise Quantization
- Channelwise Quantization
- Sub-channelwise Quantization
- Static and Dynamic Quantization
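The granularity distinction above can be made concrete: layerwise (per-tensor) quantization uses one scale for the whole weight tensor, while channelwise uses one scale per output channel, which tolerates channels with very different weight ranges. A minimal symmetric-int8 sketch (the helper name is mine):

```python
import numpy as np

def symmetric_scales(w, per_channel=False):
    """Per-tensor: one scale for the whole tensor.
    Per-channel: one scale per output channel (axis 0)."""
    q_max = 127  # symmetric int8
    if per_channel:
        return np.abs(w).max(axis=1) / q_max   # shape: (out_channels,)
    return np.abs(w).max() / q_max             # scalar

w = np.array([[0.1, -0.05], [5.0, -4.0]])      # channel 1's range dwarfs channel 0's
print(symmetric_scales(w))                     # one coarse per-tensor scale
print(symmetric_scales(w, per_channel=True))   # a tight scale per channel
```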
References
- TinyML and Efficient Deep Learning Computing, MIT 6.5940, Fall 2023
- 10.3.4 Quantization, Machine Learning Systems (tinyml)
- Quantization for Neural Networks
- Quantization Recipe, PyTorch
- Introduction to Quantization on PyTorch
- Deep Learning Quantization and Quantization-Aware Training (in Korean)
- A Summary of Deep Learning Quantization (in Korean)