Quantization

WHAT IS QUANTIZATION?

Quantization refers to techniques for performing both computations and memory accesses at lower precision than the usual 32-bit floating point, typically int8. This enables performance gains in several important areas:

  • 4x reduction in model size
  • 2-4x reduction in memory bandwidth
  • 2-4x faster inference, thanks to the memory-bandwidth savings and faster int8 arithmetic (the exact speedup varies with the hardware, the runtime, and the model)

[Figure: the numeric data types used in modern computing systems]

How is numeric data represented in modern computing systems?

Integer

  • Unsigned Integer
    • n-bit Range: [0, 2^n − 1]
  • Signed Integer
    • Sign-Magnitude Representation
      • n-bit Range: [−(2^(n−1) − 1), 2^(n−1) − 1]
      • Both 000…00 and 100…00 represent 0
    • Two’s Complement Representation
      • n-bit Range: [−2^(n−1), 2^(n−1) − 1]
      • 000…00 represents 0
      • 100…00 represents −2^(n−1)
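The three ranges above can be checked directly. A minimal sketch in plain Python (the helper name `int_ranges` is mine, not from any library):

```python
def int_ranges(n):
    """Return the representable range of an n-bit integer under each
    encoding: unsigned, sign-magnitude, and two's complement."""
    unsigned = (0, 2**n - 1)
    sign_magnitude = (-(2**(n - 1) - 1), 2**(n - 1) - 1)
    twos_complement = (-2**(n - 1), 2**(n - 1) - 1)
    return unsigned, sign_magnitude, twos_complement

# For int8 (n = 8): unsigned is (0, 255), sign-magnitude is
# (-127, 127), and two's complement is (-128, 127).
u, sm, tc = int_ranges(8)
```

Note that two's complement gains one extra negative value because it has a single encoding of zero.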

Fixed-Point Number

[Figure: fixed-point number representation]

Floating-Point Number

[Figure: floating-point number representation]

The basic concept of neural network quantization

[Figure: the basic concept of neural network quantization]

Three common types of neural network quantization

1. K-Means-based Quantization
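K-means-based quantization clusters the weight values into k centroids (a codebook) and stores each weight as a small integer index into that codebook. A minimal 1-D sketch in plain Python; the weight values below are made up for illustration:

```python
import random

def kmeans_quantize(weights, k, iters=20, seed=0):
    """Cluster float weights into k centroids (the codebook) and
    return (codebook, indices); each weight is stored as an index."""
    random.seed(seed)
    centroids = random.sample(list(weights), k)
    for _ in range(iters):
        # Assign each weight to its nearest centroid.
        assign = [min(range(k), key=lambda j: abs(w - centroids[j]))
                  for w in weights]
        # Move each centroid to the mean of its assigned weights.
        for j in range(k):
            members = [w for w, a in zip(weights, assign) if a == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return centroids, assign

weights = [2.09, 2.12, 1.92, 1.87, -0.98, -1.08, 0.0, 0.05]
codebook, indices = kmeans_quantize(weights, k=3)
decoded = [codebook[i] for i in indices]  # reconstructed weights
```

With a 256-entry codebook, each weight shrinks from 32 bits to an 8-bit index, at the cost of an extra codebook lookup at inference time.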

2. Linear Quantization

How should we get the optimal linear quantization parameters (S, Z)?
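One common choice is to map the observed float range [r_min, r_max] onto the integer range [q_min, q_max], which fixes both S and Z. A minimal sketch of this asymmetric (affine) scheme in plain Python, assuming the usual mapping q = round(r / S) + Z:

```python
def linear_quant_params(r_min, r_max, q_min=-128, q_max=127):
    """Derive scale S and zero point Z so the float range
    [r_min, r_max] maps onto the integer range [q_min, q_max]:
        q = round(r / S) + Z,   r ~= S * (q - Z)
    """
    S = (r_max - r_min) / (q_max - q_min)
    Z = round(q_min - r_min / S)
    return S, Z

def quantize(r, S, Z, q_min=-128, q_max=127):
    q = round(r / S) + Z
    return max(q_min, min(q_max, q))  # clamp to the int8 range

S, Z = linear_quant_params(-1.0, 1.0)
q = quantize(0.5, S, Z)
r_hat = S * (q - Z)  # dequantized value, close to 0.5
```

Using the observed min/max is the simplest option; the clipping and rounding topics below refine it, since a few outliers can stretch the range and waste precision on values that rarely occur.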

Post-Training Quantization (PTQ): quantizes an already-trained floating-point model without retraining. The main design questions are quantization granularity, dynamic range clipping, and rounding:

  • Topic I: Quantization Granularity
    • Per-Tensor Quantization
    • Per-Channel Quantization
    • Group Quantization
      • Per-Vector Quantization
      • Shared Micro-exponent (MX) data type
  • Topic II: Dynamic Range Clipping
  • Topic III: Rounding
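The granularity choice matters because a single shared scale must cover the largest-magnitude weight in the whole tensor. A small sketch contrasting per-tensor and per-channel symmetric scales; the weight values are hypothetical:

```python
def symmetric_scale(vals, q_max=127):
    """Symmetric scale: map the largest |w| to the largest int8 value."""
    return max(abs(v) for v in vals) / q_max

# Weight matrix, one row per output channel (made-up values).
W = [[0.9, -0.1, 0.05],     # channel 0: large weights
     [0.01, -0.02, 0.015]]  # channel 1: tiny weights

per_tensor = symmetric_scale([w for row in W for w in row])
per_channel = [symmetric_scale(row) for row in W]
# Channel 1's own scale is much finer than the shared per-tensor
# scale, so per-channel quantization loses far less precision there.
```

Group and per-vector quantization push the same idea further, sharing a scale across even smaller blocks of weights at the cost of storing more scale factors.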

Quantization-Aware Training (QAT): emulates inference-time quantization during training or fine-tuning, so the model learns to compensate for quantization error and recovers accuracy.

  • A full precision copy of the weights is maintained throughout the training.
  • The small gradients are accumulated without loss of precision.
  • Once the model is trained, only the quantized weights are used for inference.
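The bullets above can be sketched as a "fake quantize" forward pass plus a straight-through estimator (STE) update. This is a conceptual sketch in plain Python, not a real training loop; the values and the learning rate are made up:

```python
def fake_quantize(w, scale, q_min=-128, q_max=127):
    """QAT forward pass: quantize then immediately dequantize, so the
    network trains against the rounding error it will see at inference."""
    q = max(q_min, min(q_max, round(w / scale)))
    return q * scale

# Straight-through estimator: in the backward pass, round() is treated
# as the identity, so the gradient computed at the fake-quantized value
# is applied directly to the full-precision master copy of the weight.
w_fp32 = 0.4137               # full-precision master copy
scale = 2.0 / 255             # assumed fixed scale for this sketch
w_q = fake_quantize(w_fp32, scale)  # value used in the forward pass
grad = 0.01                   # hypothetical gradient w.r.t. w_q
w_fp32 -= 0.1 * grad          # STE: update the master copy
```

Keeping the master copy in full precision is what lets the small gradient updates accumulate; after training, only the quantized weights are shipped.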

3. Binary and Ternary Quantization

A related direction is automatic mixed-precision quantization, which assigns different bit widths to different layers rather than using one precision for the whole network.
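Binary and ternary quantization push precision to the extreme: each weight keeps only its sign (and possibly zero), scaled by a single per-tensor magnitude. A minimal sketch in plain Python, using a mean-absolute-value scale as in XNOR-style schemes; the weights and threshold are made up:

```python
def binarize(weights):
    """Binary quantization sketch: keep only the sign of each weight,
    scaled by the mean absolute value to preserve overall magnitude."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    return [alpha if w >= 0 else -alpha for w in weights]

def ternarize(weights, threshold):
    """Ternary quantization sketch: weights near zero become 0,
    the rest become +alpha or -alpha."""
    kept = [abs(w) for w in weights if abs(w) > threshold]
    alpha = sum(kept) / len(kept) if kept else 0.0
    return [0.0 if abs(w) <= threshold else (alpha if w > 0 else -alpha)
            for w in weights]

w = [0.7, -0.3, 0.05, -0.9]
b = binarize(w)                   # two values: +alpha and -alpha
t = ternarize(w, threshold=0.1)   # three values: +alpha, 0, -alpha
```

With binary weights, multiplications collapse into sign flips (or XNOR operations on binary activations), which is where the large compute savings come from.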

Ways to categorize quantization

  • Symmetric Quantization
  • Asymmetric Quantization
  • Granularity
    • e.g., convolutional neural networks
      • Layerwise Quantization
      • Groupwise Quantization
      • Channelwise Quantization
      • Sub-channelwise Quantization
  • Static and Dynamic Quantization
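On the last distinction: static quantization fixes activation scales ahead of time using calibration data, while dynamic quantization measures the activation range at runtime on each input. A minimal sketch of the dynamic case (the function name and batch values are made up):

```python
def dynamic_quant_params(activations, q_max=127):
    """Dynamic quantization sketch: measure the activation range at
    runtime for this input, so no calibration data is needed. Static
    quantization would instead fix the scale ahead of time."""
    a_max = max(abs(a) for a in activations)
    return a_max / q_max  # symmetric scale for this input only

batch1 = [0.2, -0.5, 0.1]
batch2 = [3.0, -0.4, 1.2]
s1 = dynamic_quant_params(batch1)  # scale adapts to each input
s2 = dynamic_quant_params(batch2)
```

Dynamic quantization trades a little runtime overhead (computing the range per input) for robustness to inputs whose activation ranges vary widely.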

References

  • TinyML and Efficient Deep Learning Computing, MIT 6.5940, Fall 2023
  • Machine Learning Systems with TinyML, Section 10.3.4: Quantization
  • Quantization for Neural Networks
  • Quantization Recipe, PyTorch
  • Introduction to Quantization on PyTorch
  • Quantization and Quantization-Aware Training in Deep Learning (in Korean)
  • A Summary of Deep Learning Quantization (in Korean)
