In recent years, there has been an explosion of ML applications to real-world problems. The success of these models is largely driven by overparametrization: generally, the more parameters a model has, the better it performs. The issue arises at inference time, when all of those parameters have to be loaded into memory to obtain predictions. For example, the smallest LLaMA model in float32 takes about 28 GB just to be stored on disk.
Thus, it makes sense to explore approaches to compressing a model while keeping its performance at an acceptable level. One such approach is quantization, which converts the parameters of the network into a lower-precision representation.
In this topic, we will look at the main ideas behind quantization and how it can be applied in practice.
The setup
Quantization refers to the process of reducing the precision of weights and activations from floating-point numbers (e.g., 32-bit) to lower-precision data types, such as 8-bit or even lower. This compression results in models that require less memory and benefit from faster matrix arithmetic. The two most common conversions are float32 -> float16 and float32 -> int8.
The quantized type is the data type in which the tensors are stored after quantization (e.g., float16 or int8). Another type is the accumulation (or computation) data type: the data type in which the results of additions and multiplications of the quantized tensors are stored. It typically has a higher precision, because, for example, adding two int8 values can produce a result outside the int8 range, which would otherwise lead to a significant precision loss. The lower (quantized) types and their corresponding accumulation (computation) types are typically given as:
float16, accumulation data type float16
bfloat16, accumulation data type float32
int16, accumulation data type int32
int8, accumulation data type int32
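As a quick illustration of why the accumulation type matters, here is a small NumPy sketch (the specific values are arbitrary) showing int8 addition wrapping around, while an int32 accumulator keeps the exact result:

```python
import numpy as np

# Two int8 tensors with values near the top of the int8 range [-128, 127].
a = np.array([100, 120], dtype=np.int8)
b = np.array([100, 120], dtype=np.int8)

# Adding directly in int8 overflows and wraps around.
print(a + b)                                      # [-56, -16]  (200 and 240 do not fit)

# Accumulating in int32 keeps the exact result, which is why int8 kernels
# typically use an int32 accumulator.
print(a.astype(np.int32) + b.astype(np.int32))    # [200, 240]
```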
The specific choice usually comes down to the hardware and related requirements (for example, some embedded devices can only perform operations on integers).
Quantization can also be categorized by when it happens: post-training quantization and quantization-aware training. The latter is more complicated to implement, but it tends to result in better performance.
Symmetric and asymmetric quantization
In this section, we will introduce the two main modes: asymmetric and symmetric quantization. It's worth mentioning that other schemes, such as k-means-based quantization and factorization-based quantization, have been the subject of research, but the symmetric and asymmetric modes are the most widespread settings at this point.
The process of the asymmetric and symmetric quantization is illustrated below:
Asymmetric quantization maps the floating point numbers from $[\alpha, \beta]$ into $[0, 2^b - 1]$, where $b$ is the number of bits in the quantized version (for example, if $b = 8$, the floating point numbers will be represented in the $[0, 255]$ range), and $\alpha$ and $\beta$ are the minimum and the maximum floating point numbers in the original tensor.
The transformation of the asymmetric quantization is given as follows:

$$x_q = \mathrm{clip}\left(\mathrm{round}\left(\frac{x}{s}\right) + z,\ 0,\ 2^b - 1\right),$$

where

$x$ is the original floating point number;

$s$ is the quantization scale parameter, defined as $s = \frac{\beta - \alpha}{2^b - 1}$;

$z$ is the zero point (the quantized value that corresponds to the real value 0), defined as $z = \mathrm{round}\left(-\frac{\alpha}{s}\right)$, and is typically stored in the quantized type.

The de-quantization for asymmetric quantization is given as

$$x \approx s \cdot (x_q - z).$$
Asymmetric quantization is particularly suitable for asymmetric distributions (for example, the ReLU activation).
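To make the formulas concrete, below is a minimal NumPy sketch of asymmetric quantization to uint8 and back; the function names and the test tensor are illustrative, and it assumes the tensor has a non-degenerate range (max > min):

```python
import numpy as np

def quantize_asymmetric(x, bits=8):
    """Asymmetric (affine) quantization of a float tensor to unsigned integers."""
    alpha, beta = x.min(), x.max()                 # range of the original tensor
    s = (beta - alpha) / (2 ** bits - 1)           # scale
    z = np.round(-alpha / s).astype(np.int32)      # zero point: quantized value of 0.0
    x_q = np.clip(np.round(x / s) + z, 0, 2 ** bits - 1).astype(np.uint8)
    return x_q, s, z

def dequantize_asymmetric(x_q, s, z):
    """Map the quantized values back to (approximate) floats."""
    return s * (x_q.astype(np.float32) - z)

x = np.random.uniform(-1.0, 3.0, size=10).astype(np.float32)
x_q, s, z = quantize_asymmetric(x)
x_hat = dequantize_asymmetric(x_q, s, z)
print(np.abs(x - x_hat).max())   # maximum quantization error, on the order of the scale s
```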
Symmetric quantization maps a series of floating point numbers from the range $[-\alpha, \alpha]$ into $[-(2^{b-1} - 1),\ 2^{b-1} - 1]$ (again, if the final representation has 8 bits, the floating point numbers will be mapped into $[-127, 127]$). $\alpha$ here corresponds to the highest absolute value in the input. The transformation looks like

$$x_q = \mathrm{clip}\left(\mathrm{round}\left(\frac{x}{s}\right),\ -(2^{b-1} - 1),\ 2^{b-1} - 1\right),$$
where

$s = \frac{\alpha}{2^{b-1} - 1}$ is the scale parameter. There is no custom zero point mapping in symmetric quantization. The de-quantization in the symmetric case is simply

$$x \approx s \cdot x_q.$$
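A matching NumPy sketch for the symmetric case, quantizing to int8 with no zero point (again, the names and the test tensor are just for illustration):

```python
import numpy as np

def quantize_symmetric(x, bits=8):
    """Symmetric quantization of a float tensor to signed integers."""
    alpha = np.abs(x).max()                 # highest absolute value in the tensor
    q_max = 2 ** (bits - 1) - 1             # 127 for 8 bits
    s = alpha / q_max                       # scale; no zero point is needed
    x_q = np.clip(np.round(x / s), -q_max, q_max).astype(np.int8)
    return x_q, s

def dequantize_symmetric(x_q, s):
    return s * x_q.astype(np.float32)

w = np.random.randn(10).astype(np.float32)
w_q, s = quantize_symmetric(w)
print(dequantize_symmetric(w_q, s) - w)     # per-element quantization error
```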
Symmetric quantization is simpler and more computationally efficient than asymmetric quantization, but it doesn't handle skewed data distributions (such as activations) as well and in general results in lower accuracy. The two types can be combined: the activations and the input are quantized in the asymmetric setting, while the weights are quantized in the symmetric one.
The range: setting the $\alpha$ and $\beta$
Up until this point, we have defined $\alpha$ and $\beta$ as the minimum and maximum in the asymmetric case and $\alpha$ as the highest absolute value in the symmetric case, but range setting is not quite as straightforward in practice, and there are various range-setting schemes available. The process of setting the range is often referred to as calibration. Calibration for post-training quantization can be done in two ways: dynamic (the ranges for the activations are computed at runtime) and static (the range for each activation is computed at quantization time, typically by passing data through the model and recording the activation values). The ranges for the weights are known ahead of time, since the weights are fixed after training.
Post-training quantization with static calibration proceeds as follows: observers, the components that record the values of the activations, are attached to the model, and calibration samples are passed through it. Usually around 200 unlabeled samples are used for calibration, and they should be representative of the data (otherwise, the quality will degrade).
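As a rough illustration of what an observer does, here is a toy running min/max observer written in NumPy; real frameworks ship their own observer implementations, and the ReLU-like random data below merely stands in for a layer's activations:

```python
import numpy as np

class MinMaxObserver:
    """Records the running min/max of the activations seen during calibration."""
    def __init__(self):
        self.alpha = np.inf     # running minimum
        self.beta = -np.inf     # running maximum

    def observe(self, activations):
        self.alpha = min(self.alpha, float(activations.min()))
        self.beta = max(self.beta, float(activations.max()))

    def compute_range(self):
        return self.alpha, self.beta

observer = MinMaxObserver()
for _ in range(200):                                  # ~200 representative calibration samples
    activations = np.maximum(np.random.randn(64), 0)  # stand-in for a layer's ReLU output
    observer.observe(activations)

print(observer.compute_range())   # the (alpha, beta) used to derive s and z for this layer
```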
There are multiple schemes for choosing the range values in static-calibrated post-training quantization and in quantization-aware training. Let's elaborate on the already mentioned min-max. Formally, to cover the full range of values, we set $\alpha = \min(V)$ and $\beta = \max(V)$, where $V$ is the original tensor. Min-max range setting is very sensitive to outliers: a single extreme value stretches the range and will introduce a large error during de-quantization.
There is a better approach: setting the range to a high percentile of the distribution of $V$, which clips the most extreme values and significantly reduces the outlier sensitivity compared to the min-max range setting.
Another approach is to use the mean-squared error, where the objective is to minimize the MSE between the full-precision and the quantized tensors; the search is usually performed with grid search or a similar technique. Cross-entropy is somewhat similar to MSE, but instead of minimizing the MSE between the full-precision and the quantized version, as the name suggests, the cross-entropy is minimized.
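Here is a hedged sketch of what an MSE-based range search might look like in the symmetric case: a simple grid search over candidate clipping thresholds that keeps the one with the smallest round-trip error. The grid resolution and the synthetic data with outliers are arbitrary choices for illustration.

```python
import numpy as np

def mse_range_search(x, bits=8, n_candidates=100):
    """Pick the symmetric clipping range alpha that minimizes the MSE
    between x and its quantize -> de-quantize round trip (simple grid search)."""
    q_max = 2 ** (bits - 1) - 1
    best_alpha, best_mse = None, np.inf
    for frac in np.linspace(0.1, 1.0, n_candidates):
        alpha = frac * np.abs(x).max()              # candidate clipping threshold
        s = alpha / q_max
        x_q = np.clip(np.round(x / s), -q_max, q_max)
        mse = np.mean((x - s * x_q) ** 2)
        if mse < best_mse:
            best_alpha, best_mse = alpha, mse
    return best_alpha

x = np.random.randn(10_000).astype(np.float32)
x[:5] *= 50                                     # a few outliers
print(mse_range_search(x), np.abs(x).max())     # the MSE-optimal alpha sits well below the min-max alpha
```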
Quantization granularity
The next issue is how the quantization parameters, namely the scale $s$ and the zero point $z$, are shared across the network; this is referred to as quantization granularity. If a single pair of $(s, z)$ is chosen universally for the whole network, performance will degrade. Mainly, the parameters can be chosen either on a per-tensor basis, where each tensor has a single pair of $(s, z)$ values associated with it, or on a per-channel basis, where a separate pair of $(s, z)$ is attached to each slice along a single dimension (channel) of the tensor. The per-channel mode results in higher accuracy, but consumes more memory than the per-tensor mode.
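The effect of granularity can be seen in a small NumPy experiment; the weight matrix and the artificially widened channel below are made up for illustration:

```python
import numpy as np

w = np.random.randn(4, 8).astype(np.float32)    # e.g., a weight matrix with 4 output channels
w[0] *= 20                                      # one channel with a much wider range
q_max = 127

# Per-tensor: a single scale shared by every element.
s_tensor = np.abs(w).max() / q_max

# Per-channel: one scale per output channel (per row here; axis=1 reduces over the inputs).
s_channel = np.abs(w).max(axis=1, keepdims=True) / q_max

def roundtrip(w, s):
    """Symmetric quantize -> de-quantize with the given scale(s)."""
    return s * np.clip(np.round(w / s), -q_max, q_max)

print(np.mean((w - roundtrip(w, s_tensor)) ** 2))   # larger error: the wide channel dominates s
print(np.mean((w - roundtrip(w, s_channel)) ** 2))  # smaller error with per-channel scales
```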
Quantization-aware training (QAT)
Quantization-aware training refers to the setting where fake quantization modules are inserted into the computation graph during the training. The process can be roughly outlined as follows:
During the forward pass of training, the weights and activations are quantized to the low-precision types using simulated quantization operations. This typically involves:
Quantizing the weights using asymmetric or symmetric quantization;
Quantizing the activations using the previously mentioned settings for range selection (MSE-based, percentile-based, etc.).
In the forward pass, stochastic rounding is used, where instead of deterministically rounding the quantized values to the nearest integer, for each real-valued input, the fractional part is used to determine the probability of rounding up or down. Then, a random number is generated, and if it is less than the fractional part, the quantized value is rounded up; otherwise, it's rounded down. This is mainly done to reduce the quantization bias (see the first sketch after this list).
Since the quantization operations are non-differentiable, a straight-through estimator (STE) is used during the backward pass to approximate the gradients. The STE passes the gradients from the output of the quantization operation directly to its input, ignoring the quantization operation itself (see the second sketch after this list).
The quantization parameters (scale and zero-point) are treated as learnable parameters and updated during training using gradient descent, along with the model weights.
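A minimal sketch of stochastic rounding, assuming nothing beyond NumPy; it shows that the expected value of the rounded output matches the input, unlike deterministic rounding:

```python
import numpy as np

def stochastic_round(x, rng=np.random.default_rng()):
    """Round up with probability equal to the fractional part, down otherwise."""
    floor = np.floor(x)
    frac = x - floor                       # fractional part in [0, 1)
    return floor + (rng.random(x.shape) < frac)

x = np.full(100_000, 0.3)
print(stochastic_round(x).mean())          # ~0.3 on average, i.e. unbiased
print(np.round(x).mean())                  # deterministic rounding always gives 0.0
```

And here is a hedged PyTorch sketch of a fake-quantization op with an STE backward pass. The class name is made up, and for simplicity only the input receives a gradient here; in practice the scale and zero point can be given gradients as well, as described above.

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Fake-quantize in the forward pass; pass the gradient straight through in the backward pass."""

    @staticmethod
    def forward(ctx, x, s, z):
        # Quantize to an int8-like range and immediately de-quantize ("fake" quantization).
        x_q = torch.clamp(torch.round(x / s) + z, 0, 255)
        return s * (x_q - z)

    @staticmethod
    def backward(ctx, grad_output):
        # round() has zero gradient almost everywhere, so the STE simply
        # forwards the incoming gradient to x; s and z get no gradient in this sketch.
        return grad_output, None, None

x = torch.randn(5, requires_grad=True)
y = FakeQuantSTE.apply(x, 0.1, 128)
y.sum().backward()
print(x.grad)       # all ones: the gradient passed straight through the rounding and clipping
```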
Quantization-aware training tends to give better accuracy and be more robust than post-training quantization.
Conclusion
This topic can be summarized as follows:
Quantization is the process of reducing the precision of the weights and activations of a network to speed up inference and reduce memory requirements;
Two main modes of quantization in terms of time are post-training quantization and quantization-aware training, where the latter tends to result in better performance;
Symmetric and asymmetric quantization determine the scheme of quantization, with asymmetric quantization resulting in higher accuracy and being more suitable for asymmetric data distributions (such as the outputs of activations like ReLU).