To understand quantization, it is important to first understand numerical data types and the number of bits used to store each of them.
In a normal use case, without any quantization in the picture, we feed an input to a deep learning model and get a result back out. This simple sentence boils down to a long sequence of vector math operations carried out at the binary level by the eventual processing engine.
In most of these cases, the default data type for these operations (in PyTorch) is the 32-bit floating point number (fp32), which, as the name states, takes 32 bits to store a single value. In quantization, we convert these memory-consuming fp32 values into 8-bit integers (int8), so the math moves from 32-bit fp32 numbers to 8-bit int8 numbers. Since int8 uses a quarter of the bits of fp32, model inference done in int8 is, by a naive estimate, up to four times faster.
Floating point numbers need a specification because operating on and storing unbounded numbers is complicated. Integer numbers like 1, -12, or 42 are comparatively simple. An int32, for example, has 1 bit reserved for the sign and 31 bits for the magnitude. That means it can store 2^32 = 4294967296 total values, ranging from -2^31 to 2^31 - 1. The same logic holds for an int8: this type holds 2^8 = 256 total values in the range -2^7 = -128 through 2^7 - 1 = 127.
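As a quick sanity check, the minimal PyTorch snippet below (illustrative only, not a benchmark) prints the per-element storage of the two dtypes and the value ranges of the integer types discussed above:

```python
import torch

# Bytes needed to store one element of each dtype.
print(torch.tensor(0.0).element_size())                   # fp32: 4 bytes (32 bits)
print(torch.tensor(0, dtype=torch.int8).element_size())   # int8: 1 byte (8 bits)

# Value ranges of the integer dtypes.
print(torch.iinfo(torch.int32).min, torch.iinfo(torch.int32).max)  # -2147483648 2147483647
print(torch.iinfo(torch.int8).min, torch.iinfo(torch.int8).max)    # -128 127
```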
Quantization works by mapping the (many) values representable in fp32 onto the (just 256) values representable in int8. This is done by binning the values: mapping ranges of values in the fp32 space onto individual int8 values. For example, two weight constants 1.2251 and 1.6125 in fp32 might both be converted to 12 in int8, because they both fall in the bin [1, 2]. Picking the right bins is obviously very important.
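To make the binning idea concrete, here is a toy sketch; the bin edges below are arbitrary and only mirror the [1, 2] example above, whereas real quantizers derive their (much finer) bins from the observed data range:

```python
import torch

# Toy illustration of binning: every fp32 value that falls in the same bin
# is mapped to the same int8 code.
bin_edges = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
weights = torch.tensor([1.2251, 1.6125, -0.4, 2.7])

codes = torch.bucketize(weights, bin_edges).to(torch.int8)
print(codes)  # 1.2251 and 1.6125 fall in the same bin, so they share a code
```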
In layman's terms, quantization maps input values from a much larger, often continuous set onto a comparatively smaller, finite set. Rounding and truncation are simple, high-level examples of this process.
The initial idea behind quantization came from lossless coding theory: using the same number of bits for every event is wasteful when the events of interest have non-uniform probabilities. A more optimal approach varies the number of bits based on the probability of a particular event occurring; in modern terms this is variable-rate quantization, and Huffman coding is a classic example of the approach.
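As a small illustration of the variable-rate idea, the simplified sketch below (written only for intuition, not an optimized encoder) computes Huffman code lengths so that more probable symbols receive fewer bits:

```python
import heapq

def huffman_code_lengths(symbol_probs):
    """Simplified Huffman construction: returns the code length (in bits)
    assigned to each symbol; more probable symbols get shorter codes."""
    # Heap items: (probability, tie-breaker, {symbol: code_length_so_far})
    heap = [(p, i, {s: 0}) for i, (s, p) in enumerate(symbol_probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)
        p2, _, right = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every symbol inside them.
        merged = {s: depth + 1 for s, depth in {**left, **right}.items()}
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

print(huffman_code_lengths({"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.10}))
# {'a': 1, 'b': 2, 'c': 3, 'd': 3}: frequent symbols use fewer bits
```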
Challenges with Quantization in Neural Nets
The first issue is that training and inference with neural networks are computationally expensive tasks, so an efficient representation of numerical values is of high importance.
Secondly, most current neural network architectures are heavily over-parametrised for various reasons, so there is huge scope for reducing bit-precision without significantly impacting accuracy or the loss function.
It is also important to understand that neural network architectures tend to be remarkably robust to aggressive quantization and extreme discretization.
Overparameterization also introduces a new degree of freedom: the number of parameters we choose to work with. This has direct implications for the type of problem being solved: is it a well-posed problem, or are we mainly interested in reducing the forward error or the backward error of the network?
Current research on quantization-based optimization of neural network architectures has not dealt with well-posed problems; most of it has focused on improving some forward metric (based on classification quality), but, because of overparameterization, very different models can end up optimizing the same metric. As a result, it is in some cases possible to have a high error between a quantized and a non-quantized model while still attaining good generalization performance.
The layered structure of neural networks is another dimension along which quantization can be applied, since each layer can, in principle, be quantized with its own parameters, as the sketch below illustrates.
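For intuition only, the hypothetical sketch below assigns each layer of a toy model its own symmetric weight scale; the helper logic is illustrative and not a PyTorch quantization API:

```python
import torch
import torch.nn as nn

# Toy model; the weight ranges of different layers typically differ,
# so each layer gets its own quantization scale.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

for name, param in model.named_parameters():
    if name.endswith("weight"):
        scale = param.detach().abs().max() / 127   # symmetric per-layer scale
        q = torch.clamp(torch.round(param.detach() / scale), -128, 127).to(torch.int8)
        print(f"{name}: scale={scale.item():.6f}, int8 values used={q.unique().numel()}")
```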
Experiment to Understand Quantization Methods
Problem Setup and Notation
Assume that the NN has L layers with learnable parameters, denoted as {W1, W2, …, WL}, with θ denoting the combination of all such parameters. Without loss of generality, we focus on the supervised learning problem, where the nominal goal is to optimize the following empirical risk minimization function:

ℒ(θ) = (1/N) Σ_{i=1}^{N} l(x_i, y_i; θ)

where (x_i, y_i) is an input data point and its corresponding label, l(x_i, y_i; θ) is the loss function (e.g., Mean Squared Error or Cross Entropy loss), and N is the total number of data points. Let us also denote the input hidden activations of the i-th layer as h_i, and the corresponding output hidden activations as a_i.
We assume that the trained model parameters θ are stored in floating point precision. In quantization, the goal is to reduce the precision of both the parameters (θ) and the intermediate activation maps (i.e., h_i, a_i) to low precision, with minimal impact on the generalization power/accuracy of the model. To do this, we need to define a quantization operator that maps a floating point value to a quantized one.
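As a concrete but deliberately simplified sketch, a common choice is the uniform quantization operator Q(r) = round(r/S) - Z with scale S and zero point Z, and dequantization r~ = S * (Q(r) + Z); the implementation below is illustrative and assumes S and Z are calibrated from the observed min/max range of the tensor:

```python
import torch

def calibrate(r: torch.Tensor, qmin: int = -128, qmax: int = 127):
    """Pick a scale S and zero point Z from the observed range of r
    (simple min/max calibration; other schemes exist)."""
    scale = (r.max() - r.min()) / (qmax - qmin)
    zero_point = torch.round(r.min() / scale) - qmin
    return scale, zero_point

def quantize(r: torch.Tensor, scale, zero_point, qmin: int = -128, qmax: int = 127):
    """Uniform quantization operator Q(r) = round(r / S) - Z, clamped to int8."""
    q = torch.round(r / scale) - zero_point
    return torch.clamp(q, qmin, qmax).to(torch.int8)

def dequantize(q: torch.Tensor, scale, zero_point):
    """Approximate inverse: r~ = S * (Q(r) + Z)."""
    return scale * (q.float() + zero_point)

theta = torch.randn(5)          # pretend these are fp32 weights
s, z = calibrate(theta)
q = quantize(theta, s, z)
print(theta)
print(dequantize(q, s, z))      # close to theta, up to binning error
```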