
Add post training observer_quantizer #3915

Merged
merged 13 commits into microsoft:master on Jul 28, 2021

Conversation

@chenbohua3 (Contributor) commented Jul 8, 2021

This PR adds a post-training quantizer, ObserverQuantizer. It uses the official PyTorch HistogramObserver to calculate quantization information for activations and MinMaxObserver for weights. This quantizer has two advantages:

  1. Users can directly obtain quantization information for a pre-trained model with a small amount of calibration data; no training is needed.
  2. The quantization information generated by this quantizer can be used to initialize the scale and zero point of the QAT quantizer, which leads to faster convergence and better accuracy.

Also, users can easily customize their own observers, as long as they are used in the same way as PyTorch's observers. I will upload our implementation of a classic post-training quantization observer (e.g. the Kullback-Leibler observer used by TensorRT) in the next PR.
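A minimal usage sketch (names such as Mnist, device, calib_loader, model_path, and the exact constructor arguments are assumptions for illustration; only ObserverQuantizer, compress(), and export_model() come from this PR):

    import torch
    from nni.algorithms.compression.pytorch.quantization import ObserverQuantizer

    model = Mnist().to(device)            # assumed pre-trained model and device
    config_list = [{
        'quant_types': ['weight', 'output'],
        'quant_bits': {'weight': 8, 'output': 8},   # only 8-bit is supported for now
        'op_types': ['Conv2d', 'Linear'],
    }]
    quantizer = ObserverQuantizer(model.eval(), config_list)

    # Feed a small amount of calibration data through the model so the observers
    # can collect statistics; no gradient computation or training is involved.
    with torch.no_grad():
        for data, _ in calib_loader:      # assumed calibration data loader
            model(data.to(device))

    quantizer.compress()                  # compute scale / zero point from the observers
    calibration_config = quantizer.export_model(model_path, calibration_path)
    print("calibration_config: ", calibration_config)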

from nni.algorithms.compression.pytorch.quantization import ObserverQuantizer


class Mnist(torch.nn.Module):
Contributor:

Please import the model directly from mnist.

Contributor Author:

done

@@ -119,6 +122,217 @@ def quant_backward(tensor, grad_output, quant_type, scale, zero_point, qmin, qma
return grad_output


class ObserverQuantizer(Quantizer):
"""
Contributor:

Please add a description of ObserverQuantizer here, emphasizing key points such as that it is a post-training quantizer and that it currently only works in evaluation mode.

Contributor Author:

done

scale, zero_point = observer.calculate_qparams()
return scale, zero_point

def quantize(self, x, scale, zero_point, qmin, qmax):
Contributor:

For class-internal functions, we recommend using '_' as a prefix.

Contributor Author:

done

calibration_config = quantizer.export_model(model_path, calibration_path)
print("calibration_config: ", calibration_config)

# For now the quantization settings of ObserverQuantizer does not match the TensorRT,
Contributor:

Why can't we support TensorRT currently? Is it because runtime errors may be raised, or because the results of simulated quantization and TensorRT would not be aligned?

Contributor Author:

Because the dtype and qscheme of the PyTorch default observers differ from those used by TensorRT. For example, TensorRT uses per_tensor_symmetric with int8 for activations, while PyTorch uses per_tensor_affine with quint8.
When customization of dtype and qscheme is ready, we can support TensorRT.
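As a rough illustration of the mismatch, using the standard PyTorch observer arguments (nothing below is code from this PR):

    import torch
    from torch.quantization import HistogramObserver

    # PyTorch default activation settings used by this quantizer:
    # affine quantization with unsigned 8-bit integers.
    default_act_observer = HistogramObserver(dtype=torch.quint8,
                                             qscheme=torch.per_tensor_affine,
                                             reduce_range=True)

    # A TensorRT-style activation observer would instead be symmetric and signed.
    trt_style_act_observer = HistogramObserver(dtype=torch.qint8,
                                               qscheme=torch.per_tensor_symmetric)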

model(data)


def test_trt(engine, test_loader):
Contributor:

It's a little strange that we keep this function without using it.

Contributor Author:

I have deleted it.

def validate_config(self, model, config_list):
schema = CompressorSchema([{
Optional('quant_types'): Schema([lambda x: x in ['weight', 'output', 'input']]),
Optional('quant_bits'): Or(And(int, lambda n: n == 8), Schema({
Contributor:

For post-training quantization, we only support int8 right now. If we want to support all bit widths or mixed precision, are there any obstacles?

Contributor Author:

I don't think there would be major obstacles. The reason we only support int8 right now is that PyTorch quantization observers only support 8-bit quantization (see here). To support other bit widths, we would need to extend/customize the observers.
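For context, the observer workflow this quantizer relies on looks roughly like the following sketch; the 8-bit limitation comes from the observers themselves:

    import torch
    from torch.quantization import MinMaxObserver

    # Observers are modules: feeding tensors through them records statistics,
    # and calculate_qparams() turns those statistics into scale / zero point.
    observer = MinMaxObserver(dtype=torch.qint8, qscheme=torch.per_tensor_symmetric)
    for _ in range(4):
        observer(torch.randn(16, 8))      # calibration tensors
    scale, zero_point = observer.calculate_qparams()
    # The resulting qparams target an 8-bit range; other bit widths would require
    # extending or customizing the observer.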

# NOTE: this quantizer is experimental for now. The dtype and qscheme of quantization
# is hard-coded.
# TODO:
# 1. support dtype and qscheme customization through config_list. Current settings:
Contributor:

If dtype and qscheme can be applied to each layer separately, then it is better to support them in config_list; otherwise, it is better to support them as the quantizer's initialization arguments.

Contributor:

Agreed, and I think they should be supported in config_list, since weights and activations may use different settings, and even different layers can use different settings.

Contributor Author:

dtype and qscheme can be applied to each layer separately. Also, we can customize the validate_config function to control the rules for each quantizer.
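To make this concrete, a hypothetical per-layer config_list entry might look like the sketch below; the 'quant_dtype' and 'quant_scheme' keys are illustrative only and are not implemented in this PR:

    config_list = [{
        'quant_types': ['weight', 'output'],
        'quant_bits': {'weight': 8, 'output': 8},
        'quant_dtype': {'weight': 'int8', 'output': 'uint8'},      # hypothetical key
        'quant_scheme': {'weight': 'per_tensor_symmetric',         # hypothetical key
                         'output': 'per_tensor_affine'},
        'op_types': ['Conv2d', 'Linear'],
    }]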

def calibration(model, device, test_loader):
model.eval()
with torch.no_grad():
for data, target in test_loader:
Contributor:

Recommend using _ in place of target if target is not used.

Contributor Author:

done
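With that change, the calibration helper would read roughly as follows (the device transfer line is assumed):

    def calibration(model, device, test_loader):
        model.eval()
        with torch.no_grad():
            for data, _ in test_loader:   # target is unused, so '_' replaces it
                data = data.to(device)
                model(data)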


@linbinskn (Contributor) commented Jul 19, 2021

Post-training quantization is a critical feature in model compression and I think this PR is a good start for NNI to support it. I have some thoughts related to this PR and the current NNI quantization design:

  • Both the bit-width choice and the observers should be supported in config_list at the layer level. The reasons are:

    1. For post-training quantization, in addition to 8-bit, other bit widths such as 4-bit and 1-bit are also supported by hardware such as NVIDIA GPUs (Turing architecture or later), and more bit widths will be supported by future hardware that enables more quantization data types.
    2. For deployment, different backend frameworks may use different observers, and we have to offer general support for them through config_list. For research, researchers may want to use different observers within a layer (activation/weight) and across layers to get the best performance.
  • The current design of quantization in NNI is not good enough, and it is important for us to clearly define post-training quantization and quantization-aware training (not only QAT) in NNI's model-compression design. For instance, one potential abstraction is that post-training quantization is a one-shot quantization algorithm, and quantization-aware training algorithms (not only QAT) are different fine-tuning methods on top of it. Should we separate them as different quantizers? Or are they different stages in the whole quantization pipeline (post-training quantization is necessary and quantization-aware training is optional) that can be applied sequentially?

I think we don't need to modify a lot in this PR. We can discuss these points, make an appropriate design, and adjust the corresponding APIs gradually. I am very glad to contribute to this part to make it better.

@QuanluZhang (Contributor):

> (quoting @linbinskn's comment above)

Thanks for the discussion. At the current stage, it is not a good idea to put bit widths in config_list: users could then easily specify different bit widths for different layers, and if the quantization algorithm does not support per-layer bit widths, this would be error-prone and not user friendly.

About the refactoring of quantization in NNI: we can think about how to adapt quantizers into a modularized framework similar to the NNI pruners. The observer in quantization is very similar to the "metric calculator" in our refined NNI pruning framework. On the other hand, we should survey all quantization-aware training algorithms to figure out whether they can be seen as a type of fine-tuning.

@J-shang (Contributor) commented Jul 19, 2021

> (quoting @linbinskn's comment and @QuanluZhang's reply above)

Agreed. Maybe we need a meeting to discuss these points and rethink what should go in the compressor and what in config_list. Are all post-training quantization algorithms based on observers?

@@ -120,6 +123,222 @@ def quant_backward(tensor, grad_output, quant_type, scale, zero_point, qmin, qma
return grad_output


class ObserverQuantizer(Quantizer):
Contributor:

Please add a unit test for this quantizer.

Contributor Author:

done

scale, zero_point = self.calculate_qparams(layer.name, 'weight')
module.register_buffer('weight_scale', scale.to(self.device))
module.register_buffer('weight_zero_point', zero_point.to(self.device))
# todo: recover old_weight to weight, because the compressed
Contributor:

Could you explain more about the case where old_weight should be recovered to weight?

Contributor Author:

Because after being wrapped by QuantizerModuleWrapper, two things may happen to the weight of each layer:

  1. Its type may change from torch.nn.Parameter to torch.Tensor.
  2. The weight of BN may have been folded.

Basically, we need to ensure that the structure and parameter types of the model are consistent before and after PTQ. In theory, we can use the original model for downstream tasks (like QAT or deployment), since the current PTQ does not change the weights of the model. But users may also use the model exported by the PTQ for downstream tasks, so I think it is better to recover them.

Contributor Author:

I have added some logic in export_model to recover the weight.

Contributor:

  1. Supposedly, the quantizer's export should export the model after quantization; it is not proper to export the original model.

  2. I notice an inconsistency among the supported quantizers: some handle BN folding and some do not; some recover the weight and some do not. Could you explain the current status a little more?

Contributor Author:

As discussed offline, I have replaced the original weight with the pseudo-quantized weight.
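"Pseudo-quantized" here means fake-quantized: the weight is rounded onto the integer grid and immediately dequantized back to float. A minimal sketch of the idea (not the exact implementation in this PR):

    import torch

    def pseudo_quantize(x, scale, zero_point, qmin, qmax):
        # quantize to the integer grid, clamp to the quantized range,
        # then dequantize back to floating point
        q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax)
        return (q - zero_point) * scale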

@chenbohua3 (Contributor Author):

Is it necessary to add documentation? I think it's better to leave this until the dtype/quant-scheme customization is ready. Otherwise there will be many NOTEs or TODOs in the doc :)

@chenbohua3 chenbohua3 closed this Jul 26, 2021
@chenbohua3 chenbohua3 reopened this Jul 26, 2021
@QuanluZhang (Contributor):

> (quoting @chenbohua3's question above)

Agree.

else:
self.record(wrapper, 'weight', old_weight)
new_weight = old_weight
return new_weight
Contributor:

It seems we never use this new_weight, because there isn't something like module.weight = new_weight.

Contributor Author:

Yes, new_weight is not used. I just return it the way the QAT quantizer does; it also returns an unused new_weight. Should I delete it?

Contributor:

I think for the quantization simulation after compress(), we need to add module.weight = new_weight.

        if self.compressed:
            new_weight = self._quantize(old_weight,
                                        module.weight_scale,
                                        module.weight_zero_point,
                                        module.weight_qmin,
                                        module.weight_qmax)
            module.weight = new_weight

Will we get the correct inference result in this way?

Contributor Author:

You are right; I have corrected it.

calibration_config[name]['weight_bit'] = 8
val = float(module.weight_scale * module.weight_qmax)
calibration_config[name]['tracked_min_weight'] = val
calibration_config[name]['tracked_max_weight'] = -val
Contributor:

Should it be calibration_config[name]['tracked_min_weight'] = -val and calibration_config[name]['tracked_max_weight'] = val?

Contributor Author:

Yes, I have corrected it.
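For reference, the corrected export lines should read:

    val = float(module.weight_scale * module.weight_qmax)
    calibration_config[name]['tracked_min_weight'] = -val
    calibration_config[name]['tracked_max_weight'] = val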

@chenbohua3 (Contributor Author):

I have removed the code that replaced the weight with the quantized one in the compress function. @QuanluZhang @J-shang

new_parameters = dict(model.named_parameters())
self.assertTrue(all(torch.equal(v, new_parameters[k]) for k, v in origin_parameters.items()))
self.assertTrue(calibration_config is not None)
self.assertTrue(len(calibration_config) == 4)
Contributor:

Please update the test accordingly.

Contributor Author:

done

modules_to_compress = self.get_modules_to_compress()
all_observers = defaultdict(dict)
weight_q_min, weight_q_max = -127, 127
activation_q_min, activation_q_max = 0, 127 # reduce_range is set to True
Contributor:

Why do we set the quantized activation range to (0, 127) instead of (0, 255)?

Contributor Author:

By default, the activation observer's reduce_range is set to True. This means that the range of the quantized data type is reduced by 1 bit, which is sometimes required to avoid instruction overflow.
However, there was indeed a mismatch between the range used here and that in export_model; I have corrected it.
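A small sketch of the effect, using the standard PyTorch observer API (not code from this PR):

    import torch
    from torch.quantization import HistogramObserver

    obs = HistogramObserver(dtype=torch.quint8, reduce_range=True)
    obs(torch.rand(64, 8))                 # some calibration data
    scale, zero_point = obs.calculate_qparams()
    # With reduce_range=True the effective quantized range is [0, 127] rather than
    # the full quint8 range [0, 255], leaving one bit of headroom against overflow.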

@QuanluZhang QuanluZhang merged commit 370e88d into microsoft:master Jul 28, 2021