QuantizeLinear and DequantizeLinear nodes for Conv2D get folded as constants. #1719

Closed
srihari-humbarwadi opened this issue Sep 14, 2021 · 18 comments

Comments

@srihari-humbarwadi

srihari-humbarwadi commented Sep 14, 2021

Describe the bug
I have taken these representations from TensorRT QAT Presentation

Figure(1)
[image]
As shown in Figure(1) above, I added QuantizeAndDequantizeV2 nodes before the conv2d op and on the conv2d kernel in my model. But after converting the model with tf2onnx, I cannot find the QuantizeLinear and DequantizeLinear nodes for the conv2d kernel; as shown in Figure(2) below, I expected tf2onnx to keep them rather than fold them into constants.
Figure(2)
[image]

Figure(3)
[image]
TensorBoard visualization of my model, showing QuantizeAndDequantizeV2 ops for both the input and the weights.

Figure(4)
[image]
Netron visualization of the ONNX model, which shows QuantizeLinear and DequantizeLinear nodes only for the conv2d input.

It seems logical to fold the QuantizeLinear and DequantizeLinear nodes for the weights into a constant, as described in #1394. But looking at Figure(5) below, it appears that TensorRT requires the QuantizeLinear and DequantizeLinear nodes for both the conv2d input and the weights!

Figure(5)
[image]

Urgency
Blocked use case: TensorFlow 2.x QAT -> ONNX -> TensorRT.

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
  • TensorFlow Version: 2.6
  • Python version: 3.8

To Reproduce
Adding QuantizeAndDequantizeV2 nodes to TF 2.x graphs is not trivial, so it is difficult for me to include all of the code here.

@TomWildenhain-Microsoft
Contributor

Hi @srihari-humbarwadi

Ah, I see. We have two constant-folding optimizers (one for the pre-conversion TF graph, the other for the post-conversion ONNX graph). Our ONNX constant-folding optimizer exempts Dequantize ops for the reason you describe:

skip_type = ["Identity", "DequantizeLinear"]

But the TF one does not:

can_fold = node.type not in ['Enter', 'Placeholder', 'PlaceholderWithDefault', 'Switch', 'Merge',
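
For reference, the change being discussed is just extending that exclusion check. A minimal sketch (FOLD_EXEMPT_TF_TYPES is a hypothetical name, and the original list contains more entries than the truncated quote above shows):

# Sketch only, not a verified patch: exempt QuantizeAndDequantizeV2 from pre-conversion folding
# by treating it like the other node types the can_fold check already skips.
FOLD_EXEMPT_TF_TYPES = {'Enter', 'Placeholder', 'PlaceholderWithDefault', 'Switch', 'Merge',
                        'QuantizeAndDequantizeV2'}
can_fold = node.type not in FOLD_EXEMPT_TF_TYPES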

@srihari-humbarwadi
Author

@TomWildenhain-Microsoft adding QuantizeAndDequantizeV2 to

can_fold = node.type not in ['Enter', 'Placeholder', 'PlaceholderWithDefault', 'Switch', 'Merge',

doesn't seem to help!

@TomWildenhain-Microsoft
Contributor

Can you try using the --output_frozen_graph flag and view the resulting graph in Netron? Then we can see whether the input graph actually contains a dequantize on the kernel.

@srihari-humbarwadi
Author

This is the frozen graph from tf2onnx; it doesn't seem to have the QDQ ops for the kernel.
[image]

And here is the frozen graph I created using my own TensorFlow script. It does have the QDQ ops present for both the input and the kernel!

[image]

import tensorflow as tf
from tensorflow.python.framework.convert_to_constants import \
    convert_variables_to_constants_v2_as_graph


# q_model is the QAT-instrumented Keras model being exported.
@tf.function(input_signature=[
    tf.TensorSpec(shape=[1, 32, 32, 3], dtype=tf.float32, name='image')
])
def _forward_pass(image):
    return q_model(image, training=False)


concrete_fn = _forward_pass.get_concrete_function()

# Freeze the variables into constants and write out the resulting GraphDef.
frozen_serving_fn, frozen_graph_def = convert_variables_to_constants_v2_as_graph(
    concrete_fn, aggressive_inlining=True)
tf.io.write_graph(frozen_graph_def,
                  'qdq_logs',
                  'frozen_graph_def.pb',
                  as_text=False)

@TomWildenhain-Microsoft
Contributor

We run all graphs (except large models) through the TF optimizer, which might be having this effect. As a hack for testing, try adding the --large_model flag. The resulting ONNX will be a zip. Also output the frozen graph and see if the QDQ is present on the kernel.
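
For example, a conversion command along these lines (the model path and opset are placeholders for whatever you are currently using):

python -m tf2onnx.convert --saved-model ./qat_saved_model --opset 13 --output model.zip --large_model --output_frozen_graph frozen_graph.pb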

@srihari-humbarwadi
Author

When I use the --large_model flag I see this in the logs

2021-09-17 00:17:59,934 - INFO - tf2onnx.tfonnx: folding node using tf type=QuantizeAndDequantizeV2, name=StatefulPartitionedCall/model/quant_conv2d/quantize_and_dequantize_1
2021-09-17 00:17:59,935 - INFO - tf2onnx.tfonnx: folding node using tf type=QuantizeAndDequantizeV2, name=StatefulPartitionedCall/model/quant_conv2d_1/quantize_and_dequantize_1
2021-09-17 00:17:59,935 - INFO - tf2onnx.tfonnx: folding node using tf type=QuantizeAndDequantizeV2, name=StatefulPartitionedCall/model/quant_conv2d_2/quantize_and_dequantize_1
2021-09-17 00:17:59,935 - INFO - tf2onnx.tfonnx: folding node using tf type=QuantizeAndDequantizeV2, name=StatefulPartitionedCall/model/quant_conv2d_3/quantize_and_dequantize_1
2021-09-17 00:17:59,935 - INFO - tf2onnx.tfonnx: folding node using tf type=QuantizeAndDequantizeV2, name=StatefulPartitionedCall/model/quant_conv2d_4/quantize_and_dequantize_1
2021-09-17 00:17:59,936 - INFO - tf2onnx.tfonnx: folding node using tf type=QuantizeAndDequantizeV2, name=StatefulPartitionedCall/model/quant_conv2d_5/quantize_and_dequantize_1
2021-09-17 00:17:59,936 - INFO - tf2onnx.tfonnx: folding node using tf type=QuantizeAndDequantizeV2, name=StatefulPartitionedCall/model/quant_conv2d_6/quantize_and_dequantize_1
2021-09-17 00:17:59,936 - INFO - tf2onnx.tfonnx: folding node using tf type=QuantizeAndDequantizeV2, name=StatefulPartitionedCall/model/quant_conv2d_7/quantize_and_dequantize_1
2021-09-17 00:17:59,936 - INFO - tf2onnx.tfonnx: folding node using tf type=QuantizeAndDequantizeV2, name=StatefulPartitionedCall/model/quant_conv2d_8/quantize_and_dequantize_1

According to my naming scheme, quant_conv2d<_x>/quantize_and_dequantize_1 is the QDQ op for the kernel!

I also dumped the frozen graph, but weirdly it is only 460 KB, compared to 3450 KB without the --large_model flag. It does contain the QDQ ops for the kernel, whereas the ONNX model inside the zip does not.
[image]

@TomWildenhain-Microsoft
Contributor

The graph is smaller since the zip contains the weights externally. That's what the --large_model flag does. The log messages you are seeing suggest that "QuantizeAndDequantizeV2" isn't in:

can_fold = node.type not in ['Enter', 'Placeholder', 'PlaceholderWithDefault', 'Switch', 'Merge',

Did you add it for that test?

@srihari-humbarwadi
Author

srihari-humbarwadi commented Sep 16, 2021

Thanks, I tried adding QuantizeAndDequantizeV2 to

can_fold = node.type not in ['Enter', 'Placeholder', 'PlaceholderWithDefault', 'Switch', 'Merge',
and then converting with the --large_model flag. This got it to work. Another way I got it to work was, instead of enabling the --large_model flag, to manually disable tf_optimize_grappler.
[image]

But the generated QDQ ONNX model doesn't seem to work with TensorRT (it throws Internal Error (Assertion !p.kernelWeights.empty() failed.)).
A possible issue is the placement of the transpose op for the Conv kernel: it should ideally sit before the QuantizeLinear op rather than after the DequantizeLinear op, see onnx/onnx-tensorrt#656 (comment). Is there any way to push it back? (I can think of graphsurgeon.)

@TomWildenhain-Microsoft
Contributor

This should be solvable by adding QuantizeLinear to the const fold optimizer. Then the Dequantize will be pushed through the transpose here:

if node.type not in ["Transpose", "Reshape", "Unsqueeze"]:

@srihari-humbarwadi
Author

This should be solvable by adding QuantizeLinear to the const fold optimizer.
I am sorry, I am not clear on this. Did you mean we should add QuantizeLinear here?

skip_type = ["Identity", "DequantizeLinear"]


if node.type not in ["Transpose", "Reshape", "Unsqueeze"]:

Also, IIUC, the ConstDequantizeOptimizer will convert (w => Q => DQ => Transpose) to (w => Q => Transpose => DQ), and this pattern is problematic with TensorRT.

I was trying to convert w => Q => DQ => Transpose to w => Transpose => Q => DQ

@TomWildenhain-Microsoft
Contributor

Also, IIUC, ConstDequantizeOptimizer will convert (w => Q => DQ => Transpose) to (w => Q => Transpose => DQ)
this pattern is problematic with TensorRT.
I was trying to convert w => Q => DQ => Transpose to w => Transpose => Q => DQ

The ConstDequantizeOptimizer only converts DQ ops with constant inputs. So it should leave (w => Q => DQ => Transpose) as is, which won't work for you. I was hoping to convert it to wq^T => DQ where wq^T is a constant equal to the transpose of the quantized weight. The optimization steps would be:

w => Q => DQ => Transpose => Conv ... - initial setup
wq => DQ => Transpose => Conv ... - The quantize is constant folded into a constant quantized weight
wq^T => DQ => Conv ... - The ConstDequantizeOptimizer detects that the DQ has a constant input, pushes it past the Transpose, and the Transpose (now applied to a constant) is folded away.

I am sorry, I am not clear about this, did you mean we should add QuantizeLinear here?

skip_type = ["Identity", "DequantizeLinear"]

To do the above, you will need post-conversion constant folding for the Quantize op. Post-conversion constant folding requires a numpy implementation of the op. So yes, it should be added to that file, but not on that line.

Will the wq^T => DQ pattern be sufficient for your purposes?
Is it clear where the Quantize implementation needs to be added for constant folding?
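
For reference, a per-tensor numpy implementation of QuantizeLinear along the lines the ONNX spec defines it (y = saturate(round(x / scale) + zero_point)) could look roughly like the sketch below. fold_quantize_linear is just an illustrative name, and per-axis quantization would additionally need to reshape scale/zero_point along the axis attribute before broadcasting.

import numpy as np

def fold_quantize_linear(x, scale, zero_point):
    # Saturation range depends on the zero_point dtype (uint8 vs int8).
    qmin, qmax = (0, 255) if zero_point.dtype == np.uint8 else (-128, 127)
    # np.rint rounds half to even, matching the rounding the ONNX spec calls for.
    q = np.rint(x / scale) + zero_point.astype(np.int64)
    return np.clip(q, qmin, qmax).astype(zero_point.dtype)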

@srihari-humbarwadi
Author

srihari-humbarwadi commented Sep 22, 2021

Thank you, I seem to understand what needs to be done.
Also, as an alternative: IIUC, since we always need to transpose the TF Conv2D kernel before feeding it into the ONNX Conv op, wouldn't it be much easier to add any additional ops (in this case QuantizeLinear and DequantizeLinear) after performing that transpose during Conv op conversion?


For now, I just ended up doing w => transpose => QuantizeAndDequantizeV2 => transpose.
As expected, tf2onnx removes the redundant transpose op at the end and gives me this:

[image]

def qdq(...):
    # Transpose the HWIO kernel to OIHW (the ONNX Conv weight layout) before quantizing,
    # so the per-channel min/max are computed per output channel (axis 0 after the transpose).
    inputs_t = tf.transpose(inputs, perm=[3, 2, 0, 1])

    if training:
        weights['min_value'].assign(tf.reduce_min(inputs_t, axis=[1, 2, 3]))
        weights['max_value'].assign(tf.reduce_max(inputs_t, axis=[1, 2, 3]))

    qdq_inputs_t = self._qdq_op(
        input=inputs_t,
        input_min=weights['min_value'],
        input_max=weights['max_value'],
        num_bits=self.num_bits,
        range_given=not training)

    # Transpose back to HWIO so the surrounding TF graph is unchanged; the redundant
    # transpose pair is removed during ONNX conversion, as noted above.
    qdq_inputs = tf.transpose(qdq_inputs_t, perm=[2, 3, 1, 0])
    return qdq_inputs

And I confirmed that the resulting network can be successfully parsed by TensorRT. There is also a significant improvement in latency and performance when the Q/DQ op on the weight kernel is not folded, since that makes sure the following fusion is triggered:
[image]

Can we expect support for this in the next release? It would finally unblock converting TF 2.x models with Q/DQ ops to ONNX and then to TensorRT.

@TomWildenhain-Microsoft
Contributor

Also, there is a significant improvement in latency and performance when the Q/DQ op on the weight kernel is not folded.

To clarify here, the fusion pattern as I understand it should match so long as the DQ op on the weight kernel is not folded. Folding the Q should be fine. Note how in your image W' has a DQ but no Q is needed.

wouldn't it be much easier to add any additional ops (in this case QuantizeLinear and DequantizeLinear) after performing the transpose during conv op conversion?

It would be possible to do this with conv. I don't know if it would be easier than adding the Quantize op to constant folding.

@srihari-humbarwadi
Author

srihari-humbarwadi commented Oct 17, 2021

Setting the grappler config option rewrite_options.experimental_disable_folding_quantization_emulation = True here seems to disable the constant folding of Q/DQ ops on the Conv2D kernel:

config = config_pb2.ConfigProto()
rewrite_options = config.graph_options.rewrite_options
config.graph_options.infer_shapes = True
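
Concretely, the tweak amounts to setting that flag where the snippet above builds the grappler config; a sketch (the import path shown is, to my understanding, what tf2onnx itself uses for config_pb2):

from tensorflow.core.protobuf import config_pb2

config = config_pb2.ConfigProto()
rewrite_options = config.graph_options.rewrite_options
config.graph_options.infer_shapes = True
# Keep grappler from folding the quantization-emulation (QDQ) ops on constant inputs,
# such as the Conv2D kernel.
rewrite_options.experimental_disable_folding_quantization_emulation = True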


[image]

It would be possible to do this with conv. I don't know if it would be easier than adding the Quantize op to constant folding.

There are a few other cases where the inputs are not constants, so I was hoping to arrive at a more general solution for this.
For example, in the above graph the Squeeze op should be placed before the QuantizeLinear op.

Is there a way to always place tensor shape manipulation ops before the QuantizeLinear op? Or, a better question: is there a way to make sure a QuantizeLinear op is always immediately followed by a DequantizeLinear op?

Since Quantize ops are always immediately followed by Dequantize ops when building TensorFlow graphs with QuantizeAndDequantizeV2 nodes, making sure this stays true during ONNX conversion seems pretty important; otherwise the TensorRT fusion patterns fail to trigger.
@TomWildenhain-Microsoft

@TomWildenhain-Microsoft
Contributor

Setting grappler config rewrite_options.experimental_disable_folding_quantization_emulation = True here seems to disable the constant folding of Q/QD ops on Conv2D kernel

Nice work! We should update tf2onnx to have this flag. Can you submit a PR?

For example in the above graph, the squeeze op should be placed before the QuantizeLinear op.

Are you sure about this? It was my understanding from the image you gave that the fusion doesn't really need the Quantize to be followed by a Dequantize; rather, it needs the op being fused (Conv, MatMul, etc.) to be surrounded by Dequantize/Quantize ops on its inputs/outputs, respectively (outputs are optional in some cases):

[image]

I don't see any boxes around Q/DQ pairs. If the Squeeze were moved up, the GlobalAveragePool op would not be eligible for fusing.

@srihari-humbarwadi
Author

srihari-humbarwadi commented Oct 18, 2021

I was under the same impression, but when I built the TensorRT engine from the ONNX graph, I could see some dangling Q ops left in the graph, which show up as warnings. I raised this on the TensorRT forum (tensorrt forum comment) and was told to move the Squeeze op so that the Q is followed by a DQ op. Once I pushed the Squeeze before the Q op, I saw some improvement in inference latency and all the warnings about missing quantization data went away.
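
As a quick sanity check for this, a small script with the onnx package can flag QuantizeLinear outputs that are not consumed directly by a DequantizeLinear (the "dangling Q" pattern); dangling_quantize_nodes is just an illustrative name:

import onnx

def dangling_quantize_nodes(model_path):
    graph = onnx.load(model_path).graph
    # Map each tensor name to the nodes that consume it.
    consumers = {}
    for node in graph.node:
        for inp in node.input:
            consumers.setdefault(inp, []).append(node)
    dangling = []
    for node in graph.node:
        if node.op_type == "QuantizeLinear":
            users = consumers.get(node.output[0], [])
            if not users or any(u.op_type != "DequantizeLinear" for u in users):
                dangling.append(node.name)
    return dangling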

There is yet another change that I had to make in the ONNX optimizers to achieve optimal fusion during TensorRT engine building. TensorRT recommends (tensorrt forum comment) having a Q/DQ pair on the identity branch of the ResNet block before the Add op. But when I added the Q/DQ on the identity branch, the MergeDuplicatedNodesOptimizer merged the duplicated Q/DQ ops, as can be seen in the graph on the right; changing the check here to if node.type in ["Identity", "QuantizeLinear", "DequantizeLinear"]: avoids this merge:

if node.type in ["Identity"]:
    return True
if node.is_graph_input():
    return True

The graph on the left is what I get after making the change.
[image]

@TomWildenhain-Microsoft
Contributor

If merging two identical nodes causes an issue, I'd really consider that a problem with TensorRT. But we could still exempt them. It almost seems like the best way to deal with Q/DQ would be in a post-processing phase after all the optimizers run. During that phase we could detect all the Q/DQ pairs, propagate quantization info according to rules (like the Relu/Add case), and then write Q/DQ pairs back into the graph.

@fatcat-z
Collaborator

Resolved this issue with PR #1856. Closing it.
