QuantizeLinear and DequantizeLinear nodes for Conv2D get folded as constants. #1719

Closed
srihari-humbarwadi opened this issue Sep 14, 2021 · 18 comments

Comments

@srihari-humbarwadi

srihari-humbarwadi commented Sep 14, 2021

Describe the bug
I have taken these representations from TensorRT QAT Presentation

Figure(1)
[image]
As shown in Figure(1) above, I added QuantizeAndDequantizeV2 nodes before the conv2d op and on the conv2d kernel in my model. But after converting the model with tf2onnx, I cannot find the QuantizeLinear and DequantizeLinear nodes for the conv2d kernel; as shown in Figure(2) below, I expected tf2onnx to keep them rather than fold them into constants.
Figure(2)
[image]

Figure(3)
[image]
TensorBoard visualization of my model, showing QuantizeAndDequantizeV2 ops for both the input and the weights.

Figure(4)
[image]
Netron visualization of the ONNX model, which shows QuantizeLinear and DequantizeLinear nodes only for the conv2d input.

It seems logical to fold the QuantizeLinear and DequantizeLinear nodes for the weights into a constant, as described in #1394. But looking at Figure(5) below, it appears that TensorRT requires the QuantizeLinear and DequantizeLinear nodes for both the conv2d input and the weights!

Figure(5)
[image]

Urgency
Blocked use case: TensorFlow 2.x QAT -> ONNX -> TensorRT.

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
  • TensorFlow Version: 2.6
  • Python version: 3.8

To Reproduce
Adding QuantizeAndDequantizeV2 nodes to TF 2.x graphs is not trivial, so it is difficult for me to include all of the code here.

@TomWildenhain-Microsoft
Contributor

Hi @srihari-humbarwadi

Ah, I see. We have two constant-folding optimizers (one for the pre-conversion TF graph, the other for the post-conversion ONNX graph). Our ONNX constant-folding optimizer exempts Dequantize ops for the reason you describe:

skip_type = ["Identity", "DequantizeLinear"]

But the TF one does not:

can_fold = node.type not in ['Enter', 'Placeholder', 'PlaceholderWithDefault', 'Switch', 'Merge',
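
For reference, the change being discussed is just extending that exclusion check. A minimal sketch (FOLD_EXEMPT_TF_TYPES is a hypothetical name, and the original list contains more entries than the truncated quote above shows):

# Sketch only, not a verified patch: exempt QuantizeAndDequantizeV2 from pre-conversion folding
# by treating it like the other node types the can_fold check already skips.
FOLD_EXEMPT_TF_TYPES = {'Enter', 'Placeholder', 'PlaceholderWithDefault', 'Switch', 'Merge',
                        'QuantizeAndDequantizeV2'}
can_fold = node.type not in FOLD_EXEMPT_TF_TYPES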

@srihari-humbarwadi
Author

@TomWildenhain-Microsoft adding QuantizeAndDequantizeV2 to

can_fold = node.type not in ['Enter', 'Placeholder', 'PlaceholderWithDefault', 'Switch', 'Merge',

doesn't seem to help!

@TomWildenhain-Microsoft
Contributor

Can you try using the --output_frozen_graph flag and view the resulting graph in Netron? Then we can see whether the input graph actually contains a dequantize on the kernel.

@srihari-humbarwadi
Author

This is the frozen graph from tf2onnx; it doesn't seem to have the QDQ ops for the kernel.
[image]

And here is the frozen graph I created using my own TensorFlow script. It does have the QDQ ops present for both the input and the kernel!

[image]

import tensorflow as tf
from tensorflow.python.framework.convert_to_constants import \
    convert_variables_to_constants_v2_as_graph


# q_model is the QAT-instrumented Keras model being exported.
@tf.function(input_signature=[
    tf.TensorSpec(shape=[1, 32, 32, 3], dtype=tf.float32, name='image')
])
def _forward_pass(image):
    return q_model(image, training=False)


concrete_fn = _forward_pass.get_concrete_function()

# Freeze the variables into constants and write out the resulting GraphDef.
frozen_serving_fn, frozen_graph_def = convert_variables_to_constants_v2_as_graph(
    concrete_fn, aggressive_inlining=True)
tf.io.write_graph(frozen_graph_def,
                  'qdq_logs',
                  'frozen_graph_def.pb',
                  as_text=False)

@TomWildenhain-Microsoft
Contributor

We run all graphs (except large models) through the TF optimizer, which might be having this effect. As a hack for testing, try adding the --large_model flag. The resulting ONNX will be a zip. Also output the frozen graph and see if the QDQ is present on the kernel.
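
For example, a conversion command along these lines (the model path and opset are placeholders for whatever you are currently using):

python -m tf2onnx.convert --saved-model ./qat_saved_model --opset 13 --output model.zip --large_model --output_frozen_graph frozen_graph.pb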

@srihari-humbarwadi
Author

When I use the --large_model flag I see this in the logs

2021-09-17 00:17:59,934 - INFO - tf2onnx.tfonnx: folding node using tf type=QuantizeAndDequantizeV2, name=StatefulPartitionedCall/model/quant_conv2d/quantize_and_dequantize_1
2021-09-17 00:17:59,935 - INFO - tf2onnx.tfonnx: folding node using tf type=QuantizeAndDequantizeV2, name=StatefulPartitionedCall/model/quant_conv2d_1/quantize_and_dequantize_1
2021-09-17 00:17:59,935 - INFO - tf2onnx.tfonnx: folding node using tf type=QuantizeAndDequantizeV2, name=StatefulPartitionedCall/model/quant_conv2d_2/quantize_and_dequantize_1
2021-09-17 00:17:59,935 - INFO - tf2onnx.tfonnx: folding node using tf type=QuantizeAndDequantizeV2, name=StatefulPartitionedCall/model/quant_conv2d_3/quantize_and_dequantize_1
2021-09-17 00:17:59,935 - INFO - tf2onnx.tfonnx: folding node using tf type=QuantizeAndDequantizeV2, name=StatefulPartitionedCall/model/quant_conv2d_4/quantize_and_dequantize_1
2021-09-17 00:17:59,936 - INFO - tf2onnx.tfonnx: folding node using tf type=QuantizeAndDequantizeV2, name=StatefulPartitionedCall/model/quant_conv2d_5/quantize_and_dequantize_1
2021-09-17 00:17:59,936 - INFO - tf2onnx.tfonnx: folding node using tf type=QuantizeAndDequantizeV2, name=StatefulPartitionedCall/model/quant_conv2d_6/quantize_and_dequantize_1
2021-09-17 00:17:59,936 - INFO - tf2onnx.tfonnx: folding node using tf type=QuantizeAndDequantizeV2, name=StatefulPartitionedCall/model/quant_conv2d_7/quantize_and_dequantize_1
2021-09-17 00:17:59,936 - INFO - tf2onnx.tfonnx: folding node using tf type=QuantizeAndDequantizeV2, name=StatefulPartitionedCall/model/quant_conv2d_8/quantize_and_dequantize_1

According to my naming scheme, quant_conv2d<_x>/quantize_and_dequantize_1 is the QDQ op for the kernel!

I also dumped the frozen graph, but weirdly it is only 460 KB, compared to 3450 KB without the --large_model flag. It does contain the QDQ ops for the kernel, whereas the ONNX model inside the zip does not.
[image]

@TomWildenhain-Microsoft
Contributor

The graph is smaller since the zip contains the weights externally. That's what the --large_model flag does. The log messages you are seeing suggest that "QuantizeAndDequantizeV2" isn't in:

can_fold = node.type not in ['Enter', 'Placeholder', 'PlaceholderWithDefault', 'Switch', 'Merge',

Did you add it for that test?

@srihari-humbarwadi
Author

srihari-humbarwadi commented Sep 16, 2021

Thanks, I tried adding QuantizeAndDequantizeV2 to

can_fold = node.type not in ['Enter', 'Placeholder', 'PlaceholderWithDefault', 'Switch', 'Merge',
and then converting with the --large_model flag. This got it to work. Another way I got it to work was, instead of enabling the --large_model flag, to manually disable tf_optimize_grappler.
[image]

But the generated QDQ ONNX model doesn't seem to work with TensorRT (it throws Internal Error (Assertion !p.kernelWeights.empty() failed.)).
A possible issue is the placement of the transpose op for the Conv kernel: it should ideally sit before the QuantizeLinear op rather than after the DequantizeLinear op, see onnx/onnx-tensorrt#656 (comment). Is there any way to push it back? (I can think of graphsurgeon.)

@TomWildenhain-Microsoft
Contributor

This should be solvable by adding QuantizeLinear to the const fold optimizer. Then the Dequantize will be pushed through the transpose here:

if node.type not in ["Transpose", "Reshape", "Unsqueeze"]:

@srihari-humbarwadi
Author

This should be solvable by adding QuantizeLinear to the const fold optimizer.
I am sorry, I am not clear on this. Did you mean we should add QuantizeLinear here?

skip_type = ["Identity", "DequantizeLinear"]


if node.type not in ["Transpose", "Reshape", "Unsqueeze"]:

Also, IIUC, the ConstDequantizeOptimizer will convert (w => Q => DQ => Transpose) to (w => Q => Transpose => DQ), and this pattern is problematic with TensorRT.

I was trying to convert w => Q => DQ => Transpose to w => Transpose => Q => DQ

@TomWildenhain-Microsoft
Contributor

Also, IIUC, ConstDequantizeOptimizer will convert (w => Q => DQ => Transpose) to (w => Q => Transpose => DQ)
this pattern is problematic with TensorRT.
I was trying to convert w => Q => DQ => Transpose to w => Transpose => Q => DQ

The ConstDequantizeOptimizer only converts DQ ops with constant inputs. So it should leave (w => Q => DQ => Transpose) as is, which won't work for you. I was hoping to convert it to wq^T => DQ where wq^T is a constant equal to the transpose of the quantized weight. The optimization steps would be:

w => Q => DQ => Transpose => Conv ... - initial setup
wq => DQ => Transpose => Conv ... - The quantize is constant folded into a constant quantized weight
wq^T => DQ => Conv ... - The ConstDequantizeOptimizer detects that the DQ has a constant input, pushes it past the Transpose, and the Transpose (now applied to a constant) is folded away.

I am sorry, I am not clear about this, did you mean we should add QuantizeLinear here?

skip_type = ["Identity", "DequantizeLinear"]

To do the above, you will need post-conversion constant folding for the Quantize op. Post-conversion constant folding requires a numpy implementation of the op. So yes, it should be added to that file, but not on that line.

Will the wq^T => DQ pattern be sufficient for your purposes?
Is it clear where the Quantize implementation needs to be added for constant folding?
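
For reference, a per-tensor numpy implementation of QuantizeLinear along the lines the ONNX spec defines it (y = saturate(round(x / scale) + zero_point)) could look roughly like the sketch below. fold_quantize_linear is just an illustrative name, and per-axis quantization would additionally need to reshape scale/zero_point along the axis attribute before broadcasting.

import numpy as np

def fold_quantize_linear(x, scale, zero_point):
    # Saturation range depends on the zero_point dtype (uint8 vs int8).
    qmin, qmax = (0, 255) if zero_point.dtype == np.uint8 else (-128, 127)
    # np.rint rounds half to even, matching the rounding the ONNX spec calls for.
    q = np.rint(x / scale) + zero_point.astype(np.int64)
    return np.clip(q, qmin, qmax).astype(zero_point.dtype)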

@srihari-humbarwadi
Author

srihari-humbarwadi commented Sep 22, 2021

Thank you, I seem to understand what needs to be done.
Also, as an alternative: IIUC, since we always need to transpose the TF Conv2D kernel before feeding it into the ONNX Conv op, wouldn't it be much easier to add any additional ops (in this case QuantizeLinear and DequantizeLinear) after performing that transpose during Conv op conversion?


For now, I just ended up doing w => transpose => QuantizeAndDequantizeV2 => transpose.
As expected, tf2onnx removes the redundant transpose op at the end and gives me this:

[image]

def qdq(...):
    # Transpose the HWIO kernel to OIHW (the ONNX Conv weight layout) before quantizing,
    # so the per-channel min/max are computed per output channel (axis 0 after the transpose).
    inputs_t = tf.transpose(inputs, perm=[3, 2, 0, 1])

    if training:
        weights['min_value'].assign(tf.reduce_min(inputs_t, axis=[1, 2, 3]))
        weights['max_value'].assign(tf.reduce_max(inputs_t, axis=[1, 2, 3]))

    qdq_inputs_t = self._qdq_op(
        input=inputs_t,
        input_min=weights['min_value'],
        input_max=weights['max_value'],
        num_bits=self.num_bits,
        range_given=not training)

    # Transpose back to HWIO so the surrounding TF graph is unchanged; the redundant
    # transpose pair is removed during ONNX conversion, as noted above.
    qdq_inputs = tf.transpose(qdq_inputs_t, perm=[2, 3, 1, 0])
    return qdq_inputs

And I confirmed that the resulting network can be successfully parsed by TensorRT. There is also a significant improvement in latency and performance when the Q/DQ op on the weight kernel is not folded, since that makes sure the following fusion is triggered:
[image]

Can we expect support for this in the next release? It would finally unblock converting TF 2.x models with Q/DQ ops to ONNX and then to TensorRT.

@TomWildenhain-Microsoft
Contributor

Also, there is a significant improvement in latency and performance when the Q/DQ op on the weight kernel is not folded.

To clarify here, the fusion pattern as I understand it should match so long as the DQ op on the weight kernel is not folded. Folding the Q should be fine. Note how in your image W' has a DQ but no Q is needed.

wouldn't it be much easier to add any additional ops (in this case QuantizeLinear and DequantizeLinear) after performing the transpose during conv op conversion?

It would be possible to do this with conv. I don't know if it would be easier than adding the Quantize op to constant folding.

@srihari-humbarwadi
Author

srihari-humbarwadi commented Oct 17, 2021

Setting the grappler config option rewrite_options.experimental_disable_folding_quantization_emulation = True here seems to disable the constant folding of Q/DQ ops on the Conv2D kernel:

config = config_pb2.ConfigProto()
rewrite_options = config.graph_options.rewrite_options
config.graph_options.infer_shapes = True
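
Concretely, the tweak amounts to setting that flag where the snippet above builds the grappler config; a sketch (the import path shown is, to my understanding, what tf2onnx itself uses for config_pb2):

from tensorflow.core.protobuf import config_pb2

config = config_pb2.ConfigProto()
rewrite_options = config.graph_options.rewrite_options
config.graph_options.infer_shapes = True
# Keep grappler from folding the quantization-emulation (QDQ) ops on constant inputs,
# such as the Conv2D kernel.
rewrite_options.experimental_disable_folding_quantization_emulation = True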


[image]

It would be possible to do this with conv. I don't know if it would be easier than adding the Quantize op to constant folding.

There are a few other cases where the inputs are not constants, so I was hoping to arrive at a more general solution for this.
For example, in the above graph the Squeeze op should be placed before the QuantizeLinear op.

Is there a way to always place tensor shape manipulation ops before the QuantizeLinear op? Or, a better question: is there a way to make sure a QuantizeLinear op is always immediately followed by a DequantizeLinear op?

Since Quantize ops are always immediately followed by Dequantize ops when building TensorFlow graphs with QuantizeAndDequantizeV2 nodes, making sure this stays true during ONNX conversion seems pretty important; otherwise the TensorRT fusion patterns fail to trigger.
@TomWildenhain-Microsoft

@TomWildenhain-Microsoft
Contributor

Setting grappler config rewrite_options.experimental_disable_folding_quantization_emulation = True here seems to disable the constant folding of Q/QD ops on Conv2D kernel

Nice work! We should update tf2onnx to have this flag. Can you submit a PR?

For example in the above graph, the squeeze op should be placed before the QuantizeLinear op.

Are you sure about this? It was my understanding from the image you gave that the fusion doesn't really need the Quantize to be followed by a Dequantize; rather, it needs the op being fused (Conv, MatMul, etc.) to be surrounded by Dequantize/Quantize ops on its inputs/outputs, respectively (outputs are optional in some cases):

[image]

I don't see any boxes around Q/DQ pairs. If the Squeeze were moved up, the GlobalAveragePool op would not be eligible for fusing.

@srihari-humbarwadi
Author

srihari-humbarwadi commented Oct 18, 2021

I was under the same impression, but when I built the TensorRT engine from the ONNX graph, I could see some dangling Q ops left in the graph, which show up as warnings. I raised this on the TensorRT forum (tensorrt forum comment) and was told to move the Squeeze op so that the Q is followed by a DQ op. Once I pushed the Squeeze before the Q op, I saw some improvement in inference latency and all the warnings about missing quantization data went away.
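
As a quick sanity check for this, a small script with the onnx package can flag QuantizeLinear outputs that are not consumed directly by a DequantizeLinear (the "dangling Q" pattern); dangling_quantize_nodes is just an illustrative name:

import onnx

def dangling_quantize_nodes(model_path):
    graph = onnx.load(model_path).graph
    # Map each tensor name to the nodes that consume it.
    consumers = {}
    for node in graph.node:
        for inp in node.input:
            consumers.setdefault(inp, []).append(node)
    dangling = []
    for node in graph.node:
        if node.op_type == "QuantizeLinear":
            users = consumers.get(node.output[0], [])
            if not users or any(u.op_type != "DequantizeLinear" for u in users):
                dangling.append(node.name)
    return dangling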

There is yet another change that I had to make in the ONNX optimizers to achieve optimal fusion during TensorRT engine building. TensorRT recommends (tensorrt forum comment) having a Q/DQ pair on the identity branch of the ResNet block before the Add op. But when I added the Q/DQ on the identity branch, the MergeDuplicatedNodesOptimizer merged the duplicated Q/DQ ops, as can be seen in the graph on the right; changing the check here to if node.type in ["Identity", "QuantizeLinear", "DequantizeLinear"]: avoids this merge:

if node.type in ["Identity"]:
    return True
if node.is_graph_input():
    return True

The graph on the left is what I get after making the change.
[image]

@TomWildenhain-Microsoft
Contributor

If merging two identical nodes causes an issue, I'd really consider that a problem with TensorRT. But we could still exempt them. It almost seems like the best way to deal with Q/DQ would be in a post-processing phase after all the optimizers run. During that phase we could detect all the Q/DQ pairs, propagate quantization info according to rules (like the Relu/Add case), and then write Q/DQ pairs back into the graph.

@fatcat-z
Collaborator

Resolved this issue with PR #1856. Closing it.
