
Failure of reproducing sparsification of MobileBERT-oBERT-SQuAD #1534

Closed · absol13 opened this issue Apr 24, 2023 · 4 comments · Fixed by #1539
Labels: bug

absol13 commented Apr 24, 2023

Describe the bug
I am trying to reproduce the sparsification process of the MobileBERT-oBERT-SQuAD model (zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/14layer_pruned50_quant-none-vnni).
I trained the base model (zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base-none) with the 32-epoch recipe from SparseZoo. However, the step that converts the sparsified model is missing from the model description, so I used the sparseml.transformers.export_onnx tool to convert it to ONNX format.
Our new sparsified model is much slower than the released ONNX model: in our environment, its inference speed is about 7 times lower. What confuses me more is that the throughput of the model converted from the uploaded MobileBERT-oBERT-SQuAD checkpoint is also lower than that of the released model.
I also discovered that the deepsparse.analyze output for our new model differs from that of the released model: the released model's analysis reports the wall time of each layer in detail, while ours shows no model structure or per-layer analysis at all.
I wonder whether some step is missing when converting a sparsity-aware trained model to ONNX format.

I also found another issue reporting a similar case to mine: #1364.

Expected behavior
The 14layer_pruned50_quant-none-vnni model converted by sparseml.transformers.export_onnx from the officially released SparseZoo checkpoint should show throughput similar to that of the released ONNX model.

Environment
Include all relevant environment information:

  1. OS: Ubuntu 20.04
  2. Python version: 3.8
  3. SparseML version or commit hash: 1.4.4 (official latest)
  4. ML framework version(s): torch 1.12.0+cu113
  5. Other Python package versions: onnx 1.12.0, DeepSparse 1.4.2
  6. Other relevant environment information [e.g. hardware, CUDA version]: CUDA 11.3
  • DeepSparse test environment: Ubuntu 18.04, Intel(R) Xeon(R) Gold 6242 CPU (other packages have the same versions as above)

To Reproduce
sparseml.transformers.export_onnx --task question-answering --model_path ./ --sequence_length 384
(You can reproduce this using the checkpoint at zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/14layer_pruned50_quant-none-vnni.)
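
For reference, here is a minimal sketch of how such a throughput comparison can be run (assuming the deepsparse.compile_model Python API and the three int64 inputs produced by the Hugging Face question-answering export -- input_ids, attention_mask, token_type_ids; input names, dtypes, and file paths are placeholders and may differ for other exports):

# Rough throughput comparison sketch; file paths and input layout are
# assumptions, not part of the original report.
import time

import numpy as np
from deepsparse import compile_model

BATCH, SEQ_LEN = 16, 384

def items_per_second(onnx_path: str, iters: int = 20) -> float:
    engine = compile_model(onnx_path, batch_size=BATCH)
    # Three int64 inputs of shape [batch, seq_len], as in the HF QA export.
    inputs = [np.ones((BATCH, SEQ_LEN), dtype=np.int64) for _ in range(3)]
    engine.run(inputs)  # warmup
    start = time.perf_counter()
    for _ in range(iters):
        engine.run(inputs)
    return BATCH * iters / (time.perf_counter() - start)

print("exported:", items_per_second("model.onnx"))
print("released:", items_per_second("released_model.onnx"))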

Errors

Analysis result of the newly converted model

Name                        | OutDims                    | KerDims                    | Strides      | ActSpars | Time(ms) |  Util(%) | TFLOPS   | Canonical Name
Naive Subgraph 0            | [-1, -1, -1, -1, -1]       | []                         | [-1, -1, -1] |   0.0000 | 2782.3520 |   0.0000 |   0.0000 | <none>
Total Time(MS): 2782.352000
Items per second: 5.750530
Batch Size: 16
Number of threads: 1

Analysis result of the officially released model by deepsparse.analyze

...
pyramid_247                 | [16, 384, 1, 1, 1]         | []                         | [1, 1, 1]    |   0.0000 |   0.0400 |   0.0000 |   0.0000 |
  sub_pyramid               | []                         | []                         | []           |   0.0000 |   0.0000 |   0.0000 |   0.0000 |
  shuffle                   | [16, 384, 1, 1, 2]         | []                         | []           |   0.0000 |   0.0270 |      nan |      nan | <none>
  shuffle                   | [16, 384, 1, 1, 1]         | []                         | []           |   0.0000 |   0.0050 |      nan |      nan | <none>
  shuffle                   | [16, 384, 1, 1, 1]         | []                         | []           |   0.0000 |   0.0040 |      nan |      nan | <none>
Naive Subgraph 1            | [-1, -1, -1, -1, -1]       | []                         | [-1, -1, -1] |   0.0000 |   0.1200 |   0.0000 |   0.0000 | <none>
Total Time(MS): 1677.556000
Items per second: 9.537685
Batch Size: 16
Number of threads: 1

== Layer Breakdown ==
Name                           | Summed Time | Percent Taken
Naive Subgraph 0               |    4.384    | 0.26%
shuffle                        |  296.853    | 17.92%
gemm                           |  791.355    | 47.76%
  kernel=[384, 512, 1, 1, 1]   |   24.326    | 1.47%
  kernel=[512, 128, 1, 1, 1]   |  258.028    | 15.57%
  kernel=[128, 512, 1, 1, 1]   |  166.731    | 10.06%
  kernel=[128, 128, 1, 1, 1]   |   70.079    | 4.23%
  kernel=[512, 2, 1, 1, 1]     |    1.248    | 0.08%
elementwise                    |   66.248    | 4.00%
ks_gemm                        |  375.736    | 22.68%
  kernel=[128, 512, 1, 1, 1]   |  355.582    | 21.46%
  kernel=[128, 128, 1, 1, 1]   |    5.504    | 0.33%
  kernel=[512, 128, 1, 1, 1]   |   14.650    | 0.88%
softmax                        |  122.250    | 7.38%
Naive Subgraph 1               |    0.120    | 0.01%
== Summed Total Time: 1656.9460 ms
== Items per second: 9.6563


mgoin commented Apr 24, 2023

Hi @absol13, is there a warning about dynamic shapes? It looks like no operations are being run in the engine. Please run the DeepSparse benchmark with a static shape for this test, for example: deepsparse.benchmark model.onnx --input_shapes "[1,384]"
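
If it helps, one quick way to check whether the exported graph actually has dynamic dimensions is to inspect its inputs with the onnx package (a rough sketch, not the engine's own check; the file name is a placeholder):

# Print each graph input's shape; a symbolic dim_param or a missing
# dim_value indicates a dynamic dimension.
import onnx

model = onnx.load("model.onnx")
for inp in model.graph.input:
    dims = [d.dim_param if d.dim_param else d.dim_value
            for d in inp.type.tensor_type.shape.dim]
    print(inp.name, dims)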

absol13 commented Apr 25, 2023

Hello, and thanks for your fast response, but this does not seem to be a dynamic-input-shape problem. Here is another odd benchmark result for the model converted by sparseml.transformers.export_onnx:

2023-04-25 02:22:35 deepsparse.benchmark.benchmark_model INFO     deepsparse.engine.Engine:
        onnx_file_path: model.onnx
        batch_size: 16
        num_cores: 1
        num_streams: 1
        scheduler: Scheduler.default
        fraction_of_supported_ops: 0.0
        cpu_avx_type: avx2
        cpu_vnni: False

As shown above, fraction_of_supported_ops is 0.0, in contrast to a value close to 1 for the officially released model. It seems the DeepSparse engine cannot make sense of models generated by sparseml.transformers.export_onnx.
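
Comparing the op-type counts of the two graphs with the onnx package can help show whether the exported graph is structurally different from the released one (a rough sketch; the file paths are placeholders for the two models being compared):

# Compare op-type histograms of the exported and released ONNX graphs.
from collections import Counter

import onnx

def op_histogram(path: str) -> Counter:
    model = onnx.load(path)
    return Counter(node.op_type for node in model.graph.node)

exported = op_histogram("model.onnx")
released = op_histogram("released_model.onnx")
for op in sorted(set(exported) | set(released)):
    print(f"{op:24s} exported={exported[op]:4d} released={released[op]:4d}")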

mgoin commented Apr 25, 2023

Thanks for the additional detail, @absol13! I was able to replicate the problem and isolate it to the ONNX post-processing step of the export process. We are working on a fix.

In the image below, the ONNX model from the SparseZoo is on the left and the broken exported model is on the right. There is a single MatMul that wasn't folded properly during the ONNX export.
[Screenshot (2023-04-25): side-by-side comparison of the SparseZoo ONNX graph (left) and the exported ONNX graph (right)]
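
As a rough way to spot this kind of leftover node (a sketch, not the actual export check; the file path is a placeholder), you can look for float MatMul nodes that still read a weight directly from an initializer -- in a fully folded and quantized export these are converted or folded away:

# Flag float MatMul nodes whose weight comes straight from an initializer;
# in a properly folded/quantized export these are converted (e.g. to
# MatMulInteger) or folded away.
import onnx

model = onnx.load("model.onnx")
initializer_names = {init.name for init in model.graph.initializer}

for node in model.graph.node:
    if node.op_type == "MatMul" and any(name in initializer_names for name in node.input):
        print("unfolded MatMul:", node.name or node.output[0])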

mgoin linked a pull request Apr 26, 2023 that will close this issue
absol13 commented Apr 27, 2023

Thanks for your fast support.
