Support bigbird ONNX export with attention_type == "block_sparse" #754

Open
harindercnvrg opened this issue Feb 7, 2023 · 9 comments
Labels: feature-request, onnx

harindercnvrg commented Feb 7, 2023

System Info

CPU

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 57 bits virtual
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  2
Core(s) per socket:  32
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               106
Model name:          Intel(R) Xeon(R) Gold 6338N CPU @ 2.20GHz
Stepping:            6
CPU MHz:             2200.000
CPU max MHz:         3500.0000
CPU min MHz:         800.0000
BogoMIPS:            4400.00
Virtualization:      VT-x
L1d cache:           48K
L1i cache:           32K
L2 cache:            1280K
L3 cache:            49152K
NUMA node0 CPU(s):   0-31,64-95
NUMA node1 CPU(s):   32-63,96-127
python == 3.8.10

Installed packages:

absl-py==1.4.0
aiofiles==22.1.0
aiohttp==3.8.3
aiosignal==1.3.1
aiosqlite==0.18.0
anyio==3.6.2
argcomplete==1.10.3
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
arrow==1.2.3
asttokens==2.2.1
async-timeout==4.0.2
attrs==22.2.0
autograd==1.5
azure-core==1.10.0
azure-storage-blob==12.6.0
Babel==2.11.0
backcall==0.2.0
backports.zoneinfo==0.2.1
beautifulsoup4==4.8.2
bleach==6.0.0
boto3==1.26.64
botocore==1.29.64
cachetools==5.3.0
certifi==2022.12.7
cffi==1.15.1
chardet==3.0.4
charset-normalizer==2.1.1
click==8.1.3
cma==2.7.0
cnvrg==0.7.54
colorama==0.4.6
coloredlogs==15.0.1
comm==0.1.2
compressed-rtf==1.0.6
contourpy==1.0.7
croniter==1.3.8
cryptography==39.0.0
cycler==0.11.0
datasets==2.9.0
debugpy==1.6.6
decorator==5.1.1
defusedxml==0.7.1
dill==0.3.6
docx2txt==0.8
ebcdic==1.1.1
evaluate==0.4.0
executing==1.2.0
extract-msg==0.28.7
fastjsonschema==2.16.2
filelock==3.9.0
flatbuffers==23.1.21
fonttools==4.38.0
fqdn==1.5.1
frozenlist==1.3.3
fsspec==2023.1.0
future==0.18.3
gitdb==4.0.10
GitPython==3.1.30
google-api-core==2.11.0
google-auth==2.16.0
google-auth-oauthlib==0.4.6
google-cloud-core==2.3.2
google-cloud-storage==2.7.0
google-crc32c==1.5.0
google-resumable-media==2.4.1
googleapis-common-protos==1.58.0
grpcio==1.51.1
huggingface-hub==0.12.0
humanfriendly==10.0
idna==2.10
IMAPClient==2.1.0
importlib-metadata==6.0.0
importlib-resources==5.10.2
ipykernel==6.21.1
ipython==8.9.0
ipython-genutils==0.2.0
isodate==0.6.1
isoduration==20.11.0
jedi==0.18.2
Jinja2==3.1.2
jmespath==1.0.1
joblib==1.2.0
json5==0.9.11
jsonpointer==2.3
jsonschema==4.17.3
jstyleson==0.0.2
jupyter-client==8.0.2
jupyter-core==5.2.0
jupyter-events==0.5.0
jupyter-server==2.2.1
jupyter-server-fileid==0.6.0
jupyter-server-mathjax==0.2.6
jupyter-server-terminals==0.4.4
jupyter-server-ydoc==0.6.1
jupyter-ydoc==0.2.2
jupyterlab==3.6.1
jupyterlab-git==0.41.0
jupyterlab-pygments==0.2.2
jupyterlab-server==2.19.0
kiwisolver==1.4.4
lxml==4.9.2
Markdown==3.4.1
MarkupSafe==2.1.2
matplotlib==3.6.3
matplotlib-inline==0.1.6
mistune==2.0.4
mpmath==1.2.1
msrest==0.6.21
multidict==6.0.4
multiprocess==0.70.14
natsort==8.2.0
nbclassic==0.5.1
nbclient==0.7.2
nbconvert==7.2.9
nbdime==3.1.1
nbformat==5.7.3
nest-asyncio==1.5.6
networkx==2.8.2
ninja==1.10.2.4
nncf==2.4.0
notebook==6.5.2
notebook-shim==0.2.2
numpy==1.23.4
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
oauthlib==3.2.2
olefile==0.46
onnx==1.12.0
onnxruntime==1.12.1
openvino==2022.3.0
openvino-telemetry==2022.3.0
optimum==1.6.3
optimum-intel==1.6.1
packaging==23.0
pandas==1.5.2
pandocfilters==1.5.0
parso==0.8.3
pdfminer.six==20191110
pexpect==4.8.0
pickleshare==0.7.5
Pillow==9.4.0
pkgutil-resolve-name==1.3.10
platformdirs==2.6.2
progress==1.6
prometheus-client==0.16.0
prompt-toolkit==3.0.36
protobuf==3.20.1
psutil==5.9.4
ptyprocess==0.7.0
pure-eval==0.2.2
pyaml==21.10.1
pyarrow==11.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser==2.21
pycryptodome==3.17
pydot==1.4.2
Pygments==2.14.0
pymoo==0.5.0
pyparsing==2.4.7
pyrsistent==0.19.3
python-dateutil==2.8.2
python-json-logger==2.0.4
python-pptx==0.6.21
pytz==2022.7.1
pytz-deprecation-shim==0.1.0.post0
PyYAML==6.0
pyzmq==25.0.0
regex==2022.10.31
requests==2.28.2
requests-oauthlib==1.3.1
responses==0.18.0
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rsa==4.9
s3transfer==0.6.0
scikit-learn==1.2.1
scipy==1.10.0
Send2Trash==1.8.0
sentencepiece==0.1.97
six==1.12.0
smmap==5.0.0
sniffio==1.3.0
sortedcontainers==2.4.0
soupsieve==2.3.2.post1
SpeechRecognition==3.8.1
stack-data==0.6.2
sympy==1.11.1
tensorboard==2.11.2
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
terminado==0.17.1
textract==1.6.5
texttable==1.6.7
threadpoolctl==3.1.0
tinycss2==1.2.1
tinynetrc==1.3.1
tokenizers==0.13.2
tomli==2.0.1
torch==1.13.1
torchvision==0.14.1
tornado==6.2
tqdm==4.64.1
traitlets==5.9.0
transformers==4.26.0
typing-extensions==4.4.0
tzdata==2022.7
tzlocal==4.2
uri-template==1.2.0
urllib3==1.25.11
wcwidth==0.2.6
webcolors==1.12
webencodings==0.5.1
websocket-client==1.5.1
Werkzeug==2.2.2
xlrd==1.2.0
XlsxWriter==3.0.8
xxhash==3.2.0
y-py==0.5.5
yarl==1.8.2
ypy-websocket==0.8.2
zipp==3.12.1

Who can help?

@lewtun @michaelbenayoun

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I converted the summarization model to ONNX and then ran it:

from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from datasets import load_dataset

billsum = load_dataset("billsum", split="ca_test")
billsum = billsum.train_test_split(test_size=0.2)
to_summarize = billsum["train"][0]['text']

model_id = "google/pegasus-pubmed"
model = ORTModelForSeq2SeqLM.from_pretrained(model_id, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("summarization", model=model, tokenizer=tokenizer)
prediction = pipe(to_summarize)

I also tried the OpenVINO runtime:

from transformers import AutoTokenizer, pipeline
from optimum.intel import OVModelForSeq2SeqLM  # note: the OpenVINO model classes live in optimum.intel (the original imported from optimum.onnxruntime)
from datasets import load_dataset

billsum = load_dataset("billsum", split="ca_test")
billsum = billsum.train_test_split(test_size=0.2)
to_summarize = billsum["train"][0]['text']

model_id = "google/pegasus-pubmed"
model = OVModelForSeq2SeqLM.from_pretrained(model_id, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("summarization", model=model, tokenizer=tokenizer)
prediction = pipe(to_summarize)

Expected behavior

This is supposed to provide faster inference than the original PyTorch model. Neither the ONNX nor the OpenVINO runtime improves speed; in fact, the inference time increases manyfold.

harindercnvrg added the bug label Feb 7, 2023

fxmarty commented Feb 7, 2023

Thank you! For OpenVINO, could you open an issue in https://github.com/huggingface/optimum-intel?

For ONNX Runtime, I suspect it is the same issue as #753.

harindercnvrg commented

I have raised the issue.
huggingface/optimum-intel#188

fxmarty commented Feb 11, 2023

@harindercnvrg Could you provide a reproduction script and the result of lscpu?

I would also recommend trying Optimum main, as a critical bug caused by the transformers 4.26.0 release was fixed there recently: #756. The fix will be included in the next release next week.

harindercnvrg commented

@fxmarty the system info provided at the top of the issue is from the lscpu command. I have also provided a reproduction script at the bottom of the issue I raised.

fxmarty commented Feb 13, 2023

Thanks @harindercnvrg, my bad, I missed the lscpu output. I meant a reproduction script with the time measured (the scripts above only run inference), so that I can try to reproduce the issue on my side.

harindercnvrg commented

Original code:

from transformers import BigBirdPegasusForConditionalGeneration, AutoTokenizer
# from breakup import breaker  # unused local helper, not needed to reproduce
from datasets import load_dataset
import time

billsum = load_dataset("billsum", split="ca_test")
billsum = billsum.train_test_split(test_size=0.2)
to_summarize = billsum["train"][0]['text']

tokenizer = AutoTokenizer.from_pretrained("google/bigbird-pegasus-large-pubmed")

model = BigBirdPegasusForConditionalGeneration.from_pretrained("google/bigbird-pegasus-large-pubmed")
tic=time.time()
inputs = tokenizer(to_summarize, return_tensors='pt')
prediction = model.generate(**inputs)
prediction = tokenizer.batch_decode(prediction)
toc=time.time()
print(" Time taken: ", toc-tic)

Using ONNX

from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from datasets import load_dataset
import time

billsum = load_dataset("billsum", split="ca_test")
billsum = billsum.train_test_split(test_size=0.2)
to_summarize = billsum["train"][0]['text']

model_id ="google/bigbird-pegasus-large-pubmed"
model = ORTModelForSeq2SeqLM.from_pretrained(model_id, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("summarization", model=model, tokenizer=tokenizer)
tic=time.time()
prediction = pipe(to_summarize)
toc=time.time()
print(" Time taken: ", toc-tic)

Using OpenVINO

from transformers import AutoTokenizer, pipeline
from optimum.intel import OVModelForSeq2SeqLM  # note: the OpenVINO model classes live in optimum.intel (the original imported from optimum.onnxruntime)
from datasets import load_dataset
import time

billsum = load_dataset("billsum", split="ca_test")
billsum = billsum.train_test_split(test_size=0.2)
to_summarize = billsum["train"][0]['text']

model_id = "google/bigbird-pegasus-large-pubmed"
model = OVModelForSeq2SeqLM.from_pretrained(model_id, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("summarization", model=model, tokenizer=tokenizer)
tic = time.time()  # start timing after pipeline construction, for consistency with the ONNX Runtime script above
prediction = pipe(to_summarize)
toc=time.time()
print(" Time taken: ", toc-tic)

fxmarty commented Feb 13, 2023

Thank you! I can reproduce the issue on main. Averaging over 5 runs for each, I get:

Average time (PyTorch eager): 29.747
Average time (ORT): 58.940

Note that when using ORTModelForSeq2SeqLM.from_pretrained(model_id, from_transformers=True), the PyTorch model is exported to ONNX on the fly. A log we get during the export is the following:

Attention type 'block_sparse' is not possible if sequence_length: 16 <= num global tokens: 2 * config.block_size + min. num sliding tokens: 3 * config.block_size + config.num_random_blocks * config.block_size + additional buffer: config.num_random_blocks * config.block_size = 704 with config.block_size = 64, config.num_random_blocks = 3. Changing attention type to 'original_full'...

This issue likely comes from there: the example input provided during the ONNX export is too short, so the export registers the wrong control flow, which is slow for long sequences (like the one in the benchmark).
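
As the log spells out, with config.block_size = 64 and config.num_random_blocks = 3 the block sparse path needs a sequence longer than (2 + 3 + 3 + 3) * 64 = 704 tokens; anything shorter silently falls back to "original_full", which is what happens with the short dummy input used during the export. A rough sketch of checking which path a given document would take (the threshold formula is copied from the warning above and should be treated as an approximation):

from transformers import AutoConfig, AutoTokenizer
from datasets import load_dataset

model_id = "google/bigbird-pegasus-large-pubmed"
config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Same dataset as in the benchmark scripts above
billsum = load_dataset("billsum", split="ca_test")
to_summarize = billsum[0]["text"]

# Minimum length for block sparse attention, per the warning:
# 2*block_size (global) + 3*block_size (sliding) + 2*num_random_blocks*block_size (random + buffer)
min_len = (5 + 2 * config.num_random_blocks) * config.block_size  # = 704 here

seq_len = tokenizer(to_summarize, return_tensors="pt")["input_ids"].shape[-1]
print(seq_len, "-> block_sparse" if seq_len > min_len else "-> would fall back to original_full")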

Thank you for notifying, will fix!

fxmarty commented Feb 14, 2023

Hi @harindercnvrg, I investigated the issue a bit, and there is a critical issue with the ONNX export of BigBird, given that part of BigBird's block sparse attention is written in NumPy and pure Python.
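
To illustrate the underlying problem with a toy example (the NumpyInForward module below is hypothetical, not BigBird code): anything computed with NumPy or plain Python inside forward() is invisible to the torch.onnx.export tracer, so it is evaluated once at export time and frozen into the graph as constants instead of being exported as ONNX ops.

import numpy as np
import torch

class NumpyInForward(torch.nn.Module):
    def forward(self, x):
        # Pure NumPy / Python: the tracer cannot see this, so whatever permutation
        # happens to be drawn during the export run is baked into the ONNX graph.
        idx = np.random.permutation(x.shape[-1])
        return x[..., torch.as_tensor(idx)]

torch.onnx.export(NumpyInForward(), torch.randn(2, 8), "toy.onnx", opset_version=13)
# Every inference of toy.onnx reuses the same "random" indices; data-dependent
# computation written in NumPy cannot be captured in the exported graph.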

Up to now, BigBird has only ever been exported with attention_type == "original_full" (as the dummy sequence length was too short) and never with attention_type == "block_sparse" (which is arguably the interesting case). That being said, I understand this only influences the encoder, and it would be worth checking the decoder on its own.

I worked a bit on rewriting BigBird in pure PyTorch, which went fine, but I am now hitting the issue that torch.onnx.export is extremely slow when exporting the block sparse attention case.

For now, I would recommend sticking with the PyTorch implementation, or maybe the TensorFlow XLA one if you can find it.

fxmarty commented Feb 17, 2023

Hi @harindercnvrg, we will remove support for bigbird and bigbird-pegasus from the ONNX export in #778 due to this issue.

A large chunk of bigbird's implementation in transformers is written in NumPy and pure Python, which makes it unfit for the ONNX export. I tried to rewrite it in pure PyTorch, which succeeded, but then the export becomes prohibitively slow.

If you would like to have a look and try to solve the issue, you can start from: huggingface/transformers@main...fxmarty:transformers:use-torch-bigbird

fxmarty closed this as completed Feb 20, 2023
fxmarty reopened this Feb 20, 2023
fxmarty changed the title from "Inference worse on onnx runtime and openvino runtime for converted seq2seq models on CPU" to "Support bigbird ONNX export with attention_type == "block_sparse"" Feb 20, 2023
fxmarty added the feature-request and onnx labels and removed the bug label Feb 20, 2023