28 Apr 16:44

echarlaix

a001ded

v1.8.3: Patch release

Fix Stable Diffusion model ONNX export by @echarlaix in #1020
Add optimum-neuron extra by @michaelbenayoun in #1021

Full Changelog: v1.8.2...v1.8.3

Contributors

michaelbenayoun and echarlaix

17 Apr 13:30

fxmarty

v1.8.2

6b8f1fd

v1.8: extended BetterTransformer support, ONNX merged seq2seq models

Extended BetterTransformer support

Various improvements in the PyTorch BetterTransformer integration.

[BT] add BetterTransformer support for ProphetNet by @hirotasoshu in #923
Improve bettertransformer benchmark script by @fxmarty in #939
Fix sdpa with batch size = 1, better benchmark by @fxmarty in #915
Fix slow tests & sdpa dropout by @fxmarty in #974
Remove getattr overhead in spda by @fxmarty in #934
[BT] Improve docs by @younesbelkada in #944

ONNX merged seq2seq models

Instead of using two separate decoder_model.onnx and decoder_with_past_model.onnx models, a single decoder can be used for encoder-decoder models: decoder_model_merged.onnx. This avoids duplicating weights across the without-past and with-past ONNX models.

By default, if available, the decoder_model_merged.onnx will be used in the ORTModel integration. This can be disabled with the option --no-post-process in the ONNX export CLI, and with use_merged=False in the ORTModel.from_pretrained method.

Example:

optimum-cli export onnx --model t5-small t5_onnx

will give:

└── t5_onnx
    ├── config.json
    ├── decoder_model_merged.onnx
    ├── decoder_model.onnx
    ├── decoder_with_past_model.onnx
    ├── encoder_model.onnx
    ├── generation_config.json
    ├── special_tokens_map.json
    ├── spiece.model
    ├── tokenizer_config.json
    └── tokenizer.json

decoder_model_merged.onnx alone is enough for inference. We strongly recommend inspecting the subgraphs with netron to understand the inputs and outputs, in case the exported model is to be used with an engine other than ONNX Runtime in the Optimum integration.
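The file selection described above can be sketched as follows. This is a minimal illustration and not Optimum's implementation: `select_decoder_files` is a hypothetical helper, and only the file names and the `use_merged` flag come from the notes above.

```python
from pathlib import Path
from typing import List

MERGED = "decoder_model_merged.onnx"
SEPARATE = ["decoder_model.onnx", "decoder_with_past_model.onnx"]


def select_decoder_files(model_dir: Path, use_merged: bool = True) -> List[str]:
    """Hypothetical helper: pick the decoder file(s) to load from an export directory."""
    if use_merged and (model_dir / MERGED).is_file():
        # A single merged decoder covers both the without-past and with-past steps.
        return [MERGED]
    # Otherwise fall back to the two separate decoders, which must both be present.
    missing = [name for name in SEPARATE if not (model_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(f"Missing decoder file(s): {missing}")
    return list(SEPARATE)
```

Called on the t5_onnx directory listed above, this sketch would return only the merged file by default, and the two separate decoders when `use_merged=False` is passed.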

Fix encoder-decoder ONNX merge by @fxmarty in #924
Support the merge of decoder without/with past for encoder-decoder models in the ONNX export by @fxmarty in #926
Support merged seq2seq models in ORTModel by @fxmarty in #930

New models in the ONNX export

Add llama onnx export & onnxruntime support by @nenkoru in #975

Major bugfix

Remove constant output in encoder-decoder ONNX models decoder with past by @fxmarty in #920
Hash tensor data during deduplication by @VikParuchuri in #932

Potentially breaking changes

The TasksManager replaces legacy task names with the canonical ones used on the Hub and in transformers metadata:

sequence-classification becomes text-classification,
causal-lm becomes text-generation,
seq2seq-lm becomes text2text-generation,
speech2seq-lm and audio-ctc become automatic-speech-recognition,
default becomes feature-extraction,
masked-lm becomes fill-mask,
vision2seq-lm becomes image-to-text

This should not break anything except if you rely on private methods and attributes from TasksManager.
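For scripts that still carry legacy task names, the renaming above can be expressed as a lookup table. The sketch below is illustrative only: `to_canonical_task` is a hypothetical helper, not part of TasksManager, and carrying a `-with-past` suffix over unchanged is an assumption.

```python
LEGACY_TO_CANONICAL = {
    "sequence-classification": "text-classification",
    "causal-lm": "text-generation",
    "seq2seq-lm": "text2text-generation",
    "speech2seq-lm": "automatic-speech-recognition",
    "audio-ctc": "automatic-speech-recognition",
    "default": "feature-extraction",
    "masked-lm": "fill-mask",
    "vision2seq-lm": "image-to-text",
}

WITH_PAST = "-with-past"


def to_canonical_task(task: str) -> str:
    """Hypothetical helper: translate a legacy task name, leaving other names untouched."""
    # Assumption: a "-with-past" suffix is carried over unchanged to the canonical name.
    suffix = WITH_PAST if task.endswith(WITH_PAST) else ""
    base = task[: len(task) - len(suffix)] if suffix else task
    return LEGACY_TO_CANONICAL.get(base, base) + suffix
```

Names that are already canonical, such as fill-mask, pass through unchanged.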

Allow to use a custom class in TasksManager & use canonical tasks names by @fxmarty in #967

What's Changed

Update ort trainer to transformers 4.27.2 by @JingyaHuang in #917
Compute Loss inside the training step. by @AdamLouly in #686
Fix ORTModel MRO for whisper by @fxmarty in #919
add ORTStableDiffusionPipeline reference in documentation by @echarlaix in #890
Fix decoder ONNX model loading from the Hub by @fxmarty in #929
optimun-cli onnxruntime quantize / optimize output argument is now required by @michaelbenayoun in #927
Register mechanism for the Optimum CLI by @michaelbenayoun in #928
Ensure backward compatibility of ORTModel by @fxmarty in #933
Update the README by @michaelbenayoun in #925
Update README by @echarlaix in #941
Update readme by @echarlaix in #942
Remove GC from README by @michaelbenayoun in #943
Add user and token for CI by @michaelbenayoun in #945
Update README by @echarlaix in #946
optimum-cli print the help of subcommands by @michaelbenayoun in #940
Remove from_transformers references from the documentation by @fxmarty in #935
Turn command import into optional by @JingyaHuang in #936
Auto-set use_merged to False if use_cache is passed as False by @fxmarty in #954
Raise error with use_cache=False, use_io_binding=True by @fxmarty in #955
Add an ORT training notebook by @JingyaHuang in #959
Fix issue with doc build sometimes failing silently in GH workflows by @regisss in #960
Fix typos by @regisss in #963
Disable tests upon transformers 4.28 release by @fxmarty in #976

New Contributors

@hirotasoshu made their first contribution in #923
@VikParuchuri made their first contribution in #932

Full Changelog: v1.7.3...v1.8.2

Contributors

VikParuchuri, fxmarty, and 8 other contributors

23 Mar 16:37

fxmarty

v1.7.3

3685483

v1.7.3: Patch release for PyTorch 2.0 and transformers 4.27.0

This patch release fixes a few bugs with the PyTorch 2.0 release, and includes a few new features as well.

Breaking change: constant outputs removed from ONNX encoder-decoder models

We removed some constant past key values outputs from encoder-decoder models in the ONNX export. Beware that this could break your existing code, but we recommend using the newly exported models, as this removes unnecessary Identity nodes in the models.

Remove constant outputs from decoder with past ONNX model for encoder-decoder architectures by @fxmarty in #872

`torch.nn.functional.scaled_dot_product_attention` support for decoders in BetterTransformer

PyTorch 2.0 introduces, in beta, torch.nn.functional.scaled_dot_product_attention, a fast path for attention that extends PyTorch's accelerated transformer features. It is included in optimum.bettertransformer and can be used with the following architectures: Bart, Blenderbot, GPT2, GPT-J, M2M100, Marian, Mbart, OPT, Pegasus, T5.

Beware that this is still experimental and speedups have yet to be validated on all architectures.

PyTorch's scaled_dot_product_attention makes flash attention and memory-efficient attention available natively in PyTorch.

Usage is as follows:

from transformers import AutoTokenizer, AutoModelForCausalLM
from optimum.bettertransformer import BetterTransformer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

model = BetterTransformer.transform(model)  # modify transformers modeling to use native scaled_dot_product_attention

# do your inference or training here

model = BetterTransformer.reverse(model)  # go back to using canonical transformers modeling
model.save_pretrained("gpt2_model")
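For reference, the basic operation behind scaled_dot_product_attention is softmax(Q K^T / sqrt(d)) V. The pure-Python sketch below shows only that arithmetic on small lists. It omits masking, dropout and the fused kernels that provide the actual speedups, and the helper names are illustrative rather than PyTorch's.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(query: Matrix, key: Matrix, value: Matrix) -> Matrix:
    """Reference computation softmax(Q K^T / sqrt(d)) V, without masking or dropout."""
    scale = 1.0 / math.sqrt(len(query[0]))
    output = []
    for q in query:
        # Scaled similarity of this query with every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in key]
        weights = softmax(scores)
        # Weighted average of the value rows.
        output.append(
            [sum(w * v[j] for w, v in zip(weights, value)) for j in range(len(value[0]))]
        )
    return output
```

A query that is equally similar to all keys yields the plain average of the value rows, which makes a convenient sanity check.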

Inference benchmark (on fp16):

Model	batch size	Input sequence length	Generated tokens	Latency eager (s)	Latency BT (s)	Speedup	Peak memory eager (MB)	Peak memory BT (MB)	Memory savings
gpt2	1	64	256	1.800	1.607	12.0%	569.90	569.89	0%
gpt2	64	64	256	2.159	1.617	33.5%	2067.45	2093.80	0%
opt-1.3b	1	64	256	3.010	2.667	12.9%	5408.238	5408.238	0%
gpt-neox-20b	1	64	256	10.869	9.937	9.4%	83670.67	83673.53	0%

Training benchmark (on fp16):

Model	batch size	Sequence length	time/epoch (eager, s)	time/epoch (BT, s)	Speedup	Peak memory eager (MB)	Peak memory BT (MB)	Memory savings
gpt2	8	1024	17.732	14.037	26.3%	13291.16	10191.52	30.4%
gpt2	32	1024	17.336	13.309	30.3%	52834.83	38858.56	36.0%
gpt2	64	1024	OOM	14.067	/	OOM	75600.08	/
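The non-zero entries in the Speedup and Memory savings columns are consistent with the ratio of the eager figure to the BetterTransformer figure, minus one. That formula is inferred from the numbers in the tables and is not stated by the benchmark scripts. A small sketch of the arithmetic, with `relative_gain` as an illustrative name:

```python
def relative_gain(eager: float, bt: float) -> float:
    """Relative gain of BetterTransformer over eager mode, in percent."""
    return (eager / bt - 1.0) * 100.0


# gpt2 inference, batch size 1: latency 1.800 s (eager) vs 1.607 s (BT)
latency_gain = relative_gain(1.800, 1.607)
# gpt2 training, batch size 8: peak memory 13291.16 MB (eager) vs 10191.52 MB (BT)
memory_gain = relative_gain(13291.16, 10191.52)

print(f"{latency_gain:.1f}%")
print(f"{memory_gain:.1f}%")
```

Rounded to one decimal, these two calls reproduce the 12.0% speedup and 30.4% memory savings reported in the tables.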

Benchmarks can be reproduced using the inference script and training script:

python benchmark_bettertransformer.py --model-name gpt2 --use-half --use-cuda --is_decoder --num-batches 5 --max_token 256
python benchmark_bettertransformer.py --model-name gpt2 --use-half --use-cuda --is_decoder --num-batches 5 --max_token 256 --seqlen-stdev 0

Add scaled_dot_product_attention support for decoder models by @fxmarty in #853
Support scaled_dot_product_attention for t5 by @fxmarty in #856
[BT] add decoder benchmark script by @younesbelkada in #857
[BT] Fix bt benchmark by @younesbelkada in #858
Fix pytorch version check in bettertransformer by @fxmarty in #862
[BT] Add fp16 support by @younesbelkada in #859
[BT] Add decoder training support by @younesbelkada in #860
Bart support scaled_dot_product_attention by @fxmarty in #863
[BT] add accelerate_test markers by @younesbelkada in #864
Mbart, pegasus, blenderbot, marian, m2m_100 support scaled_dot_product_attention by @fxmarty in #865
Add bettertransformer reverse transform by @fxmarty in #868
Add bettertransformer training benchmark script by @fxmarty in #873

New architectures in the ONNX export

Three additional architectures are supported in the ONNX export: ImageGPT, RegNet, OPT.

Adding ONNX support for ImageGPT by @adit299 in #819
Add ONNX support for RegNet by @asrimanth in #833
Adding support for Facebook's OPT models by @hivaze in #852

(WIP) TFLite export with quantization support

Continued progress in the TFLite export with quantization support. This is work in progress and not documented yet.

Quantization with TFLite by @michaelbenayoun in #854

Bugfixes and improvements

Update documentation by @echarlaix in #843
Fix typo in documentation by @regisss in #848
Remove redundant code by @mht-sharma in #841
Update README by @echarlaix in #850
Update documentation by @echarlaix in #855
Remove iobinding ORTModelForCTC by @mht-sharma in #840
Fix typo in documentation by @echarlaix in #861
Fix causal-lm ONNX axis names by @fxmarty in #871
add NNCF openvino notebook by @echarlaix in #875
Remove positional-only parameters not support by python < v3.8 by @echarlaix in #881
lazy import for task manager by @JingyaHuang in #844
Remove onnx and ort dependencies on the TasksManager by @michaelbenayoun in #846
Reactivate export & optimization tests for causal-lm models by @fxmarty in #885
Fix ONNX export on transformers 4.27 release by @fxmarty in #884
Do not use scaled_dot_product_attention for stable diffusion onnx export by @fxmarty in #888
Fix loading of an ONNX stable diffusion model when config doesn't match by @echarlaix in #887
Automatic framework detection in TasksManager for large models by @fxmarty in #883
Fix WavLM onnx export upon torch 2.0 release by @fxmarty in #889
Fix PushToHubMixin._create_repo according to transformers 4.27 release by @fxmarty in #892
Fix stable diffusion framework detection by @fxmarty in #893
Add donut CPU inference ORT by @mht-sharma in #761
Fix check_model for large merged ONNX models by @fxmarty in #896
Drop python 3.7 support by @fxmarty in #891
Fix dummy label generator for vision tasks by @JingyaHuang in #900
Add stable diffusion dummy object by @echarlaix in #899
Automatic support for large ONNX models in ORTOptimizer by @fxmarty in #886
Remove subprocess calls in ONNX export by @fxmarty in #897
Registering mechanism for the TasksManager by @michaelbenayoun in https://github.com/huggingface/optimum/pull...

Contributors

fxmarty, regisss, and 10 other contributors

03 Mar 13:41

fxmarty

v1.7.1

8252f4b

v1.7.1: Patch release

Temporarily fix a critical bug in BetterTransformer #849

Full Changelog: v1.7.0...v1.7.1

02 Mar 12:32

fxmarty

v1.7.0

987b02e

v1.7.0: ONNX export extension, TFLite export, single-ONNX decoding, ONNX Runtime extension for audio, vision tasks, stable diffusion

New models supported in the ONNX export

Additional architectures are supported in the ONNX export: PoolFormer, Pegasus, Audio Spectrogram Transformer, Hubert, SEW, Speech2Text, UniSpeech, UniSpeech-SAT, Wav2Vec2, Wav2Vec2-Conformer, WavLM, Data2Vec Audio, MPNet, stable diffusion VAE encoder, vision encoder decoder, Nystromformer, Splinter, GPT NeoX.

Add PoolFormer support in exporters.onnx by @BakingBrains in #646
Support pegasus exporters by @mht-sharma in #620
Audio models support with optimum.exporters.onnx by @michaelbenayoun in #622
Add MPNet ONNX export by @jplu in #691
Add stable diffusion VAE encoder export by @echarlaix in #705
Add vision encoder decoder model in exporters by @mht-sharma in #588
Nystromformer ONNX export by @whr778 in #728
Support Splinter exporters (#555) by @Allanbeddouk in #736
Add gpt-neo-x support by @sidthekidder in #745

New models supported in BetterTransformer

A few additional architectures are supported in BetterTransformer: RoCBERT, RoFormer, Marian

Add RoCBert support for Bettertransformer by @shogohida in #542
Add better transformer support for RoFormer by @manish-p-gupta in #680
added BetterTransformer support for Marian by @IlyasMoutawwakil in #808

Additional tasks supported in the ONNX Runtime integration

The ONNX Runtime integration adds the following classes: ORTModelForMaskedLM, ORTModelForVision2Seq, ORTModelForAudioClassification, ORTModelForCTC, ORTModelForAudioXVector, ORTModelForAudioFrameClassification, ORTStableDiffusionPipeline.

Reference: https://huggingface.co/docs/optimum/main/en/onnxruntime/package_reference/modeling_ort and https://huggingface.co/docs/optimum/main/en/onnxruntime/usage_guides/models#export-and-inference-of-stable-diffusion-models

Add ORTModelForMaskedLM class by @JingyaHuang in #729
Add ORTModelForVision2Seq for VisionEncoderDecoder models inference by @mht-sharma in #742
Add ORTModelXXX for audio by @mht-sharma in #774
Add stable diffusion onnx runtime pipeline by @echarlaix in #786

Support of the ONNX export from PyTorch on float16

In the ONNX export, it is possible to pass the options --fp16 --device cuda to export using float16 when a GPU is available, directly with the native torch.onnx.export.

Example: optimum-cli export onnx --model gpt2 --fp16 --device cuda gpt2_onnx/

Support ONNX export on torch.float16 type by @fxmarty in #749

TFLite export

TFLite export is now supported, with static shapes:

optimum-cli export tflite --help
optimum-cli export tflite --model bert-base-uncased --sequence_length 128 bert_tflite/

exporters.tflite initial support by @michaelbenayoun in #716
TFLite auto-encoder models by @michaelbenayoun in #757
[TFLite Export] Adds support for ResNet by @sayakpaul in #813

ONNX Runtime optimization and quantization directly in the CLI

Add optimize and quantize command CLI by @jplu in #700
Support ONNX Runtime optimizations in exporters.onnx by @fxmarty in #807

The ONNX export can optionally apply ONNX Runtime optimizations directly during the export, by passing an option from --optimize O1 up to --optimize O4:

optimum-cli export onnx --help
optimum-cli export onnx --model t5-small --optimize O3 t5small_onnx/

ONNX Runtime quantization is supported directly in command line, using optimum-cli onnxruntime quantize:

optimum-cli onnxruntime quantize --help
optimum-cli onnxruntime quantize --onnx_model distilbert_onnx --avx512

ONNX Runtime optimization is supported directly in command line, using optimum-cli onnxruntime optimize:

optimum-cli onnxruntime optimize --help
optimum-cli onnxruntime optimize --onnx_model distilbert_onnx -O3

ORTModelForCausalLM supports decoding with a single ONNX

Up to now, for decoders, two ONNX models were used:

One handling the first forward pass where no past key values have been cached yet - thus not taking them as input.
One handling the following forward pass where past key values have been cached, thus taking them as input.

This release introduces support, in the ONNX export and in ORTModelForCausalLM, for a single ONNX handling both steps of the decoding. This reduces memory usage, as weights are not duplicated between two separate models during inference.

A single ONNX for decoders can be used by passing use_merged=True to ORTModelForCausalLM.from_pretrained, loading directly from a PyTorch model:

from optimum.onnxruntime import ORTModelForCausalLM

model = ORTModelForCausalLM.from_pretrained("gpt2", export=True, use_merged=True)

Alternatively, producing a single ONNX for decoders is the default behavior in the ONNX export, and the result can later be used, for example, with ORTModelForCausalLM. The command optimum-cli export onnx --model gpt2 gpt2_onnx/ will produce:

└── gpt2_onnx
    ├── config.json
    ├── decoder_model_merged.onnx
    ├── decoder_model.onnx
    ├── decoder_with_past_model.onnx
    ├── merges.txt
    ├── special_tokens_map.json
    ├── tokenizer_config.json
    ├── tokenizer.json
    └── vocab.json

The decoder_model.onnx and decoder_with_past_model.onnx are kept separate for backward compatibility, but during inference using solely decoder_model_merged.onnx is enough.
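The idea of one model serving both decoding steps can be illustrated with a toy stand-in. The sketch below is not an ONNX graph and does not use Optimum: `ToyMergedDecoder` and `greedy_decode` are hypothetical names, and the "weights" are a single integer. It only shows one set of weights shared by a branch without past and a branch with past.

```python
from typing import List, Optional, Tuple

Cache = List[int]


class ToyMergedDecoder:
    """Toy stand-in for a merged decoder: one set of weights, two code paths."""

    def __init__(self, weight: int) -> None:
        self.weight = weight  # stored once, used by both branches

    def __call__(self, tokens: List[int], past: Optional[Cache] = None) -> Tuple[int, Cache]:
        if past is None:
            # First forward pass: no cache yet, process the whole prompt.
            cache = [self.weight * t for t in tokens]
        else:
            # Following passes: only the newest token is processed.
            cache = past + [self.weight * tokens[-1]]
        return sum(cache) % 10, cache


def greedy_decode(decoder: ToyMergedDecoder, prompt: List[int], steps: int) -> List[int]:
    """Run the toy decoder step by step, feeding the cache back in."""
    tokens, past = list(prompt), None
    for _ in range(steps):
        next_token, past = decoder(tokens, past)
        tokens.append(next_token)
    return tokens
```

With two separate models, the equivalent of `weight` would be stored twice, which is the duplication the merged file avoids.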

Enable inference with a merged decoder in ORTModelForCausalLM by @JingyaHuang in #647

Single-file ORTModel accepts numpy arrays

ORTModel accepts numpy arrays as inputs, in addition to PyTorch tensors. This is only the case for models that use a single ONNX.

Accept numpy.ndarray as input and output to ORTModel by @fxmarty in #790

ORTOptimizer support for ORTModelForCausalLM

ORTOptimizer support ORTModelForCausalLM by @fxmarty in #794
Support IO Binding for merged decoder by @fxmarty in #797

Breaking changes

In the ONNX export, exporting models in several ONNX (encoder, decoder) is now the default behavior: #747. The old behavior is still accessible with --monolith.
In decoders, reusing past key values is now the default in the ONNX export: #748. The old behavior is still accessible by explicitly passing, for example, --task causal-lm instead of --task causal-lm-with-past.
BigBird support in the ONNX export is removed, due to the block_sparse attention type being written in pure numpy in Transformers, and hence not exportable to ONNX: #778
The parameter from_transformers of ORTModel.from_pretrained will be deprecated in favor of export.
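A common way to handle a keyword rename such as from_transformers to export is a small compatibility shim. The sketch below is a generic pattern and not Optimum's code: `resolve_export_flag` is a hypothetical helper, and the choice of FutureWarning is an assumption.

```python
import warnings
from typing import Optional


def resolve_export_flag(export: bool = False, from_transformers: Optional[bool] = None) -> bool:
    """Hypothetical shim: accept the old keyword while steering callers to the new one."""
    if from_transformers is not None:
        warnings.warn(
            "`from_transformers` is deprecated, use `export` instead.",
            FutureWarning,
            stacklevel=2,
        )
        # The old keyword, when given, decides the value for backward compatibility.
        return bool(from_transformers)
    return export
```

Callers that already pass export see no warning, while older call sites keep working and are told what to change.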

Bugfixes and improvements

Fix disable shape inference for optimization by @regisss in #652
Fix uninformative message when passing use_cache=True to ORTModel and no ONNX with cache is available by @fxmarty in #650
Fix provider options when several providers are passed by @fxmarty in #653
Add TensorRT engine to ONNX Runtime GPU documentation by @fxmarty in #657
Improve documentation around ONNX export by @fxmarty in #666
minor updates on ONNX config guide by @mszsorondo in #662
Fix FlaubertOnnxConfig by @michaelbenayoun in #669
Use nvcr.io/nvidia/tensorrt image for GPU tests by @fxmarty in #660
Better Transformer doc fix by @HamidShojanazeri in #670
Add support for LongT5 optimization using ORT transformer optimizer script by @kunal-vaishnavi in #683
Add test for missing execution providers error messages by @fxmarty in #659
ONNX transformation to cast int64 constants to int32 when possible by @fxmarty in #655
Add missing normalized configs by @fxmarty in #694
Remove code duplication in ORTModel's load_model by @fxmarty in #695
Test more architectures in ORTModel by @fxmarty in #675
Avoid initializing unwanted attributes for ORTModel's having several inference sessions by @fxmarty in #696
Fix the ORTQuantizer loading from specific file by @echarlaix in #701
Add saving of diffusion model additional components ...

Contributors

jplu, sidthekidder, and 20 other contributors

13 Feb 16:54

fxmarty

v1.6.4

5da0411

v1.6.4: Patch release

Bugfix

Fix past key/value reuse in decoders following transformers 4.26.0 release and renaming: b9211d6
ONNX Runtime 1.14 support: #772

Full Changelog: v1.6.3...v1.6.4

25 Jan 17:28

JingyaHuang

v1.6.3

eba6afc

v1.6.3: Patch release

Fixes ORTTrainer for the inference with the ONNX Runtime backend.

25 Jan 11:38

fxmarty

v1.6.2

9f9d997

v1.6.2: Patch release

Hotfixes

Support generation config in ORTModel by @fxmarty in #651

Regressions

The export of speech-to-text architectures as a single ONNX file (that handles both the encoding and decoding) fails due to a regression with the latest transformers version: #721

Full Changelog: v1.6.1...v1.6.2

Contributors

fxmarty

23 Dec 20:32

fxmarty

v1.6.1

fdc08d7

v1.6.1: Patch release

Hotfixes

Revert breaking removal of EncoderOnnxConfig, DecoderOnnxConfig, _DecoderWithLMhead by @fxmarty in #643
Fix item access of some _TASKS_TO_AUTOMODELS by @fxmarty in #642

Full Changelog: v1.6.0...v1.6.1

Contributors

fxmarty

23 Dec 15:30

fxmarty

v1.6.0

06cdbc5

v1.6.0: Optimum CLI, Stable Diffusion ONNX export, BetterTransformer & ONNX support for more architectures

Optimum CLI

The Optimum command line interface is introduced, and is now the official entrypoint for the ONNX export. Example commands:

optimum-cli --help
optimum-cli export onnx --help
optimum-cli export onnx --model bert-base-uncased --task sequence-classification bert_onnx/

Add Optimum CLI backbone by @fxmarty in #593

Stable Diffusion ONNX export

Optimum now supports the ONNX export of stable diffusion models from the diffusers library:

optimum-cli export onnx --model runwayml/stable-diffusion-v1-5 sd_v15_onnx/

Add Stable Diffusion ONNX export by @echarlaix in #570

BetterTransformer support for more architectures

BetterTransformer integration includes new models in this release: CLIP, RemBERT, mBART, ViLT, FSMT

The complete list of supported models is available in the documentation.

[BT] Add Bettertransformer support for FSMT by @Sumanth077 in #494
[BT] add BetterTransformer support for ViLT architecture by @ka00ri in #508
Add MBart support for BetterTransformer by @ravenouse in #516
Add CLIP BetterTransformer by @fxmarty in #534
Add BetterTransformer support for RemBERT by @hchings in #545

ONNX export for more architectures

The ONNX export now supports Swin, MobileNet-v1, MobileNet-v2.

Add Swin support in exporters.onnx by @fxmarty in #528
[ONNX] add mobilenet support by @younesbelkada in #633

Extended ONNX export for encoder-decoder and decoder models

Encoder-decoder or decoder-only models normally making use of the generate() method in transformers can now be exported in several files using the --for-ort argument:

optimum-cli export onnx --model t5-small --task seq2seq-lm-with-past --for-ort t5_small_onnx

yielding:

.
└── t5_small_onnx
    ├── config.json
    ├── decoder_model.onnx
    ├── decoder_with_past_model.onnx
    ├── encoder_model.onnx
    ├── special_tokens_map.json
    ├── spiece.model
    ├── tokenizer_config.json
    └── tokenizer.json

Passing --for-ort, exported models are expected to be loadable directly into ORTModel.

Add ort export in exporters for encoder-decoder models by @mht-sharma in #497
Support decoder generated with --for-ort from optimum.exporters.onnx in ORTDecoder by @fxmarty in #554

Support for ONNX models with external data at export, optimization, quantization

The ONNX export from PyTorch normally creates external data when the exported model is larger than 2 GB. This release introduces better support for the export and use of large models, writing all external data into a .onnx_data file if necessary.

Handling ONNX models with external data by @NouamaneTazi in #586
Improve the compatibility dealing with large ONNX proto in ORTOptimizer and ORTQuantizer by @JingyaHuang in #332

ONNX Runtime API improvement

Various improvements to allow for a better user experience in the ONNX Runtime integration:

ORTModel, ORTModelDecoder and ORTModelForConditionalGeneration can now load any ONNX model files regardless of their names, so optimized and quantized models can be loaded without specifying a file name argument.
ORTModel.from_pretrained() with from_transformers=True now downloads and loads the model in a temporary directory instead of the cache, which was not the right place to store it.
ORTQuantizer.save_pretrained() now saves the model configuration and the preprocessor, making the exported directory usable end-to-end.
ORTOptimizer.save_pretrained() now saves the preprocessor, making the exported directory usable end-to-end.
ONNX Runtime integration API improvement by @michaelbenayoun in #515
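Loading a model file regardless of its name can be pictured as a directory lookup. The sketch below is a simplified illustration, not the lookup Optimum performs: `find_single_onnx_file` is a hypothetical helper that accepts a directory only when it contains exactly one .onnx file.

```python
from pathlib import Path


def find_single_onnx_file(model_dir: Path) -> Path:
    """Hypothetical helper: return the only .onnx file in a directory, whatever its name."""
    candidates = sorted(model_dir.glob("*.onnx"))
    if not candidates:
        raise FileNotFoundError(f"No .onnx file found in {model_dir}")
    if len(candidates) > 1:
        # Ambiguous directory: the caller has to say which file is meant.
        names = [p.name for p in candidates]
        raise ValueError(f"Several .onnx files found, pass a file name explicitly: {names}")
    return candidates[0]
```

Under this sketch, a directory holding only a file such as model_quantized.onnx resolves without any file name argument.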

Custom shapes support at ONNX export

The shape of the example input to provide for the export to ONNX can be overridden in case the validity of the ONNX model is sensitive to the shape used during the export.

Read more: optimum-cli export onnx --help

Support custom shapes for dummy inputs by @fxmarty in #522
Support for custom input shapes in exporters onnx by @fxmarty in #575

Enable `use_cache=True` for ORTModelForCausalLM

Reusing past key values for models using ORTModelForCausalLM (e.g. gpt2) is now possible using use_cache=True, which avoids recomputing them at each iteration of the decoding:

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = ORTModelForCausalLM.from_pretrained("gpt2", from_transformers=True, use_cache=True)

inputs = tokenizer("My name is Arthur and I live in", return_tensors="pt")

gen_tokens = model.generate(**inputs)
tokenizer.batch_decode(gen_tokens)
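To see why reusing past key values matters, one can count how many token positions go through the decoder over a whole generation. The sketch below is a simplified cost model and not a measurement: `positions_processed` is an illustrative name, and it ignores the fact that attention still reads the cached keys and values.

```python
def positions_processed(prompt_len: int, new_tokens: int, use_cache: bool) -> int:
    """Count token positions fed through the decoder over a whole generation."""
    if new_tokens <= 0:
        return 0
    if use_cache:
        # The prompt is processed once, then one position per additional step.
        return prompt_len + (new_tokens - 1)
    # Without a cache, every step re-processes the full sequence so far.
    return sum(prompt_len + step for step in range(new_tokens))


with_cache = positions_processed(prompt_len=8, new_tokens=20, use_cache=True)
without_cache = positions_processed(prompt_len=8, new_tokens=20, use_cache=False)
print(with_cache, without_cache)
```

In this model the cached run grows linearly with the number of generated tokens, while the uncached run grows quadratically.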

Enable past_key_values for ORTModelForCausalLM by @echarlaix in #326

IO binding support for ORTModelForCustomTasks

ORTModelForCustomTasks now supports IO Binding when using CUDAExecutionProvider.

Add IO binding support for custom ORTModel by @JingyaHuang in #447

Experimental support to merge ONNX decoder with/without past key values

Along with --for-ort, passing --task causal-lm-with-past, --task seq2seq-lm-with-past or --task speech2seq-lm-with-past during the ONNX export produces two models: one that does not use the previously computed keys/values, and one that uses them.

Experimental support is introduced to merge the two models into one. Example:

optimum-cli export onnx --model t5-small --task seq2seq-lm-with-past --for-ort t5_onnx/

import onnx
from optimum.onnx import merge_decoders

decoder = onnx.load("t5_onnx/decoder_model.onnx")
decoder_with_past = onnx.load("t5_onnx/decoder_with_past_model.onnx")

merged_model = merge_decoders(decoder, decoder_with_past)
onnx.save(merged_model, "t5_onnx/decoder_merged_model.onnx")

Merge ONNX decoder models by @JingyaHuang in #587

Major bugs fixed

Fix BetterTransformer with padding="max_length" by @fxmarty in #543
Fix non-nesting bug in BetterTransformer integration by @younesbelkada in #637

Other changes, bugfixes and improvements

Fix doc-builder premission error by @mishig25 in #482
Fix doc build pr premissions by @mishig25 in #484
Re-order the task manager doc by @michaelbenayoun in #483
Fix whisper device for gpu test by @fxmarty in #486
Fix tensorflow CI by @fxmarty in #489
Fix PR doc generation by @regisss in #495
Fix broken links in the doc by @fxmarty in #499
Update iobinding ORT encoder whisper by @mht-sharma in #498
fix NormalizedConfig init error message by @PaulQbFeng in #500
Change import structure for ORTModel by @fxmarty in #456
[BT] Fix failing CI tests by @younesbelkada in #501
Remove redundant condition statement in ORTDecoder(Seq2seq) by @JingyaHuang in #504
[BT] put decorator on the correct place by @younesbelkada in #509
[BT] clearer error message for norm_first by @younesbelkada in #510
Deprecate PyTorch 1.12. for BetterTransformer by @fxmarty in #513
Fix ORTModelForSeq2SeqLM test by @fxmarty in #455
Clearer error messages when initilizing the requested ONNX Runtime execution provider fails by @fxmarty in #514
[BT] Fix doc bugs by @younesbelkada in #517
Replace sklearn by scikit-learn by @lesteve in #502
ORTModel uses optimum.exporters.onnx by @michaelbenayoun in #490
Cleanup deprecated ONNX Runtime training docker files by @JingyaHuang in #523
Added support for Tapas Model by @juheon...

Contributors

lesteve, fxmarty, and 18 other contributors

Releases: huggingface/optimum
