
Releases: huggingface/transformers

BART, organizations, community notebooks, lightning examples, dropping Python 3.5

24 Mar 18:13

New Model: BART (added by @sshleifer)

BART is one of the first seq2seq models in the library, and achieves state-of-the-art results on text generation tasks such as abstractive summarization.
Three sets of pretrained weights are released:

  • bart-large: the pretrained base model
  • bart-large-cnn: the base model finetuned on the CNN/Daily Mail Abstractive Summarization Task
  • bart-large-mnli: the base model finetuned on the MNLI classification task.
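
To get started with these checkpoints, here is a minimal summarization sketch; it assumes a library version in which BartForConditionalGeneration and the generate parameters shown below are available:

from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("bart-large-cnn")

ARTICLE = "..."  # any long news article
inputs = tokenizer.encode(ARTICLE, return_tensors="pt", max_length=1024)
summary_ids = model.generate(inputs, num_beams=4, max_length=142, early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))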


Big thanks to the original authors, especially Mike Lewis, Yinhan Liu, and Naman Goyal, who helped answer our questions.

Model sharing CLI: support for organizations

The Hugging Face API for model upload now supports organizations.

Notebooks (@mfuntowicz)

A few beginner-oriented notebooks were added to the library, aiming to demystify the two libraries huggingface/transformers and huggingface/tokenizers. Contributors are welcome to contribute links to their own notebooks as well.

pytorch-lightning examples (@srush)

Examples leveraging pytorch-lightning were added, led by @srush.
The first example that was added is the NER example.
The second example is a lightning GLUE example, added by @nateraw.

New model architectures: CamembertForQuestionAnswering, AlbertForTokenClassification

  • CamembertForQuestionAnswering was added to the library and to the SQuAD script (@maximeilluin)
  • AlbertForTokenClassification was added to the library and to the NER example (@marma)

Multiple fixes were done on the fast tokenizers to make them entirely compatible with the Python tokenizers (@mfuntowicz)

Most of these fixes were done in patch 2.5.1. Fast tokenizers should now have the exact same API as the Python ones, with some additional functionality.

Docker images (@mfuntowicz)

Docker images for transformers were added.

Generation overhaul (@patrickvonplaten)

  • Special token ID logic was improved in run_generation and in the corresponding tests.
  • Slow generation tests were added for pre-trained LM models.
  • Greedy generation is now supported when doing beam search.
  • Sampling is now supported when doing beam search.
  • Generation functionality was added to TF2, with beam search, greedy generation and sampling.
  • Integration tests were added
  • no_repeat_ngram_size kwarg to avoid redundant generations (@sshleifer); see the sketch below
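
A hedged sketch of the new no_repeat_ngram_size kwarg, using GPT-2 purely as an example model:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer.encode("The weather today is", return_tensors="pt")
output = model.generate(
    input_ids,
    max_length=40,
    num_beams=5,
    no_repeat_ngram_size=2,  # no 2-gram may appear twice in the output
    early_stopping=True,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))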

Encoding methods now output only model-specific inputs

Models such as DistilBERT and RoBERTa do not make use of token type IDs. These inputs are no longer returned by the encoding methods, unless explicitly requested during tokenizer initialization.
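
For example (a sketch assuming the per-call return_token_type_ids flag on the encoding methods):

from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

enc = tokenizer.encode_plus("Hello world")
print("token_type_ids" in enc)  # False: DistilBERT does not use them

# Request them explicitly if a downstream component still expects them.
enc = tokenizer.encode_plus("Hello world", return_token_type_ids=True)
print(enc["token_type_ids"])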

Pipelines support summarization (@sshleifer)

  • The default architecture is bart-large-cnn, with the generation parameters published in the paper.
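
A minimal usage sketch, assuming generation kwargs are forwarded to the underlying model:

from transformers import pipeline

summarizer = pipeline("summarization")  # defaults to bart-large-cnn
print(summarizer("A very long news article ...", max_length=130, min_length=30))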

Models may now re-use the cache every time without prompting S3 (@BramVanroy)

Previously, every attempt to load a model from a pretrained checkpoint would check that the S3 etag matched the locally cached one. A new local_files_only argument now prevents this check, which can be useful behind a firewall.
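
A sketch of the new argument:

from transformers import AutoModel, AutoTokenizer

# Raises an error if the files are not already cached, instead of
# attempting a download or an etag check.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", local_files_only=True)
model = AutoModel.from_pretrained("bert-base-uncased", local_files_only=True)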

Usage examples for common tasks (@LysandreJik)

In a continuing effort to onboard new users (new to the lib or new to NLP in general), some usage examples were added to the documentation. These usage examples showcase how to do inference on several tasks:

  • NER
  • Sequence classification
  • Question Answering
  • Causal Language Modeling
  • Masked Language Modeling

Test suite on GPU (@julien-c)

CI now runs on GPU, for both PyTorch and TensorFlow.

Padding token ID needs to be set in order to pad (@patrickvonplaten)

Tokenizers could previously pad inputs even when no padding token was defined. This version updates them to match the expected behavior, which is that of the fast tokenizers: define a padding token, or an error is raised when trying to batch without one.
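
A sketch of the new behavior with GPT-2, which ships without a padding token (the era-appropriate pad_to_max_length flag is shown; later versions use padding=True):

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # without this, padding now raises an error

batch = tokenizer.batch_encode_plus(
    ["a short sequence", "a somewhat longer input sequence"],
    pad_to_max_length=True,
)
print(batch["input_ids"])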

Python >= 3.6

We're now dropping Python 3.5 support.

Community additions/bug-fixes/improvements

  • Added a warning when using add_special_tokens with the fast tokenizer methods of encoding (@LysandreJik)
  • batch_encode_plus was modified and tested to have the exact same behaviour as encode_plus, but on batched input
  • Cleanup DistilBERT code (@guillaume-be)
  • Only use F.gelu for torch >= 1.4.0 (@sshleifer)
  • Added a get_vocab method to tokenizers, which can be used to retrieve all the vocabulary from the tokenizers. (@joeddav)
  • Correct behaviour of special_tokens_mask when add_special_tokens=False (@LysandreJik)
  • Removed the untested Model2LSTM and Model2Model classes, which were not working
  • Fixed a crash in AutoModels caused by kwargs being passed to both the model and the configuration (@LysandreJik)
  • Corrected Transfo-XL tokenization of punctuation (@patrickvonplaten)
  • Better docstrings for XLNet (@patrickvonplaten)
  • Better operations for TPU support (@srush)
  • XLM-R tokenizer is now tested and bug-free (@LysandreJik)
  • XLM-R model and tokenizer now have integration tests (@patrickvonplaten)
  • Better documentation for tokenizers and pipelines (@LysandreJik)
  • All tests (slow and non-slow) now pass (@julien-c, @LysandreJik, @patrickvonplaten, @sshleifer, @thomwolf)
  • Correct attention mask with GPT-2 when using past (@patrickvonplaten)
  • Fixed the n_gpu count when the no_cuda flag is activated, in all examples (@VictorSanh)
  • Test TF GPT2 for correct behaviour regarding the past and attn mask variable (@patrickvonplaten)
  • Fixed bug where some missing keys would not be identified (@LysandreJik)
  • Correct num_labels initialization (@LysandreJik)
  • Model special tokens were added to the pretrained configurations (@patrickvonplaten)
  • QA models for XLNet, XLM and FlauBERT are now set to their "simple" architectures when using the pipeline.
  • GPT-2 XL was added to TensorFlow (@patrickvonplaten)
  • NER PL example updated (@shubhamagarwal92)
  • Improved Error message when loading config/model with .from_pretrained() (@patrickvonplaten, @julien-c)
  • Cleaner special token initialization in modeling_xxx.py (@patrickvonplaten)
  • Fixed the learning rate scheduler placement in the run_ner.py script (@erip)
  • Use AutoModels in examples (@julien-c, @lifefeel)

Patch v2.5.1: AutoTokenizer slow by default, bug fixes

24 Feb 23:53

AutoTokenizer

AutoTokenizer now defaults back to the slow (Python) tokenizers (use_fast=False) so as not to introduce a breaking change between 2.4.x and 2.5.x.

Fast tokenizers

Bug fixes

Slow tokenizers

Bug fixes related to batch_encode_plus

Rust Tokenizers, DistilBERT base cased, Model cards

19 Feb 16:54

Rust tokenizers (@mfuntowicz, @n1t0)

  • Tokenizers for BERT, RoBERTa, OpenAI GPT, OpenAI GPT-2 and Transformer-XL now leverage the huggingface/tokenizers library for fast tokenization 🚀
  • AutoTokenizer now defaults to fast tokenizers implementation when available
  • Calling batch_encode_plus on the fast tokenizers makes better use of CPU cores.
  • Tokenizers leveraging the native implementation use all CPU cores by default when calling batch_encode_plus. You can change this behavior by setting the environment variable RAYON_NUM_THREADS=N (see the sketch below).
  • An exception is raised when tokenizing an input with pad_to_max_length=True but no padding token is defined.
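
A sketch of opting in to the fast tokenizers explicitly and capping the Rust thread pool (the environment variable must be set before the first tokenization call):

import os
os.environ["RAYON_NUM_THREADS"] = "4"

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
batch = tokenizer.batch_encode_plus(["first text", "second text"])
print(batch["input_ids"])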

Known Issues:

  • RoBERTa fast tokenizer implementation has slightly different output when compared to the original Python tokenizer (< 1%).
  • The SQuAD example is not currently compatible with the new fast tokenizers and thus defaults to the plain Python ones.

DistilBERT base cased (@VictorSanh)

The distilled version of the bert-base-cased BERT checkpoint has been released.

Model cards (@julien-c)

Model cards are now stored directly in the repository.

CLI script for environment information (@BramVanroy)

We now host a CLI script that gathers all the environment information when reporting an issue. The issue templates have been updated accordingly.

Contributors visible on repository (@clmnt)

The main contributors as identified by Sourcerer are now visible directly on the repository.

From fine-tuning to pre-training (@julien-c )

The language fine-tuning script has been renamed from run_lm_finetuning to run_language_modeling as it is now also able to train language models from scratch.

Extracting archives now available from cached_path (@thomwolf )

Slight modification to cached_path so that zip and tar archives can be automatically extracted.

  • Archives are extracted in the same directory as the (possibly downloaded) archive, in a newly created extraction directory named after the archive.
  • Automatic extraction is activated by setting extract_compressed_file=True when calling cached_path.
  • The extraction directory is re-used to avoid extracting the archive again, unless force_extract=True is set, in which case the cached extraction directory is removed and the archive is extracted again.
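
A sketch of the new flags, assuming cached_path is importable from transformers.file_utils; the archive URL is hypothetical:

from transformers.file_utils import cached_path

archive_url = "https://example.com/some-dataset.tar.gz"  # hypothetical URL

# Download if needed, then extract into a directory named after the archive.
extracted = cached_path(archive_url, extract_compressed_file=True)

# Re-extract from scratch, discarding the cached extraction directory.
extracted = cached_path(archive_url, extract_compressed_file=True, force_extract=True)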

New activations file (@sshleifer )

Several activation functions (relu, swish, gelu, tanh and gelu_new) can now be accessed from the activations.py file and used in the different PyTorch models.
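
A sketch, assuming the get_activation lookup helper exposed by activations.py:

import torch
from transformers.activations import get_activation

gelu_new = get_activation("gelu_new")
print(gelu_new(torch.tensor([-0.5, 0.0, 0.5])))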

Community additions/bug-fixes/improvements

  • Remove redundant hidden states that broke encoder-decoder architectures (@LysandreJik )
  • Cleaner and more readable code in test_attention_weights (@sshleifer)
  • XLM can be trained on SQuAD in different languages (@yuvalpinter)
  • Improve test coverage on several models that were ill-tested (@LysandreJik)
  • Fix issue where TFGPT2 could not be saved (@neonbjb )
  • Multi-GPU evaluation on run_glue now behaves correctly (@peteriz )
  • Fix issue with TransfoXL tokenizer that couldn't be saved (@dchurchwell)
  • More robust conversion from ALBERT/BERT original checkpoints to huggingface/transformers models (@monologg)
  • FlauBERT bug fix; only add langs embeddings when there is more than one language handled by the model (@LysandreJik )
  • Fix CircleCI error with TensorFlow 2.1.0 (@mfuntowicz )
  • More specific testing advice in contributing (@sshleifer )
  • BERT decoder: Fix failure with the default attention mask (@asivokon )
  • Fix a few issues regarding the data preprocessing in run_language_modeling (@LysandreJik )
  • Fix an issue with leading spaces and the RobertaTokenizer (@joeddav )
  • Added pipeline: TokenClassificationPipeline, which is an alias over NerPipeline (@julien-c )

Patch v2.4.1: FlauBERT for AutoModel and AutoTokenizer

31 Jan 19:58

Patched an issue where FlauBERT couldn't be loaded with AutoModel and AutoTokenizer classes.

FlauBERT, MMBT, UmBERTo, Dutch model, improved documentation, training from scratch, clean Python code

31 Jan 14:55

FlauBERT, MMBT, UmBERTo

New TF architectures (@jplu)

  • TensorFlow XLM-RoBERTa was added (@jplu )
  • TensorFlow CamemBERT was added (@jplu )

Python best practices (@aaugustin)

  • Greatly improved the quality of the source code by leveraging black, isort and flake8. A test was added, check_code_quality, which checks that the contributions respect the contribution guidelines related to those tools.
  • Similarly, optional imports are better handled and raise more precise errors.
  • Cleaned up several requirements files, updated the contribution guidelines, and now rely on setup.py for the necessary dev dependencies.
  • You can clean up your code for a PR with (more details in CONTRIBUTING.md):
make style
make quality

Documentation (@LysandreJik)

The documentation was uniformized and better guidelines have been defined. This work is part of an ongoing effort to make transformers accessible to a larger audience. A glossary has been added, with definitions for the most frequently used inputs.

Furthermore, some tips are given concerning each model in their documentation pages.

The code samples are now tested on a weekly basis alongside other slow tests.

Improved repository structure (@aaugustin)

The source code was moved from ./transformers to ./src/transformers. Since it changes the location of the source code, contributors must update their local development environment by uninstalling and re-installing the library.

Python 2 is not supported anymore (@aaugustin )

Version 2.3.0 was the last version to support Python 2. As we begin the year 2020, official Python 2 support has been dropped.

Parallel testing (@aaugustin)

Tests can now be run in parallel.

Sampling sequence generator (@rlouf, @thomwolf )

An abstract generate method was added to PreTrainedModel and implemented in all models trained with CLM. It offers an API for text generation (sketched after the list):

  • with/without a prompt
  • with/without beam search
  • with/without greedy decoding/sampling
  • with any (and combination) of top-k/top-p/penalized repetitions
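
A sketch of this API, with GPT-2 used purely as an example of a CLM-trained model:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer.encode("Once upon a time", return_tensors="pt")
output = model.generate(
    input_ids,
    do_sample=True,          # sampling instead of greedy decoding
    top_k=50,                # keep only the 50 most likely tokens
    top_p=0.95,              # nucleus (top-p) sampling
    repetition_penalty=1.3,  # penalize repeated tokens
    max_length=50,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))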

Resuming training when interrupted (@bkkaggle )

Previously, when a training run was stopped, only the model weights and configuration were saved. The scripts now also save several other values: the global step, the current epoch, and the number of steps trained in the current epoch. When resuming a training run, all of these values are used to resume it correctly.

This applies to the following scripts: run_glue, run_squad, run_ner, run_xnli.

CLI (@julien-c , @mfuntowicz )

Model upload

  • The CLI now has better documentation.
  • Files can now be removed.

Pipelines

  • Expose the number of underlying FastAPI workers
  • Async forward methods
  • Fixed the environment variables so that they no longer conflict with each other (USE_TF, USE_TORCH)

Training from scratch (@julien-c )

The run_lm_finetuning.py script now handles training from scratch.

Changes in the configuration (@julien-c )

The configuration files now contain the architecture they refer to. There is no longer any need to have the architecture in the file name, as was previously necessary. This should ease the naming of community models.

New Auto models (@thomwolf )

A new type of AutoModel was added: AutoModelForPreTraining. It returns the model that was used during pre-training: for most models this is the base model with a language modeling head, whereas for others it is a dedicated class, e.g. BertForPreTraining for BERT.
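
A minimal sketch:

from transformers import AutoModelForPreTraining

model = AutoModelForPreTraining.from_pretrained("bert-base-uncased")
print(type(model).__name__)  # BertForPreTraining (MLM + NSP heads)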

HANS dataset (@ns-moosavi)

The HANS dataset was added to the examples. It allows for testing a model with adversarial evaluation of natural language inference.

[BREAKING CHANGES]

Ignored indices in PyTorch loss computing (@LysandreJik)

When using PyTorch, certain values can be ignored when computing the loss. In order for the loss function to understand which indices must be ignored, those have to be set to a certain value. Most of our models required those indices to be set to -1. We decided to set this value to -100 instead as it is PyTorch's default value. This removes the discrepancy between user-implemented losses and the losses integrated in the models.

Further help from @r0mainK.
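
A short illustration of why -100 works out of the box with PyTorch losses:

import torch
import torch.nn as nn

logits = torch.randn(3, 5)           # 3 token positions, 5 classes
labels = torch.tensor([1, -100, 4])  # the middle position is ignored

# nn.CrossEntropyLoss's ignore_index defaults to -100, so no extra
# configuration is needed for the masked position.
loss = nn.CrossEntropyLoss()(logits, labels)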

Community additions/bug-fixes/improvements

  • Can now save and load PreTrainedEncoderDecoder objects (@TheEdoardo93)
  • RoBERTa now bears more similarity to the FairSeq implementation (@DomHudson, @thomwolf)
  • Examples now better reflect the defaults of the encoding methods (@enzoampil)
  • TFXLNet now has a correct input mask (@thomwolf)
  • run_squad was fixed to allow better training for XLNet (@importpandas )
  • Tokenization performance improvement (3-8x) (@mandubian)
  • RoBERTa was added to the run_squad script (@erenup)
  • Fixed the special and added tokens tokenization (@vitaliyradchenko)
  • Fixed an issue with language generation for XLM when the batch size is greater than 1 (@patrickvonplaten)
  • Fixed an issue with the generate method which did not correctly handle the repetition penalty (@patrickvonplaten)
  • Completed the documentation for repeating_words_penalty_for_language_generation (@patrickvonplaten)
  • run_generation now leverages cached past input for models that have access to it (@patrickvonplaten)
  • Finally manage to patch a rarely occurring bug with DistilBERT, eventually named DistilHeisenBug or HeisenDistilBug (@LysandreJik, with the help of @julien-c and @thomwolf).
  • Fixed an import error in run_tf_ner (@karajan1001).
  • Feature conversion for GLUE now has improved logging messages (@simonepri)
  • Patched an issue with GPUs and run_generation (@alberduris)
  • Added support for ALBERT and XLMRoBERTa to run_glue
  • Fixed an issue with the DistilBERT tokenizer not loading correct configurations (@LysandreJik)
  • Updated the SQuAD for distillation script to leverage the new SQuAD API (@LysandreJik)
  • Fixed an issue with T5 related to its rp_bucket (@mschrimpf)
  • PPLM now supports repetition penalties (@IWillPull)
  • Modified the QA pipeline to consider all features for each example (@Perseus14)
  • Patched an issue with a file lock (@dimagalat @aaugustin)
  • The bias is now resized together with the weights when resizing a vocabulary projection layer with a new vocabulary size (@LysandreJik)
  • Fixed misleading token type IDs for RoBERTa, which does not leverage token type IDs; this has been clarified in the documentation (@LysandreJik). Same for XLM-R (@maksym-del).
  • Fixed the prepare_for_model when tensorizing and returning token type IDs (@LysandreJik).
  • Fixed the XLNet model which wouldn't work with torch 1.4 (@julien-c)
  • Fetch all possible files remotely (@julien-c )
  • BERT's BasicTokenizer respects never_split parameters (@DeNeutoy)
  • Add lower bound to tqdm dependency @brendan-ai2
  • Fixed glue processors failing on tensorflow datasets (@neonbjb)
  • XLMRobertaTokenizer can now be serialized (@brandenchan)
  • A classifier dropout was added to ALBERT (@peteriz)
  • The ALBERT configuration for v2 models were fixed to be identical to those output by Google (@LysandreJik )

Downstream NLP task API (feature extraction, text classification, NER, QA), Command-Line Interface and Serving – models: T5 – community-added models: Japanese & Finnish BERT, PPLM, XLM-R

20 Dec 21:40
Compare
Choose a tag to compare

New class Pipeline (beta): easily run and use models on down-stream NLP tasks

We have added a new class called Pipeline to simply run and use models for several down-stream NLP tasks.

A Pipeline is just a tokenizer + model wrapped so they can take human-readable inputs and output human-readable results.

The Pipeline takes care of: tokenizing input strings => converting them to tensors => running the model => post-processing the output

Currently, we have added the following pipelines with a default model for each:

  • feature extraction (can be used with any pretrained or finetuned model)
    inputs: strings/list of strings – output: list of floats (last hidden-states of the model for each token)
  • sentiment classification (DistilBert model fine-tuned on SST-2)
    inputs: strings/list of strings – output: list of dict with label/score of the top class
  • Named Entity Recognition (XLM-R finetuned on CoNLL2003 by the awesome @stefan-it)
    inputs: strings/list of strings – output: list of dict with label/entities/position of the named-entities
  • Question Answering (Bert Large whole-word version fine-tuned on SQuAD 1.0)
    inputs: dict of strings/list of dict of strings – output: list of dict with text/position of the answers

There are three ways to use pipelines:

  • in python:
from transformers import pipeline

# Test the default model for QA (Bert large finetuned on SQuAD 1.0)
nlp = pipeline('question-answering')
nlp(question="Where does Amy live?", context="Amy lives in Amsterdam.")
>>> {'answer': 'Amsterdam', 'score': 0.9657156007786263, 'start': 13, 'end': 21}

# Test a specific model for NER (XLM-R finetuned by @stefan-it on CoNLL03 English)
nlp = pipeline('ner', model='xlm-roberta-large-finetuned-conll03-english')
nlp("My name is Amy. I live in Paris.")
>>> [{'word': 'Amy', 'score': 0.9999586939811707, 'entity': 'I-PER'},
     {'word': 'Paris', 'score': 0.9999983310699463, 'entity': 'I-LOC'}]
  • in bash (using the command-line interface)
$ echo -e "Where does Amy live?\tAmy lives in Amsterdam" | transformers-cli run --task question-answering
{'score': 0.9657156007786263, 'start': 13, 'end': 22, 'answer': 'Amsterdam'}
  • as a REST API
transformers-cli serve --task question-answering

This new feature is currently in beta and will evolve in the coming weeks.

CLI tool to upload and share community models

Users can now create accounts on the huggingface.co website and then login using the transformers CLI. Doing so allows users to upload their models to our S3 in their respective directories, so that other users may download said models and use them in their tasks.

Users may upload files or directories.

It's been tested by @stefan-it for a German BERT and by @singletongue for a Japanese BERT.

New model architectures: T5, Japanese BERT, PPLM, XLM-RoBERTa, Finnish BERT

Refactoring the SQuAD example

The run_squad script has been massively refactored. The reasons are the following:

  • It was made to work with only a few models (BERT, XLNet, XLM and DistilBERT), which had three different ways of encoding sequences. The script had to be individually modified to train different models, which would not scale as other models are added to the library.
  • The utilities did not rely on the QOL adjustments made to the encoding methods over the past months.

It now leverages the full capacity of encode_plus, easing the addition of new models to the script. A new method, squad_convert_examples_to_features, encapsulates all of the tokenization. It can handle tensorflow_datasets as well as SQuAD v1 and v2 JSON files.

  • ALBERT was added to the SQuAD script

BertAbs summarization

A contribution by @rlouf building on the encoder-decoder mechanism to do abstractive summarization.

  • Utilities to load the CNN/DailyMail dataset
  • BertAbs now usable as a traditional library model (using from_pretrained())
  • ROUGE evaluation

New Models

Additional architectures

@alexzubiaga added XLNetForTokenClassification and TFXLNetForTokenClassification

New model cards

Community additions/bug-fixes/improvements

  • Added mish activation function @digantamisra98
  • run_bertology.py was updated with correct imports and the ability to overwrite the cache
  • Training can be exited and relaunched safely, while keeping the epochs, global steps, scheduler steps and other variables in run_lm_finetuning.py @bkkaggle
  • Tests now run on cuda @aaugustin @julien-c
  • Cleaned up the pytorch to tf conversion script @thomwolf
  • Progress indicator improvements when downloading pre-trained models @leopd
  • from_pretrained() can now load from URLs directly.
  • New tests to check that all files are accessible on HuggingFace's S3 @rlouf
  • Updated tf.shape and tensor.shape to all use shape_list @thomwolf
  • Valohai integration @thomwolf
  • Always use SequentialSampler in run_squad.py @ethanjperez
  • Stop using GPU when importing transformers @ondewo
  • Fixed the XLNet attention output @roskoN
  • Several QOL adjustments: removed dead code, deep cleaned tests and removed pytest dependency @aaugustin
  • Fixed an issue with the Camembert tokenization @thomwolf
  • Correctly create an encoder attention mask from the shape of the hidden states @rlouf
  • Fixed a non-deterministic behavior when encoding and decoding empty strings @pglock
  • Fixing tensor creation in encode_plus @LysandreJik
  • Remove usage of tf.mean which does not exist in TF2 @LysandreJik
  • A segmentation fault error was fixed (due to scipy 1.4.0) @LysandreJik
  • Start sunsetting support of Python 2
  • An example usage of Model2Model was added to the quickstart.

Bug fixes

20 Dec 14:53

Patched an error where the tokenizers would split the special tokens.

Bug fixes related to input shape in TensorFlow and tokenization messages

03 Dec 16:23

Input shapes

This patch fixes a bug related to the input shape in several models in TensorFlow.

Tokenization message

A tokenization message was printed too often, overloading the output and hiding the relevant information. It was removed.

ALBERT, CamemBERT, DistilRoberta, GPT-2 XL, and Encoder-Decoder architectures

26 Nov 19:26

New model architectures: ALBERT, CamemBERT, GPT2-XL, DistilRoberta

Four new models have been added in v2.2.0

  • ALBERT (Pytorch & TF) (from Google Research and the Toyota Technological Institute at Chicago) released with the paper ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.
  • CamemBERT (Pytorch) (from Facebook AI Research, INRIA, and La Sorbonne Université), the first large-scale Transformer language model trained on French. Released alongside the paper CamemBERT: a Tasty French Language Model by Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suarez, Yoann Dupont, Laurent Romary, Eric Villemonte de la Clergerie, Djame Seddah, and Benoît Sagot. It was added by @louismartin with the help of @julien-c.
  • DistilRoberta (Pytorch & TF) from @VictorSanh as the third distilled model after DistilBERT and DistilGPT-2.
  • GPT-2 XL (Pytorch & TF) as the last GPT-2 checkpoint released by OpenAI

Encoder-Decoder architectures

It is now possible to create fully seq2seq models by incorporating Encoder-Decoder architectures, using a PreTrainedEncoderDecoder class that can be initialized from pre-trained models. The base BERT class has been modified so that it can behave as a decoder.

Furthermore, a Model2Model class that simplifies the definition of an encoder-decoder when both encoder and decoder are based on the same model has been added. @rlouf

Benchmarks and performance improvements

Work by @tlkh and @LysandreJik benchmarking the library's models with different technologies: TensorFlow and PyTorch, mixed precision (AMP and FP16), and model tracing (TorchScript and XLA). A new benchmarks section was created in the documentation, pointing to Google sheets with the results.

Breaking changes

Tokenizers now add special tokens by default. @LysandreJik

New model templates

Model templates to ease the addition of new models to the library have been added. @thomwolf

Inputs Embeddings

A new input has been added to all models' forward (for PyTorch) and call (for TensorFlow) methods. These inputs_embeds are a directly embedded representation. This is useful as it gives more control over how input_ids indices are converted into associated vectors than the model's internal embedding lookup matrix. @julien-c
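
A minimal sketch:

import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# Any float tensor of shape [batch, seq_len, hidden_size] can be fed
# directly, bypassing the internal embedding lookup.
inputs_embeds = torch.randn(1, 6, model.config.hidden_size)
outputs = model(inputs_embeds=inputs_embeds)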

Getters and setters for input and output embeddings

A new API for the input and output embeddings is available. These methods are model-independent and allow easy acquisition/modification of the models' embeddings. @thomwolf
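
A sketch of the accessors, with resize_token_embeddings shown as one common use:

from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
input_embeddings = model.get_input_embeddings()    # an nn.Embedding
output_embeddings = model.get_output_embeddings()  # the LM head, tied to the input

# Grow the vocabulary (e.g. after adding tokens) and re-tie the weights.
model.resize_token_embeddings(input_embeddings.num_embeddings + 2)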

Additional architectures

New model architectures are available, namely: DistilBertForTokenClassification, CamembertForTokenClassification @stefan-it

Community additions/bug-fixes/improvements

  • The Fairseq RoBERTa model conversion script has been patched. @louismartin
  • einsum now runs in FP-16 in the library's examples @slayton58
  • In-depth work on the SQuAD script for XLNet to reproduce the original paper's results @hlums
  • Additional improvements on the run_squad script by @WilliamTambellini, @orena1
  • The run_generation script has seen several improvements by @leo-du
  • The RoBERTa TensorFlow model has been patched for several use-cases: TPU and keras.fit @LysandreJik
  • The documentation is now versioned, links are available on the github readme @LysandreJik
  • The run_ner script has seen several improvements @mmaybeno, @oneraghavan, @manansanghi
  • The run_tf_glue script now works for all GLUE tasks @LysandreJik
  • The run_lm_finetuning script now correctly evaluates perplexity on MLM tasks @altsoph
  • An issue related to the XLM TensorFlow implementation's training has been fixed @tlkh
  • run_bertology has been updated to be closer to the run_glue example @adrianbg
  • Fixed added special tokens in decoded sequences @LysandreJik
  • Several performance improvements have been done to the tokenizers @iedmrc
  • A memory leak has been identified and patched in the library's schedulers @rlouf
  • Correct warning when encoding a sequence too long while specifying a maximum length @LysandreJik
  • Resizing the token embeddings now works as expected in the run_lm_finetuning script @iedmrc
  • The difference in versions between PyPI and source needed to run the examples has been clarified @rlouf

CTRL, DistilGPT-2, Pytorch TPU, tokenizer enhancements, guideline requirements

11 Oct 14:50

New model architectures: CTRL, DistilGPT-2

Two new models have been added since release 2.0.

Distillation

Several updates have been made to the distillation script, including the possibility to distill GPT-2 and to distill on the SQuAD task. By @VictorSanh.

Pytorch TPU support

The run_glue.py example script can now run on a Pytorch TPU.

Updates to example scripts

Several example scripts have been improved and refactored to use the full potential of the new tokenizer functions.

QOL enhancements on the tokenizer

Enhancements have been made to the tokenizers. Two new methods have been added: get_special_tokens_mask and truncate_sequences.

The former returns a mask indicating which tokens are special tokens in a token list, and which are tokens from the initial sequences. The latter truncates sequences according to a strategy.

Both of these methods are called by the encode_plus method, which is itself called by the encode method. encode_plus now returns a larger dictionary which holds information about the special tokens, as well as the overflowing tokens.
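
A sketch of get_special_tokens_mask on a BERT tokenizer (the add_special_tokens flag on encode is assumed):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
ids_a = tokenizer.encode("first sequence", add_special_tokens=False)
ids_b = tokenizer.encode("the second one", add_special_tokens=False)

# 1 marks positions that special tokens ([CLS]/[SEP]) will occupy,
# 0 marks tokens coming from the initial sequences.
mask = tokenizer.get_special_tokens_mask(ids_a, ids_b)
print(mask)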

Thanks to @julien-c, @thomwolf, and @LysandreJik for these additions.

New German BERT models

Breaking changes

  • The two methods add_special_tokens_single_sequence and add_special_tokens_sequence_pair have been removed. They have been replaced by the single method build_inputs_with_special_tokens which has a more comprehensible name and manages both sequence singletons and pairs.

  • The boolean parameter truncate_first_sequence has been removed from the tokenizers' encode and encode_plus methods, replaced by a strategy in the form of a string: 'longest_first', 'only_second', 'only_first' or 'do_not_truncate' are accepted strategies (see the sketch below).

  • When the encode or encode_plus methods are called with a specified max_length, the sequences will now always be truncated or throw an error if overflowing.
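
A sketch of the string strategies (the era-appropriate truncation_strategy kwarg; later versions use truncation= instead):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer.encode_plus(
    "a short question",
    "a much longer context passage that will be cut down to fit",
    max_length=16,
    truncation_strategy="only_second",  # only truncate the second sequence
)
print(len(enc["input_ids"]))  # 16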

Guidelines and requirements

New contributing guidelines have been added, alongside library development requirements by @rlouf, the newest member of the HuggingFace team.

Community additions/bug-fixes/improvements

  • GLUE Processors have been refactored to handle inputs for all tasks coming from the tensorflow_datasets. This work has been done by @agrinh and @philipp-eisen.
  • The padding_idx is now correctly initialized to 1 in randomly initialized RoBERTa models. @ikuyamada
  • The documentation CSS has been adapted to work on older browsers. @TimYagan
  • An addition concerning the management of hidden states has been added to the README by @BramVanroy.
  • Integration of TF 2.0 models with other Keras modules @thomwolf
  • Past values can be opted out of @thomwolf