[cherry-pick] Release 0.12.1 fixes (#744)
* Avoid numerically unstable log (#694)

* fix QAT->Quant conversion of repeated Gemm layers with no activation QDQ (#698)

* Revert ResNet residual quant (#691)

* Revert ResNet definition to not quantize input to add op in residual branches.

* Correct typo.

Co-authored-by: Mark Kurtz <mark@neuralmagic.com>

* Fix: Add linebreak before 'Supplied' for better readability (#701)

* Bump notebook in /research/information_retrieval/doc2query (#679)

Bumps [notebook](http://jupyter.org) from 6.4.1 to 6.4.10.

---
updated-dependencies:
- dependency-name: notebook
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Mark Kurtz <mark@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>

* Added integration to masked_language_modeling training command (#707)

* Switch off fp16 on QAT start (#703)

* Switch off fp16 on QAT start

* address: review comments

* Disable fp16 when torch version is less than `1.9`

* Fix transformer prediction step (#716)

* Fix for prediction step when teacher model has more inputs than student.

* Updated signature of prediction_step method.

* Style and quality fixes.

* bump main to 0.13 (#696)

Co-authored-by: dhuang <dhuang@dhuangs-MacBook-Pro.local>

* Fix: default python log calls to debug level (#719)

* Feature/integrations (#688)

* added tutorials to root readme split by domain

* readme update

* edited text/structure

* grammar edits

* fix QATWrapper not properly overwriting qconfig properties for symmetric activations (#724)

* re-add fix for symmetric zero points for uint8 quantization (#604) (#725)

* Fix 'self' and 'disable' not working for transformers distillation (#731)

* Click refactor for SparseML-PyTorch integration with Image Classification models (#711)

* Click refactor for SparseML-PyTorch integration

* Click refactor for `Pruning Sensitivity` analysis (#714)

* Click refactor for SparseML-PyTorch pr_sensitivity analysis integration

* Review comments from @KSGulin

* Click refactor for SparseML-PyTorch `lr-analysis` integration (#713)

* Click refactor for SparseML-PyTorch lr-analysis integration

* Review comments from @KSGulin

* Click refactor for SparseML PyTorch `export` integration (#712)

* Click refactor for SparseML-PyTorch export integration

* Review comments from @KSGulin

* Addressed all review comments from @bfineran, @dbogunowicz and @KSGulin

* Regenerate and Update the train-cli docstring due to changes in a few cli-args

* `nm_argparser.py` not needed anymore

* removed `nm_argparser.py` from init

* Remove all CLI arg aliases and update docstrings accordingly

* [Fix] Follow-up fix for #731 (Fix 'self' and 'disable' not working for transformers distillation) (#737)

* initial commit

* added more files and fixed quality

* Update trainer.py

* Added flag to exclude quantization of embedding activations. (#738)

* Added flag to exclude quantization of embedding activations.

* Updated testing to contemplate quantize_embedding_activations flag.

* Updated testing to contemplate quantize_embedding_activations flag.

* Updated debugging

* Revert "Updated debugging"

This reverts commit 449703d.

* Corrected order of arguments to pass assertion.

* Update src/sparseml/version.py

Co-authored-by: Eldar Kurtic <eldar.ciki@gmail.com>
Co-authored-by: Benjamin Fineran <bfineran@users.noreply.github.com>
Co-authored-by: Alexandre Marques <alexandre@neuralmagic.com>
Co-authored-by: Konstantin Gulin <66528950+KSGulin@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: Rahul Tuli <rahul@neuralmagic.com>
Co-authored-by: dhuangnm <74931910+dhuangnm@users.noreply.github.com>
Co-authored-by: dhuang <dhuang@dhuangs-MacBook-Pro.local>
Co-authored-by: Ricky Costa <79061523+InquestGeronimo@users.noreply.github.com>
Co-authored-by: dbogunowicz <97082108+dbogunowicz@users.noreply.github.com>
12 people committed May 2, 2022
1 parent 81f5f33 commit d82e3bd
Showing 25 changed files with 1,906 additions and 2,211 deletions.
206 changes: 138 additions & 68 deletions integrations/huggingface-transformers/README.md

# SparseML Hugging Face Transformers Integration

This directory combines the SparseML recipe-driven approach with the [huggingface/transformers](https://github.com/huggingface/transformers) repository. By integrating the robust training flows in the `transformers` repository with the SparseML code base, we enable model sparsification techniques on popular NLP models such as [BERT](https://arxiv.org/abs/1810.04805), creating smaller and faster deployable versions. The techniques include, but are not limited to:

- Pruning
- Quantization
- Pruning and Quantization
- Knowledge Distillation
- Sparse Transfer Learning

## Highlights

Coming soon!
## Tutorials

- [Sparsifying BERT Models Using Recipes](https://github.com/neuralmagic/sparseml/blob/main/integrations/huggingface-transformers/tutorials/sparsifying_bert_using_recipes.md)
- [Sparse Transfer Learning With BERT](https://github.com/neuralmagic/sparseml/blob/main/integrations/huggingface-transformers/tutorials/bert_sparse_transfer_learning.md)

## Installation

To begin, run the following command in the root directory of this integration (`cd integrations/huggingface-transformers`):
```bash
pip install sparseml[torch]
```

**Note**: Transformers will not immediately install with this command. Instead, a sparsification-compatible version of Transformers will install on the first invocation of the Transformers code in SparseML.

## SparseML CLI

The SparseML installation provides a CLI for sparsifying your models for a specific task; appending the `--help` argument will provide a full list of options for training in SparseML:

```bash
sparseml.transformers.[task] --help
```

e.g. `sparseml.transformers.question_answering --help`

output:

```bash
--output_dir: The directory in which to store the outputs from the training runs such as results, the trained model, and supporting files.
--model_name_or_path: The path or SparseZoo stub for the model to load for training.
--recipe: The path or SparseZoo stub for the recipe to use to apply sparsification algorithms or sparse transfer learning to the model.
--distill_teacher: The path or SparseZoo stub for the teacher to load for distillation.
--dataset_name or --task_name: The dataset or task to load for training.
```

## Sparse Transfer Learning | Question Answering Example

### Dense Teacher Creation

To enable distillation, you will first create a dense teacher model that the sparse model will learn from while transferring. **If you already have a Transformers-compatible model, you can use this as the dense teacher in place of training one from scratch.** The following command will use the dense BERT base model from the SparseZoo and fine-tune it on the SQuAD dataset, resulting in a model that achieves 88.5% F1 on the validation set:

```bash
sparseml.transformers.question_answering \
--output_dir models/teacher \
--model_name_or_path zoo:nlp/masked_language_modeling/bert-base/pytorch/huggingface/wikipedia_bookcorpus/base-none \
--recipe zoo:nlp/masked_language_modeling/bert-base/pytorch/huggingface/wikipedia_bookcorpus/base-none?recipe_type=transfer-question_answering \
--dataset_name squad \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 24 \
--preprocessing_num_workers 6 \
--do_train \
--do_eval \
--evaluation_strategy epoch \
--fp16 \
--seed 42 \
--save_strategy epoch \
--save_total_limit 1
```

With the dense teacher trained to convergence, you can begin sparse transfer learning with distillation using a recipe. The dense teacher will distill knowledge into the sparse architecture, thereby increasing its performance while ideally converging to the dense solution's accuracy.

💡**PRO TIP**💡: Recipes encode the instructions and hyperparameters for sparsifying a model using modifiers to the training process. The modifiers can range from pruning and quantization to learning rate and weight decay. When appropriately combined, it becomes possible to create highly sparse and accurate models.
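
To make the PRO TIP concrete, the sketch below shows how a recipe can drive a custom PyTorch training loop through SparseML's Python API. The toy model, data, and `recipe.yaml` path are placeholders, and the `ScheduledModifierManager` calls reflect the SparseML 0.x API; the CLI commands in this guide apply recipes for you automatically, so this is purely illustrative.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

from sparseml.pytorch.optim import ScheduledModifierManager

# Toy model and data so the sketch is self-contained; swap in your own objects.
model = torch.nn.Sequential(
    torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 2)
)
dataset = TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,)))
train_loader = DataLoader(dataset, batch_size=8)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

# Load the recipe (placeholder path) and wrap the optimizer so the recipe's
# modifiers (pruning, quantization, learning rate, etc.) fire on schedule.
manager = ScheduledModifierManager.from_yaml("recipe.yaml")
optimizer = manager.modify(model, optimizer, steps_per_epoch=len(train_loader))

for epoch in range(int(manager.max_epochs)):  # the recipe defines the schedule length
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), labels)
        loss.backward()
        optimizer.step()

manager.finalize(model)  # remove SparseML hooks, leaving the sparsified model
```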

### Transfer Learn the Model

The following command will use the 80% sparse-quantized BERT model from the SparseZoo and fine-tune it on the SQuAD dataset, resulting in a model that achieves an F1 of 88.5% on the validation set. Keep in mind that the `--distill_teacher` argument is set to pull a dense SQuAD model from the SparseZoo so that the command can run independently of the dense teacher step. If you trained a dense teacher, change this out for the path to your model folder:

```bash
sparseml.transformers.question_answering \
--output_dir models/sparse_quantized \
--model_name_or_path zoo:nlp/masked_language_modeling/bert-base/pytorch/huggingface/wikipedia_bookcorpus/12layer_pruned80_quant-none-vnni \
--recipe zoo:nlp/masked_language_modeling/bert-base/pytorch/huggingface/wikipedia_bookcorpus/12layer_pruned80_quant-none-vnni?recipe_type=transfer-question_answering \
--distill_teacher zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/base-none \
--dataset_name squad \
--per_device_train_batch_size 12 \
--per_device_eval_batch_size 24 \
--preprocessing_num_workers 6 \
--do_train \
--do_eval \
--evaluation_strategy epoch \
--fp16 \
--seed 21636 \
--save_strategy epoch \
--save_total_limit 1
```

Once the command has completed, you will have a sparse checkpoint located in `models/sparse_quantized`.

### Exporting to ONNX

The DeepSparse Engine uses the ONNX format to load neural networks and then deliver breakthrough performance for CPUs by leveraging the sparsity and quantization within a network.

The SparseML installation provides a `sparseml.transformers.export_onnx` command that you can use to load the training model folder and create a new model.onnx file within. Be sure the `--model_path` argument points to your trained model. By default, it is set to the result from transfer learning a sparse-quantized BERT model:

```bash
sparseml.transformers.export_onnx \
--model_path models/sparse_quantized \
--task 'question-answering' \
--sequence_length 384
```
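
Before deploying, it can be worth a quick sanity check of the exported file with the `onnx` Python package. This is an optional sketch; the `models/sparse_quantized/model.onnx` path assumes the default `--model_path` used above:

```python
import onnx

# The export command above writes model.onnx inside the --model_path folder
model = onnx.load("models/sparse_quantized/model.onnx")

# Validate the graph, then print the expected input names and shapes
onnx.checker.check_model(model)
for inp in model.graph.input:
    dims = [d.dim_value or d.dim_param for d in inp.type.tensor_type.shape.dim]
    print(inp.name, dims)
```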

### DeepSparse Engine Deployment

Now that the model is in an ONNX format, it is ready for deployment with the DeepSparse Engine.

Run the following command to install it:

```bash
pip install deepsparse
```

Once DeepSparse is installed in your deployment environment, two options are supported for deployment:
- A Python API that fits into your current deployment pipelines.
- The DeepSparse Server, which enables a no-code CLI solution to run your model via FastAPI's HTTP server.

### 🐍 Python API

The Python code below gives an example of using the DeepSparse Python pipeline API with the question-answering task. Be sure to change out the `model_path` argument for the model folder of your trained model:

Python Pipeline:

```python
from deepsparse.transformers import pipeline

model_path = "zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/12layer_pruned80_quant-none-vnni"

qa_pipeline = pipeline(
    task="question-answering",
    model_path=model_path
)

inference = qa_pipeline(question="What's my name?", context="My name is Snorlax")
print(inference)
```
printout:

```
{'score': 0.9947717785835266, 'start': 11, 'end': 18, 'answer': 'Snorlax'}
```

### 🔌DeepSparse Server

To use the DeepSparse Server, first install the required dependencies using pip:

```bash
pip install deepsparse[server]
```

Once installed, the CLI command given below for serving a BERT model is available. The commands are set up so they can run independently of the prior stages. Once launched, you can view information about the server and the available APIs at `http://0.0.0.0:5543` on the deployment machine.

```bash
deepsparse.server \
--task question_answering \
--batch_size 1 \
--model_path "zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/12layer_pruned80_quant-none-vnni"
```
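
Once the server is running, you can exercise it from Python with a plain HTTP request. This is a sketch only: the `/predict` route and the JSON field names are assumptions, so confirm the exact paths on the server's auto-generated API docs at the address above:

```python
import requests

# Assumed endpoint; verify the route on the server's docs page (e.g. http://0.0.0.0:5543/docs)
url = "http://0.0.0.0:5543/predict"
payload = {
    "question": "What's my name?",
    "context": "My name is Snorlax",
}

response = requests.post(url, json=payload)
print(response.json())
```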

**Note: there is currently a known issue where conversion of the BERT models from PyTorch into ONNX is not preserving the accuracy of the model for some tasks and datasets. If you encounter this issue, try rolling back to the 0.9.0 release. As a resolution is being actively investigated, this note will be removed when the issue has been remediated.**
For more details, check out the [Getting Started with the DeepSparse Server](https://github.com/neuralmagic/deepsparse/tree/main/src/deepsparse/server).
2 changes: 1 addition & 1 deletion research/information_retrieval/doc2query/requirements.txt
nbformat==5.1.3
nest-asyncio==1.5.1
networkx==2.5.1
nltk==3.6.6
notebook==6.4.10
numpy==1.21.0
onnx==1.7.0
onnxruntime==1.8.0