Add custom ONNX guides for use-cases. Add bucketing and scheduler guides (#960)

* added user guide

* Delete qa_server_config.yaml

* removed gatsby headers

* update benchmarking

* Update benchmarking.md

* Update and rename benchmarking.md to deepsparse-benchmarking.md

* Update deepsparse-pipelines.md

* Update deepsparse-server.md

* Update scheduler.md

* Update user-guide/scheduler.md

Co-authored-by: Michael Goin <michael@neuralmagic.com>

* Update user-guide/scheduler.md

Co-authored-by: Michael Goin <michael@neuralmagic.com>

* Update user-guide/scheduler.md

Co-authored-by: Michael Goin <michael@neuralmagic.com>

* Update user-guide/deepsparse-pipelines.md

Co-authored-by: Michael Goin <michael@neuralmagic.com>

* Update user-guide/deepsparse-pipelines.md

Co-authored-by: Michael Goin <michael@neuralmagic.com>

* Update user-guide/deepsparse-pipelines.md

Co-authored-by: Michael Goin <michael@neuralmagic.com>

* added README

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* added sentiment-analysis

* Update sentiment-analysis.md

* added installation

* Update installation.md

* Update installation.md

* Update README.md

* Update deepsparse-pipelines.md

* Update deepsparse-pipelines.md

* add text classification doc

* add text classification doc

* add text classification doc

* Use Engine

* add question answering document

* add token classification document

* update benchmarks

* add transformers extraction embedding doc

* add general embedding doc

* add image classification doc

* add image classification doc

* add yolo document

* add YOLACT doc

* update yolov5 doc

* update yolov5 doc

* Update yolov5-object-detection.md

* Update image-classification.md

* Update image-segmentation-yolact.md

* Apply suggestions from code review

Co-authored-by: Jeannie Finks <74554921+jeanniefinks@users.noreply.github.com>

* RS Edits to CV

* updated embedding extraction example

* updated sentiment analysis and text classification examples

* added zero shot text classification

* RS edited token classification

* updated question answering example

* updated embedding extraction case

* updated directory structure

* updated dir structure

* updated dir structure

* Update image-classification.md

* Update image-classification.md

* Update image-classification.md

* Update object-detection-yolov5.md

* Update object-detection-yolov5.md

* Update object-detection-yolov5.md

* Update image-segmentation-yolact.md

* Update image-segmentation-yolact.md

* Update embedding-extraction.md

* Update sentiment-analysis.md

* Update question-answering.md

* Update text-classification.md

* Update embedding-extraction.md

* Create README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update deepsparse-pipelines.md

* Update deepsparse-pipelines.md

* Update deepsparse-pipelines.md

* Update deepsparse-server.md

* Update deepsparse-pipelines.md

* Update deepsparse-server.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update image-segmentation-yolact.md

* Update image-classification.md

* Update object-detection-yolov5.md

* Update question-answering.md

* Update sentiment-analysis.md

* Update text-classification.md

* Update token-classification.md

* Update transformers-embedding-extraction.md

* Update zero-shot-text-classification.md

* Update question-answering.md

* Update question-answering.md

* Update question-answering.md

* Update sentiment-analysis.md

* Update sentiment-analysis.md

* Update sentiment-analysis.md

* Update text-classification.md

* Update text-classification.md

* Update text-classification.md

* Update text-classification.md

* Update token-classification.md

* Update token-classification.md

* Update zero-shot-text-classification.md

* Update zero-shot-text-classification.md

* Update zero-shot-text-classification.md

* Update transformers-embedding-extraction.md

* Update embedding-extraction.md

* Update image-classification.md

* Update image-segmentation-yolact.md

* Update object-detection-yolov5.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Add files via upload

* Update README.md

* Update README.md

* Update README.md

* added copyrights

* Update README.md

Co-authored-by: Michael Goin <michael@neuralmagic.com>

* Update README.md

Co-authored-by: Michael Goin <michael@neuralmagic.com>

* Update README.md

Co-authored-by: Michael Goin <michael@neuralmagic.com>

* Update README.md

Co-authored-by: Michael Goin <michael@neuralmagic.com>

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* How to use the scheduler across engine, pipeline, server

* How to use the scheduler across engine, pipeline, server

* Using custom ONNX file with YOLOv5

* YOLACT ONNX docs

* ResNet ONNX docs

* ONNX embedding extraction

* custom ONNX question answering

* custom ONNX sentiment analysis

* update sentiment and QA docs

* update sentiment and QA docs

* text classification ONNX

* token classification ONNX

* transformer embedding extraction ONNX

* zero shot text classification ONNX

* Add copy right

* bucketing docs

* update bucketing

* Download models

* move ONNX docs

* update model download section

* Update qa docs

* scheduler update

* Fix merge with main

---------

Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: Derrick Mwiti <mwitiderrick@gmail.com>
Co-authored-by: Jeannie Finks <74554921+jeanniefinks@users.noreply.github.com>
4 people committed Sep 1, 2023
1 parent 136b9f8 commit fa73631
Showing 13 changed files with 620 additions and 1 deletion.
28 changes: 28 additions & 0 deletions docs/use-cases/cv/embedding-extraction.md
@@ -106,3 +106,31 @@ print(len(result["embeddings"][0][0]))

### Cross Use Case Functionality
Check out the [Server User Guide](../../user-guide/deepsparse-server.md) for more details on configuring the Server.

## Using a Custom ONNX File
Apart from using models from the SparseZoo, DeepSparse allows you to define custom ONNX files for embedding extraction.

The first step is to obtain an ONNX model, typically by exporting your trained model to ONNX.

Download the [ResNet-50 - ImageNet](https://sparsezoo.neuralmagic.com/models/cv%2Fclassification%2Fresnet_v1-50%2Fpytorch%2Fsparseml%2Fimagenet%2Fpruned95_uniform_quant-none) ONNX model for demonstration:

```bash
sparsezoo.download zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95_uniform_quant-none --save-dir ./embedding-extraction
```
Use the ResNet-50 ONNX model for embedding extraction:
```python
from deepsparse import Pipeline

# this step removes the projection head before compiling the model
rn50_embedding_pipeline = Pipeline.create(
    task="embedding-extraction",
    base_task="image-classification",  # tells the pipeline to expect images and normalize input with ImageNet means/stds
    model_path="embedding-extraction/model.onnx",
    emb_extraction_layer=-3,  # extracts the last layer before the projection head and softmax
)

# this step runs pre-processing, inference and returns an embedding
embedding = rn50_embedding_pipeline(images="lion.jpeg")
print(len(embedding.embeddings[0][0]))
# 2048
```
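As a quick check of the embeddings, you can compare two images with cosine similarity. Below is a minimal sketch reusing the pipeline compiled above; `tiger.jpeg` is a hypothetical second local image:

```python
import numpy as np

# embed both images with the pipeline compiled above
emb_a = np.array(rn50_embedding_pipeline(images="lion.jpeg").embeddings[0][0])
emb_b = np.array(rn50_embedding_pipeline(images="tiger.jpeg").embeddings[0][0])  # hypothetical image

# cosine similarity close to 1.0 indicates visually similar content
cos_sim = np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
print(cos_sim)
```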
25 changes: 25 additions & 0 deletions docs/use-cases/cv/image-classification.md
@@ -259,6 +259,31 @@ resp = requests.post(url=url, files=files)
print(resp.text)
# {"labels":[291,260,244],"scores":[24.185693740844727,18.982254028320312,16.390701293945312]}
```

### Cross Use Case Functionality

Check out the [Server User Guide](../../user-guide/deepsparse-server.md) for more details on configuring the Server.

## Using a Custom ONNX File
Apart from using models from the SparseZoo, DeepSparse allows you to define custom ONNX files when deploying a model.

The first step is to obtain an ONNX model, typically by exporting your trained model to ONNX, as sketched below.
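Here is a minimal export sketch with PyTorch and torchvision; the pretrained dense `resnet50` weights stand in for your own trained checkpoint, and the input shape and opset version are assumptions to adjust for your setup:

```python
import torch
import torchvision.models as models

# load a trained classification model (use your own checkpoint in practice)
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.eval()

# export with a fixed input shape; DeepSparse consumes the resulting file
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "model.onnx", opset_version=13)
```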

Alternatively, for this demonstration, download the sparsified [ResNet-50 - ImageNet](https://sparsezoo.neuralmagic.com/models/cv%2Fclassification%2Fresnet_v1-50%2Fpytorch%2Fsparseml%2Fimagenet%2Fpruned95_uniform_quant-none) ONNX model from SparseZoo:
```bash
sparsezoo.download zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95_uniform_quant-none --save-dir ./image_classification
```
Use the ResNet-50 ONNX model for inference:
```python
from deepsparse import Pipeline

# compile the local ONNX file with batch size 1
pipeline = Pipeline.create(
    task="image_classification",
    model_path="image_classification/model.onnx",  # SparseZoo stub or path to a local ONNX file
)

# run inference on an image file
prediction = pipeline(images=["lion.jpeg"])
print(prediction.labels)
# [291]
```
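The pipeline output also carries confidence scores alongside the predicted labels, as the server response earlier in this document shows. A small sketch pairing them, assuming `scores` is parallel to `labels`:

```python
# print each predicted ImageNet class id with its confidence score
for label, score in zip(prediction.labels, prediction.scores):
    print(label, score)
```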
26 changes: 26 additions & 0 deletions docs/use-cases/cv/image-segmentation-yolact.md
@@ -224,6 +224,32 @@ resp = requests.post(url=url, files=files)
annotations = json.loads(resp.text) # dictionary of annotation results
boxes, classes, masks, scores = annotations["boxes"], annotations["classes"], annotations["masks"], annotations["scores"]
```

### Cross Use Case Functionality

Check out the [Server User Guide](../../user-guide/deepsparse-server.md) for more details on configuring the Server.

## Using a Custom ONNX File
Apart from using models from the SparseZoo, DeepSparse allows you to define custom ONNX files when deploying a model.

The first step is to obtain an ONNX model, typically by exporting your trained model to ONNX.

Download the [YOLACT](https://sparsezoo.neuralmagic.com/models/cv%2Fsegmentation%2Fyolact-darknet53%2Fpytorch%2Fdbolya%2Fcoco%2Fpruned82_quant-none) ONNX model for demonstration:
```bash
sparsezoo.download zoo:cv/segmentation/yolact-darknet53/pytorch/dbolya/coco/pruned82_quant-none --save-dir ./yolact
```
Use the YOLACT ONNX model for inference:
```python
from deepsparse.pipeline import Pipeline

yolact_pipeline = Pipeline.create(
    task="yolact",
    model_path="yolact/model.onnx",
)

images = ["thailand.jpeg"]
predictions = yolact_pipeline(images=images)
# predictions has attributes `boxes`, `classes`, `masks` and `scores`
print(predictions.classes[0])
# [20, 20, ..., 0, 0, 24]
```
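To get a quick sense of what was detected, you can tally the predicted COCO class IDs from the snippet above:

```python
from collections import Counter

# count how many instances of each COCO class id were detected
counts = Counter(predictions.classes[0])
print(counts.most_common(3))
```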
36 changes: 36 additions & 0 deletions docs/use-cases/cv/object-detection-yolov5.md
@@ -285,3 +285,39 @@ print(labels)
### Cross Use Case Functionality

Check out the [Server User Guide](../../user-guide/deepsparse-server.md) for more details on configuring a Server.

## Using a Custom ONNX File
Apart from using models from the SparseZoo, DeepSparse allows you to define custom ONNX files when deploying a model.

The first step is to obtain a YOLOv5 ONNX model; this could be a YOLOv5 model you have trained and exported to ONNX.
Here, we demonstrate by exporting a pretrained YOLOv5 model to ONNX with the `ultralytics` package:
```python
from ultralytics import YOLO

# Load a model
model = YOLO("yolov5nu.pt") # load a pretrained model
success = model.export(format="onnx") # export the model to ONNX format
```
Download a sample image for detection:
```bash
wget -O basilica.jpg https://github.com/raw/neuralmagic/deepsparse/main/src/deepsparse/yolo/sample_images/basilica.jpg
```
Next, run the DeepSparse object detection pipeline with the custom ONNX file:

```python
from deepsparse import Pipeline

# compile the local ONNX file with batch size 1
yolo_pipeline = Pipeline.create(
    task="yolo",
    model_path="yolov5nu.onnx",  # SparseZoo stub or path to a local ONNX file
)
images = ["basilica.jpg"]

# run inference on image file
pipeline_outputs = yolo_pipeline(images=images)
print(pipeline_outputs.boxes)
print(pipeline_outputs.labels)
# [[[-0.8809833526611328, 5.1244752407073975, 27.885415077209473, 57.20366072654724], [-9.014896631240845, -2.4366320967674255, 21.488688468933105, 37.2245477437973], [14.241515636444092, 11.096746131777763, 30.164274215698242, 22.02291651070118], [7.107024908065796, 5.017698150128126, 15.09239387512207, 10.45704211294651]]]
# [['8367.0', '1274.0', '8192.0', '6344.0']]
```
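To work with individual detections, iterate over the parallel output lists. A short sketch, assuming the output also exposes per-box confidence `scores` alongside `boxes` and `labels`:

```python
# pair each box with its label and confidence for the first image
for box, label, score in zip(
    pipeline_outputs.boxes[0], pipeline_outputs.labels[0], pipeline_outputs.scores[0]
):
    print(f"label={label} score={score:.2f} box={box}")
```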
135 changes: 135 additions & 0 deletions docs/use-cases/general/bucketing.md
@@ -0,0 +1,135 @@
<!--
Copyright (c) 2021 - present / Neuralmagic, Inc. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# How to Use Bucketing With DeepSparse
DeepSparse supports bucketing to lower latency and increase the throughput of deep learning pipelines. Grouping sequences of similar lengths into buckets reduces the padding required and therefore speeds up inference.

Input lengths in NLP problems vary. The usual approach is to pick a maximum length, truncate sentences longer than that maximum, and pad shorter ones up to it. For real-world applications this can be inefficient, wasting compute and memory on padding.

Bucketing is a solution that places sequences of varying lengths in different buckets. It is more efficient because it reduces the amount of padding required.

In this document, we will explore how to use bucketing with DeepSparse.

## How Bucketing Works in DeepSparse
DeepSparse handles bucketing natively, saving you the time of building this preprocessing pipeline yourself, and delivers a performance boost over an equivalent pipeline without bucketing. When buckets are provided, DeepSparse compiles a separate model for each of the provided input sizes.

For example, say your input lengths range from 157 to 4,063 tokens with a median of 700, and you are using a model like BERT, whose maximum sequence length is 512. You could define the buckets `[256, 320, 384, 448, 512]`: sequences shorter than 256 tokens are padded to 256, sequences between 256 and 320 tokens are padded to 320, and so on, while sequences longer than 512 are truncated to 512.

At inference time, each input is routed to the model for its bucket; in this case you would have five models because you defined five buckets. Bucketing reduces compute because sequences are no longer all padded to the maximum length in the dataset. Decide on bucket sizes by examining the length distribution of your data and experimenting with different values; the best choice covers the full range of inputs in the dataset. A minimal bucketed pipeline is sketched below.
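Here is a minimal sketch of a bucketed pipeline, reusing the token classification model featured later in this guide; the five bucket sizes are the illustrative values from the example above:

```python
from deepsparse import Pipeline

# passing a list of sizes to sequence_length compiles one model per bucket
pipeline = Pipeline.create(
    task="token_classification",
    model_path="zoo:nlp/token_classification/bert-large/pytorch/huggingface/conll2003/base-none",
    sequence_length=[256, 320, 384, 448, 512],
)

# each input is padded to the smallest bucket that fits it and routed there
print(pipeline("DeepSparse routes this short sentence to the 256-token bucket."))
```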

## Bucketing NLP Models with DeepSparse
DeepSparse makes bucketing easy to set up: you pass the desired bucket sizes, and DeepSparse configures the buckets automatically. You can determine the optimal bucket sizes by analyzing the lengths of the input data and selecting buckets where most of the data lies.

For example, here's the distribution of the [wnut_17](https://huggingface.co/datasets/wnut_17) dataset:
![image](images/wnut.png)
Visualizing the data distribution enables you to choose the best bucket sizes to use.
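To produce a plot like this yourself, tokenize the dataset and histogram the token counts. A minimal sketch, using `bert-base-uncased` as a stand-in tokenizer (swap in your own model's tokenizer):

```python
import matplotlib.pyplot as plt
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # stand-in tokenizer
dataset = load_dataset("wnut_17", split="train")

# token count per sentence; wnut_17 stores sentences as lists of words
lengths = [len(tokenizer(" ".join(words))["input_ids"]) for words in dataset["tokens"]]

plt.hist(lengths, bins=50)
plt.xlabel("Tokens per sentence")
plt.ylabel("Count")
plt.show()
```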

Define a token classification pipeline that uses no buckets; later, you will compare its performance with one that uses buckets. The downloaded `dense-model` folder contains the configuration files for a token classification model, obtained by:
```bash
sparsezoo.download zoo:nlp/token_classification/bert-large/pytorch/huggingface/conll2003/base-none --save-dir ./dense-model
```
The folder contains:
- `config.json`
- `model.onnx`
- `tokenizer.json`

```python
from deepsparse import Pipeline
import deepsparse.transformers
from datasets import load_dataset
from transformers import AutoTokenizer
from tqdm import tqdm
import time

def run(model_path, batch_size, buckets):
    ### SETUP DATASET - in this case, we download WNUT_17
    print("Setting up the dataset:")

    INPUT_COL = "sentences"
    dataset = load_dataset("wnut_17", split="train")
    sentences = []
    for sentence in dataset["tokens"]:
        string = ""
        for elt in sentence:
            string += elt
            string += " "
        sentences.append(string)
    dataset = dataset.add_column(INPUT_COL, sentences)

    ### TOKENIZE DATASET - (used to compute the bucket sizes)
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    def pre_process_fn(examples):
        return tokenizer(examples[INPUT_COL], add_special_tokens=True, return_tensors="np", padding=False, truncation=False)

    dataset = dataset.map(pre_process_fn, batched=True)
    dataset = dataset.add_column("num_tokens", list(map(len, dataset["input_ids"])))
    dataset = dataset.sort("num_tokens")
    max_token_len = dataset[-1]["num_tokens"]

    ### SPLIT DATA INTO BATCHES
    num_pad_items = batch_size - (dataset.num_rows % batch_size)
    inputs = ([""] * num_pad_items) + dataset[INPUT_COL]
    batches = []
    for b_index_start in range(0, len(inputs), batch_size):
        batches.append(inputs[b_index_start:b_index_start+batch_size])

    ### RUN THROUGHPUT TESTING
    print("\nCompiling models:")

    # compile a model for each bucket, with the dataset maximum as the last bucket
    buckets.append(max_token_len)
    ds_pipeline = Pipeline.create(
        "token_classification",
        model_path=model_path,
        batch_size=batch_size,
        sequence_length=buckets,
    )

    print("\nRunning test:")

    # run inferences on the dataset
    start = time.perf_counter()

    predictions = []
    for batch in tqdm(batches):
        predictions.append(ds_pipeline(batch))

    # flatten and remove padded predictions
    predictions = [pred for sublist in predictions for pred in sublist.predictions]
    predictions = predictions[num_pad_items:]
    end = time.perf_counter()

    # compute throughput
    total_time_executing = (end - start) * 1000.0  # milliseconds
    items_per_sec = len(predictions) / (total_time_executing / 1000.0)

    print(f"Items Per Second: {items_per_sec}")
    print(f"Program took: {total_time_executing} ms")
    return predictions

# point the pipeline at the downloaded model folder (adjust the path if the
# files landed in a "deployment" subfolder)
predictions = run("dense-model", 64, [])
# Items Per Second: 6.0998544593741395
# Program took: 556406.7179970443 ms
```

Run the same script, this time with bucket sizes defined:
```python
batch_size = 64
buckets = [15, 35, 55, 75]
predictions = run("dense-model", batch_size, buckets)
# Items Per Second: 10.46572543802951
# Program took: 324296.67872493155 ms
```
The pipeline using buckets processes about 1.7 times as many items per second as the one without.
Binary file added docs/use-cases/general/images/wnut.png
