Merge remote-tracking branch 'origin/main' into rename-llm-variables

neuralmagic · Sep 1, 2023 · 5f4865f
2 parents 97532b1 + fa73631
commit 5f4865f
Showing 34 changed files with 2,618 additions and 267 deletions.
diff --git a/docs/use-cases/cv/embedding-extraction.md b/docs/use-cases/cv/embedding-extraction.md
@@ -106,3 +106,31 @@ print(len(result["embeddings"][0][0]))
 
 ### Cross Use Case Functionality
 Check out the [Server User Guide](../../user-guide/deepsparse-server.md) for more details on configuring the Server.
+
+## Using a Custom ONNX File 
+Apart from using models from the SparseZoo, DeepSparse allows you to define custom ONNX files for embedding extraction. 
+
+The first step is to obtain the ONNX model. You can obtain the file by converting your model to ONNX after training. 
+
+Download the [ResNet-50 - ImageNet](https://sparsezoo.neuralmagic.com/models/cv%2Fclassification%2Fresnet_v1-50%2Fpytorch%2Fsparseml%2Fimagenet%2Fpruned95_uniform_quant-none) ONNX model for demonstration:
+
+```bash
+sparsezoo.download zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95_uniform_quant-none --save-dir ./embedding-extraction
+```
+Use the ResNet-50 ONNX model for embedding extraction:
+```python
+from deepsparse import Pipeline
+
+# this step removes the projection head before compiling the model
+rn50_embedding_pipeline = Pipeline.create(
+    task="embedding-extraction",
+    base_task="image-classification", # tells the pipeline to expect images and normalize input with ImageNet means/stds
+    model_path="embedding-extraction/model.onnx",
+    emb_extraction_layer=-3, # extracts last layer before projection head and softmax
+)
+
+# this step runs pre-processing, inference and returns an embedding
+embedding = rn50_embedding_pipeline(images="lion.jpeg")
+print(len(embedding.embeddings[0][0]))
+# 2048
+```
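Note on `docs/use-cases/cv/embedding-extraction.md`: the added example stops at printing the embedding length (2048). As a hedged illustration of how such vectors are commonly compared, here is a minimal pure-Python cosine-similarity sketch. The function name `cosine_similarity` and the short stand-in vectors are hypothetical and are not part of DeepSparse; the sketch assumes each embedding can be treated as a flat sequence of floats.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical 4-dimensional stand-ins for the 2048-dimensional embeddings
emb_query = [0.2, 0.9, 0.1, 0.4]
emb_similar = [0.1, 0.8, 0.0, 0.5]
emb_unrelated = [0.9, 0.0, 0.7, 0.0]

print(cosine_similarity(emb_query, emb_similar))    # close to 1: similar direction
print(cosine_similarity(emb_query, emb_unrelated))  # much lower: different direction
```

With real pipeline output, the two arguments would be the embedding vectors of two different images, under the assumption stated above about their shape.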
diff --git a/docs/use-cases/cv/image-classification.md b/docs/use-cases/cv/image-classification.md
@@ -259,6 +259,31 @@ resp = requests.post(url=url, files=files)
 print(resp.text)
 # {"labels":[291,260,244],"scores":[24.185693740844727,18.982254028320312,16.390701293945312]}
 ```
+
 ### Cross Use Case Functionality
 
 Check out the [Server User Guide](../../user-guide/deepsparse-server.md) for more details on configuring the Server.
+## Using a Custom ONNX File 
+Apart from using models from the SparseZoo, DeepSparse allows you to define custom ONNX files when deploying a model. 
+
+The first step is to obtain the ONNX model. You can obtain the file by converting your model to ONNX after training. 
+
+Download the [ResNet-50 - ImageNet](https://sparsezoo.neuralmagic.com/models/cv%2Fclassification%2Fresnet_v1-50%2Fpytorch%2Fsparseml%2Fimagenet%2Fpruned95_uniform_quant-none) ONNX model for demonstration:
+```bash
+sparsezoo.download zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95_uniform_quant-none --save-dir ./image_classification
+```
+Use the ResNet-50 ONNX model for inference:
+```python
+from deepsparse import Pipeline
+
+# compile the local ONNX file with batch size 1
+pipeline = Pipeline.create(
+  task="image_classification",
+  model_path="image_classification/model.onnx",   # sparsezoo stub or path to local ONNX
+)
+
+# run inference on image file
+prediction = pipeline(images=["lion.jpeg"])
+print(prediction.labels)
+# [291]
+```
diff --git a/docs/use-cases/cv/image-segmentation-yolact.md b/docs/use-cases/cv/image-segmentation-yolact.md
@@ -224,6 +224,32 @@ resp = requests.post(url=url, files=files)
 annotations = json.loads(resp.text) # dictionary of annotation results
 boxes, classes, masks, scores = annotations["boxes"], annotations["classes"], annotations["masks"], annotations["scores"]
 ```
+
 ### Cross Use Case Functionality
 
 Check out the [Server User Guide](../../user-guide/deepsparse-server.md) for more details on configuring the Server.
+
+## Using a Custom ONNX File 
+Apart from using models from the SparseZoo, DeepSparse allows you to define custom ONNX files when deploying a model. 
+
+The first step is to obtain the ONNX model. You can obtain the file by converting your model to ONNX after training. 
+
+Download the [YOLACT](https://sparsezoo.neuralmagic.com/models/cv%2Fsegmentation%2Fyolact-darknet53%2Fpytorch%2Fdbolya%2Fcoco%2Fpruned82_quant-none) ONNX model for demonstration:
+```bash
+sparsezoo.download zoo:cv/segmentation/yolact-darknet53/pytorch/dbolya/coco/pruned82_quant-none --save-dir ./yolact
+```
+Use the YOLACT ONNX model for inference: 
+```python
+from deepsparse.pipeline import Pipeline
+
+yolact_pipeline = Pipeline.create(
+    task="yolact",
+    model_path="yolact/model.onnx",
+)
+
+images = ["thailand.jpeg"]
+predictions = yolact_pipeline(images=images)
+# predictions has attributes `boxes`, `classes`, `masks` and `scores`
+predictions.classes[0]
+# [20,20, .......0, 0,24]
+```
diff --git a/docs/use-cases/cv/object-detection-yolov5.md b/docs/use-cases/cv/object-detection-yolov5.md
@@ -285,3 +285,39 @@ print(labels)
 ### Cross Use Case Functionality
 
 Check out the [Server User Guide](../../user-guide/deepsparse-server.md) for more details on configuring a Server.
+## Using a Custom ONNX File 
+Apart from using models from the SparseZoo, DeepSparse allows you to define custom ONNX files when deploying a model. 
+
+The first step is to obtain the YOLOv5 ONNX model. This could be a YOLOv5 model you have trained and converted to ONNX. 
+In this case, let's demonstrate by converting a YOLOv5 model to ONNX using the `ultralytics` package: 
+```python
+from ultralytics import YOLO
+
+# Load a model
+model = YOLO("yolov5nu.pt")  # load a pretrained model
+success = model.export(format="onnx")  # export the model to ONNX format
+```
+Download a sample image for detection: 
+```bash
+wget -O basilica.jpg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/yolo/sample_images/basilica.jpg
+
+```
+Next, run the DeepSparse object detection pipeline with the custom ONNX file:
+
+```python
+from deepsparse import Pipeline
+
+# compile the exported local ONNX file with batch size 1
+yolo_pipeline = Pipeline.create(
+  task="yolo",
+  model_path="yolov5nu.onnx",   # sparsezoo stub or path to local ONNX
+)
+images = ["basilica.jpg"]
+
+# run inference on image file
+pipeline_outputs = yolo_pipeline(images=images)
+print(pipeline_outputs.boxes)
+print(pipeline_outputs.labels)
+# [[[-0.8809833526611328, 5.1244752407073975, 27.885415077209473, 57.20366072654724], [-9.014896631240845, -2.4366320967674255, 21.488688468933105, 37.2245477437973], [14.241515636444092, 11.096746131777763, 30.164274215698242, 22.02291651070118], [7.107024908065796, 5.017698150128126, 15.09239387512207, 10.45704211294651]]]
+# [['8367.0', '1274.0', '8192.0', '6344.0']]
+```
diff --git a/docs/use-cases/general/bucketing.md b/docs/use-cases/general/bucketing.md
@@ -0,0 +1,135 @@
+<!--
+Copyright (c) 2021 - present / Neuralmagic, Inc. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+# How to Use Bucketing With DeepSparse 
+DeepSparse supports bucketing to lower latency and increase the throughput of deep learning pipelines. Bucketing sequences of different sizes increases inference speed.
+
+Input lengths in NLP problems can vary. A common approach is to select a maximum length, truncate sentences longer than that length, and pad shorter ones up to it. This can be inefficient in real-world applications because the padding increases memory utilization.  
+
+Bucketing is a solution that places sequences of varying lengths in different buckets. It is more efficient because it reduces the amount of padding required. 
+
+In this document, we will explore how to use bucketing with DeepSparse. 
+
+## How Bucketing Works in DeepSparse 
+DeepSparse handles bucketing natively to reduce the time you would otherwise spend building this preprocessing pipeline. Bucketing with DeepSparse leads to a performance boost compared to a pipeline without bucketing. When buckets are provided, DeepSparse will create different models for the provided input sizes.
+
+For example, if your input lengths range from 157 to 4063 tokens, with 700 being the median, and you are using a model like BERT, whose maximum sequence length is 512 tokens, you can use these bucket sizes: [256, 320, 384, 448, 512]. Sequences of up to 256 tokens are padded to 256, sequences of 257 to 320 tokens are padded to 320, and so on, while sequences longer than 512 tokens are truncated to 512. 
+
+At inference, each input is sent to the corresponding bucketed model. In this case, you’d have 5 models because you have defined 5 buckets. Bucketing reduces the amount of compute because you are no longer padding all the sequences to the maximum length in the dataset. You can decide on the bucket sizes by examining the distribution of the dataset and experimenting with different sizes. The best choice is the one that covers all the inputs in the range of the dataset. 
+
+## Bucketing NLP Models with DeepSparse 
+DeepSparse makes it easy to set up bucketing. You pass the desired bucket sizes, and DeepSparse will automatically set up the buckets. You can determine the optimal size of the buckets by analyzing the lengths of the input data and selecting buckets where most of the data lies. 
+
+For example, here's the distribution of the [wnut_17](https://huggingface.co/datasets/wnut_17) dataset: 
+![image](images/wnut.png)
+Visualizing the data distribution enables you to choose the best bucket sizes to use. 
+
+Define a token classification pipeline that uses no buckets; later, you will compare its performance with one that uses buckets. The `deployment` folder contains the model configuration files for a token classification model obtained by:
+```bash 
+sparsezoo.download zoo:nlp/token_classification/bert-large/pytorch/huggingface/conll2003/base-none --save-dir ./dense-model
+```
+The folder contains:
+- `config.json`
+- `model.onnx`
+- `tokenizer.json`
+
+```python
+from deepsparse import Pipeline
+import deepsparse.transformers
+from datasets import load_dataset
+from transformers import AutoTokenizer
+from tqdm import tqdm
+import time
+
+def run(model_path, batch_size, buckets):
+    ### SETUP DATASETS - in this case, we download WNUT_17
+    print("Setting up the dataset:")
+
+    INPUT_COL = "sentences"
+    dataset = load_dataset("wnut_17", split="train")
+    sentences = []
+    for sentence in dataset["tokens"]:
+        string = ""
+        for elt in sentence:
+            string += elt
+            string += " "
+        sentences.append(string)
+    dataset = dataset.add_column(INPUT_COL, sentences)
+
+    ### TOKENIZE DATASET - (used to compute buckets)
+    tokenizer = AutoTokenizer.from_pretrained(model_path)
+    def pre_process_fn(examples):
+        return tokenizer(examples[INPUT_COL], add_special_tokens=True, return_tensors="np",padding=False,truncation=False)
+
+    dataset = dataset.map(pre_process_fn, batched=True)
+    dataset = dataset.add_column("num_tokens", list(map(len, dataset["input_ids"])))
+    dataset = dataset.sort("num_tokens")
+    max_token_len = dataset[-1]["num_tokens"]
+
+    ### SPLIT DATA INTO BATCHES
+    num_pad_items = (batch_size - dataset.num_rows % batch_size) % batch_size  # 0 when rows divide evenly
+    inputs = ([""] * num_pad_items) + dataset[INPUT_COL]
+    batches = []
+    for b_index_start in range(0, len(inputs), batch_size):
+        batches.append(inputs[b_index_start:b_index_start+batch_size])
+
+    ### RUN THROUGHPUT TESTING
+    print("\nCompiling models:")
+
+    # compile model with buckets
+    buckets.append(max_token_len)
+    ds_pipeline = Pipeline.create(
+        "token_classification",
+        model_path=model_path, 
+        batch_size=batch_size,
+        sequence_length=buckets,
+        )
+
+    print("\nRunning test:")
+
+    # run inferences on the dataset
+    start = time.perf_counter()
+
+    predictions = []
+    for batch in tqdm(batches): 
+        predictions.append(ds_pipeline(batch))
+
+    # flatten and remove padded predictions
+    predictions = [pred for sublist in predictions for pred in sublist.predictions]
+    predictions = predictions[num_pad_items:]
+    end = time.perf_counter()
+
+    # compute throughput (items per second) and total time (milliseconds)
+    total_time_executing = (end - start) * 1000.0 
+    items_per_sec = len(predictions) / (end - start)
+
+    print(f"Items Per Second: {items_per_sec:.2f}")
+    print(f"Program took: {total_time_executing} ms")
+    return predictions
+
+predictions = run("token_classification", 64, [])
+# Items Per Second: 6.10
+# Program took: 556406.7179970443 ms
+```
+
+Run the same script with buckets for the varying input lengths: 
+```python
+batch_size = 64
+buckets = [15,35,55,75]
+predictions = run("token_classification", batch_size, buckets)
+# Items Per Second: 10.47
+# Program took: 324296.67872493155 ms
+```
+The pipeline using buckets processes about 1.7 times as many items per second as the one without. 
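Note on `docs/use-cases/general/bucketing.md`: the new page describes padding each input to the smallest bucket that fits and truncating anything longer than the largest bucket. The sketch below illustrates that rule in plain Python and counts the padding saved compared with a single maximum length. It is not DeepSparse's internal implementation; the helper names `assign_bucket` and `padding_tokens` and the sample token counts are hypothetical.

```python
from bisect import bisect_left


def assign_bucket(num_tokens, buckets):
    """Smallest bucket >= num_tokens; inputs longer than every bucket get the largest one."""
    ordered = sorted(buckets)
    idx = bisect_left(ordered, num_tokens)
    return ordered[min(idx, len(ordered) - 1)]


def padding_tokens(lengths, buckets):
    """Total pad tokens added when each input is padded to its bucket (truncated inputs add none)."""
    return sum(max(assign_bucket(n, buckets) - n, 0) for n in lengths)


buckets = [256, 320, 384, 448, 512]
lengths = [157, 256, 257, 400, 700, 4063]  # hypothetical token counts

for n in lengths:
    print(f"{n} tokens -> bucket {assign_bucket(n, buckets)}")

print("padding with buckets:", padding_tokens(lengths, buckets))
print("padding with a single length of 512:", padding_tokens(lengths, [512]))
```

For these sample lengths, the bucketed layout adds far fewer pad tokens than padding everything to 512, which is the effect the page attributes to bucketing.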
diff --git a/docs/use-cases/general/images/wnut.png b/docs/use-cases/general/images/wnut.png