diff --git a/README.md b/README.md index 08586df612..4b49c3cfbe 100644 --- a/README.md +++ b/README.md @@ -52,155 +52,146 @@ limitations under the License. -A CPU runtime that takes advantage of sparsity within neural networks to reduce compute. Read [more about sparsification](https://docs.neuralmagic.com/user-guides/sparsification). -Neural Magic's DeepSparse is able to integrate into popular deep learning libraries (e.g., Hugging Face, Ultralytics) allowing you to leverage DeepSparse for loading and deploying sparse models with ONNX. -ONNX gives the flexibility to serve your model in a framework-agnostic environment. -Support includes [PyTorch,](https://pytorch.org/docs/stable/onnx.html) [TensorFlow,](https://github.com/onnx/tensorflow-onnx) [Keras,](https://github.com/onnx/keras-onnx) and [many other frameworks](https://github.com/onnx/onnxmltools). +[DeepSparse](https://github.com/neuralmagic/deepsparse) is a CPU inference runtime that takes advantage of sparsity within neural networks to execute inference quickly. Coupled with [SparseML](https://github.com/neuralmagic/sparseml), an open-source optimization library, DeepSparse enables you to achieve GPU-class performance on commodity hardware. + +

+ NM Flow +

+ +For details of training a sparse model for deployment with DeepSparse, [check out SparseML](https://github.com/neuralmagic/sparseml). ## Installation -Install DeepSparse Community as follows: +DeepSparse is available in two editions: +1. DeepSparse Community is free for evaluation, research, and non-production use with our [DeepSparse Community License](https://neuralmagic.com/legal/engine-license-agreement/). +2. DeepSparse Enterprise requires a [trial license](https://neuralmagic.com/deepsparse-free-trial/) or [can be fully licensed](https://neuralmagic.com/legal/master-software-license-and-service-agreement/) for production, commercial applications. + +#### Install via Docker (Recommended) + +DeepSparse Community is available as a container image hosted on [GitHub container registry](https://github.com/neuralmagic/deepsparse/pkgs/container/deepsparse). ```bash -pip install deepsparse +docker pull ghcr.io/neuralmagic/deepsparse:1.4.2 +docker tag ghcr.io/neuralmagic/deepsparse:1.4.2 deepsparse-docker +docker run -it deepsparse-docker ``` -DeepSparse is available in two editions: -1. [**DeepSparse Community**](#installation) is open-source and free for evaluation, research, and non-production use with our [DeepSparse Community License](https://neuralmagic.com/legal/engine-license-agreement/). -2. [**DeepSparse Enterprise**](https://docs.neuralmagic.com/products/deepsparse-ent) requires a Trial License or [can be fully licensed](https://neuralmagic.com/legal/master-software-license-and-service-agreement/) for production, commercial applications. - -## 🧰 Hardware Support and System Requirements +- [Check out the Docker page](https://github.com/neuralmagic/deepsparse/tree/main/docker/) for more details. -To ensure that your CPU is compatible with DeepSparse, it is recommended to review the [Supported Hardware for DeepSparse](https://docs.neuralmagic.com/user-guides/deepsparse-engine/hardware-support) documentation. +#### Install via PyPI +DeepSparse Community is also available via PyPI. We recommend using a virtual enviornment. -To ensure that you get the best performance from DeepSparse, it has been thoroughly tested on Python versions 3.7-3.10, ONNX versions 1.5.0-1.12.0, ONNX opset version 11 or higher, and manylinux compliant systems. It is highly recommended to use a [virtual environment](https://docs.python.org/3/library/venv.html) when running DeepSparse. Please note that DeepSparse is only supported natively on Linux. For those using Mac or Windows, running Linux in a Docker or virtual machine is necessary to use DeepSparse. +```bash +pip install deepsparse +``` -## Features +- [Check out the Installation page](https://github.com/neuralmagic/deepsparse/tree/main/docs/user-guide/installation.md) for optional dependencies. -- 👩‍💻 Pipelines for [NLP](https://github.com/neuralmagic/deepsparse/tree/main/src/deepsparse/transformers), [CV Classification](https://github.com/neuralmagic/deepsparse/tree/main/src/deepsparse/image_classification), [CV Detection](https://github.com/neuralmagic/deepsparse/tree/main/src/deepsparse/yolo), [CV Segmentation](https://github.com/neuralmagic/deepsparse/tree/main/src/deepsparse/yolact) and more! 
-- 🔌 [DeepSparse Server](https://github.com/neuralmagic/deepsparse/tree/main/src/deepsparse/server) -- 📜 [DeepSparse Benchmark](https://github.com/neuralmagic/deepsparse/tree/main/src/deepsparse/benchmark) -- ☁️ [Cloud Deployments and Demos](https://github.com/neuralmagic/deepsparse/tree/main/examples) +## Hardware Support and System Requirements -### 👩‍💻 Pipelines +[Supported Hardware for DeepSparse](https://github.com/neuralmagic/deepsparse/tree/main/docs/user-guide/hardware-support.md) -Pipelines are a high-level Python interface for running inference with DeepSparse across select tasks in NLP and CV: +DeepSparse is tested on Python versions 3.7-3.10, ONNX versions 1.5.0-1.12.0, ONNX opset version 11 or higher, and manylinux compliant systems. Please note that DeepSparse is only supported natively on Linux. For those using Mac or Windows, running Linux in a Docker or virtual machine is necessary to use DeepSparse. -| NLP | CV | -|-----------------------|---------------------------| -| Text Classification `"text_classification"` | Image Classification `"image_classification"` | -| Token Classification `"token_classification"` | Object Detection `"yolo"` | -| Sentiment Analysis `"sentiment_analysis"` | Instance Segmentation `"yolact"` | -| Question Answering `"question_answering"` | Keypoint Detection `"open_pif_paf"` | -| MultiLabel Text Classification `"text_classification"` | | -| Document Classification `"text_classification"` | | -| Zero-Shot Text Classification `"zero_shot_text_classification"` | | +## Deployment APIs +DeepSparse includes three deployment APIs: -**NLP Example** | Question Answering -```python -from deepsparse import Pipeline +- **Engine** is the lowest-level API. With Engine, you pass tensors and receive the raw logits. +- **Pipeline** wraps the Engine with pre- and post-processing. With Pipeline, you pass raw data and receive the prediction. +- **Server** wraps Pipelines with a REST API using FastAPI. With Server, you send raw data over HTTP and receive the prediction. -qa_pipeline = Pipeline.create( - task="question-answering", - model_path="zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/12layer_pruned80_quant-none-vnni", -) +### Engine -inference = qa_pipeline(question="What's my name?", context="My name is Snorlax") -``` -**CV Example** | Image Classification +The example below downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo, compiles the model, and runs inference on randomly generated input. ```python -from deepsparse import Pipeline +from deepsparse import Engine +from deepsparse.utils import generate_random_inputs, model_to_path -cv_pipeline = Pipeline.create( - task='image_classification', - model_path='zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95-none', -) +# download onnx, compile +zoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none" +batch_size = 1 +compiled_model = Engine(model=zoo_stub, batch_size=batch_size) + +# run inference (input is raw numpy tensors, output is raw scores) +inputs = generate_random_inputs(model_to_path(zoo_stub), batch_size) +output = compiled_model(inputs) +print(output) -input_image = "my_image.png" -inference = cv_pipeline(images=input_image) +# > [array([[-0.3380675 , 0.09602544]], dtype=float32)] << raw scores ``` +### DeepSparse Pipelines -### 🔌 DeepSparse Server +Pipeline is the default API for interacting with DeepSparse. 
Similar to Hugging Face Pipelines, DeepSparse Pipelines wrap Engine with pre- and post-processing (as well as other utilities), enabling you to send raw data to DeepSparse and receive the post-processed prediction. -DeepSparse Server is a tool that enables you to serve your models and pipelines directly from your terminal. +The example below downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo, sets up a pipeline, and runs inference on sample data. -The server is built on top of two powerful libraries: the FastAPI web framework and the Uvicorn web server. This combination ensures that DeepSparse Server delivers excellent performance and reliability. Install with this command: +```python +from deepsparse import Pipeline -```bash -pip install deepsparse[server] +# download onnx, set up pipeline +zoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none" +sentiment_analysis_pipeline = Pipeline.create( + task="sentiment-analysis", # name of the task + model_path=zoo_stub, # zoo stub or path to local onnx file +) + +# run inference (input is a sentence, output is the prediction) +prediction = sentiment_analysis_pipeline("I love using DeepSparse Pipelines") +print(prediction) +# > labels=['positive'] scores=[0.9954759478569031] ``` -#### Single Model +#### Additional Resources +- Check out the [Use Cases Page](https://github.com/neuralmagic/deepsparse/tree/main/docs/use-cases) for more details on supported tasks. +- Check out the [Pipelines User Guide](https://github.com/neuralmagic/deepsparse/tree/main/docs/user-guide/deepsparse-pipelines.md) for more usage details. -Once installed, the following example CLI command is available for running inference with a single BERT model: +### DeepSparse Server -```bash -deepsparse.server \ - task question_answering \ - --model_path "zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/12layer_pruned80_quant-none-vnni" -``` +Server wraps Pipelines with REST APIs, enabling you to stand up model serving endpoint running DeepSparse. This enables you to send raw data to DeepSparse over HTTP and receive the post-processed predictions. -To look up arguments run: `deepsparse.server --help`. - -#### Multiple Models -To deploy multiple models in your setup, a `config.yaml` file should be created. In the example provided, two BERT models are configured for the question-answering task: - -```yaml -num_workers: 1 -endpoints: - - task: question_answering - route: /predict/question_answering/base - model: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/base-none - batch_size: 1 - - task: question_answering - route: /predict/question_answering/pruned_quant - model: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/12layer_pruned80_quant-none-vnni - batch_size: 1 -``` +DeepSparse Server is launched from the command line, configured via arguments or a server configuration file. 
The following downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format from SparseZoo and launches a sentiment analysis endpoint: -After the `config.yaml` file has been created, the server can be started by passing the file path as an argument: ```bash -deepsparse.server config config.yaml +deepsparse.server \ + --task sentiment-analysis \ + --model_path zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none ``` -Read the [DeepSparse Server](https://github.com/neuralmagic/deepsparse/tree/main/src/deepsparse/server) README for further details. - -### 📜 DeepSparse Benchmark +Sending a request: -DeepSparse Benchmark, a command-line (CLI) tool, is used to evaluate the DeepSparse Engine's performance with ONNX models. This tool processes arguments, downloads and compiles the network into the engine, creates input tensors, and runs the model based on the selected scenario. - -Run `deepsparse.benchmark -h` to look up arguments: +```python +import requests -```shell -deepsparse.benchmark [-h] [-b BATCH_SIZE] [-i INPUT_SHAPES] [-ncores NUM_CORES] [-s {async,sync,elastic}] [-t TIME] - [-w WARMUP_TIME] [-nstreams NUM_STREAMS] [-pin {none,core,numa}] [-e ENGINE] [-q] [-x EXPORT_PATH] - model_path +url = "http://localhost:5543/predict" # Server's port default to 5543 +obj = {"sequences": "Snorlax loves my Tesla!"} +response = requests.post(url, json=obj) +print(response.text) +# {"labels":["positive"],"scores":[0.9965094327926636]} ``` +#### Additional Resources +- Check out the [Use Cases Page](https://github.com/neuralmagic/deepsparse/tree/main/docs/use-cases) for more details on supported tasks. +- Check out the [Server User Guide](https://github.com/neuralmagic/deepsparse/tree/main/docs/user-guide/deepsparse-server.md) for more usage details. -Refer to the [Benchmark](https://github.com/neuralmagic/deepsparse/tree/main/src/deepsparse/benchmark) README for examples of specific inference scenarios. +## ONNX -### 🦉 Custom ONNX Model Support +DeepSparse accepts models in the ONNX format. ONNX models can be passed in one of two ways: -DeepSparse is capable of accepting ONNX models from two sources: +- **SparseZoo Stub**: [SparseZoo](https://sparsezoo.neuralmagic.com/) is an open-source repository of sparse models. The examples on this page use SparseZoo stubs to identify models and download them for deployment in DeepSparse. -**SparseZoo ONNX**: This is an open-source repository of sparse models available for download. [SparseZoo](https://github.com/neuralmagic/sparsezoo) offers inference-optimized models, which are trained using repeatable sparsification recipes and state-of-the-art techniques from [SparseML](https://github.com/neuralmagic/sparseml). - -**Custom ONNX**: Users can provide their own ONNX models, whether dense or sparse. By plugging in a custom model, users can compare its performance with other solutions. +- **Local ONNX File**: Users can provide their own ONNX models, whether dense or sparse. 
For example: ```bash -> wget https://github.com/onnx/models/raw/main/vision/classification/mobilenet/model/mobilenetv2-7.onnx -Saving to: ‘mobilenetv2-7.onnx’ +wget https://github.com/onnx/models/raw/main/vision/classification/mobilenet/model/mobilenetv2-7.onnx ``` -Custom ONNX Benchmark example: ```python -from deepsparse import compile_model +from deepsparse import Engine from deepsparse.utils import generate_random_inputs onnx_filepath = "mobilenetv2-7.onnx" batch_size = 16 @@ -209,34 +200,35 @@ batch_size = 16 inputs = generate_random_inputs(onnx_filepath, batch_size) # Compile and run -engine = compile_model(onnx_filepath, batch_size) -outputs = engine.run(inputs) +compiled_model = Engine(model=onnx_filepath, batch_size=batch_size) +outputs = compiled_model(inputs) +print(outputs[0].shape) +# (16, 1000) << batch, num_classes ``` -The [GitHub repository](https://github.com/neuralmagic/deepsparse) repository contains package APIs and examples that help users swiftly begin benchmarking and performing inference on sparse models. - -### Scheduling Single-Stream, Multi-Stream, and Elastic Inference +## Inference Modes -DeepSparse offers different inference scenarios based on your use case. Read more details here: [Inference Types](https://github.com/neuralmagic/deepsparse/blob/main/docs/source/scheduler.md). +DeepSparse offers different inference scenarios based on your use case. -⚡ **Single-stream** scheduling: the latency/synchronous scenario, requests execute serially. [`default`] +**Single-stream** scheduling: the latency/synchronous scenario, requests execute serially. [`default`] single stream diagram It's highly optimized for minimum per-request latency, using all of the system's resources provided to it on every request it gets. -⚡ **Multi-stream** scheduling: the throughput/asynchronous scenario, requests execute in parallel. +**Multi-stream** scheduling: the throughput/asynchronous scenario, requests execute in parallel. multi stream diagram The most common use cases for the multi-stream scheduler are where parallelism is low with respect to core count, and where requests need to be made asynchronously without time to batch them. -## Resources -#### Libraries -- [DeepSparse](https://docs.neuralmagic.com/deepsparse/) -- [SparseML](https://docs.neuralmagic.com/sparseml/) -- [SparseZoo](https://docs.neuralmagic.com/sparsezoo/) -- [Sparsify](https://docs.neuralmagic.com/sparsify/) +- [Check out the Scheduler User Guide](https://github.com/neuralmagic/deepsparse/tree/main/docs/user-guide/scheduler.md) for more details. + +## Additional Resources +- [Benchmarking Performance](https://github.com/neuralmagic/deepsparse/tree/main/docs/user-guide/deepsparse-benchmarking.md) +- [User Guide](https://github.com/neuralmagic/deepsparse/tree/main/docs/user-guide) +- [Use Cases](https://github.com/neuralmagic/deepsparse/tree/main/docs/use-cases) +- [Cloud Deployments and Demos](https://github.com/neuralmagic/deepsparse/tree/main/examples/) #### Versions - [DeepSparse](https://pypi.org/project/deepsparse) | stable @@ -251,7 +243,6 @@ The most common use cases for the multi-stream scheduler are where parallelism i ### Be Part of the Future... And the Future is Sparse! - Contribute with code, examples, integrations, and documentation as well as bug reports and feature requests! 
[Learn how here.](https://github.com/neuralmagic/deepsparse/blob/main/CONTRIBUTING.md) For user help or questions about DeepSparse, sign up or log in to our **[Deep Sparse Community Slack](https://join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ)**. We are growing the community member by member and happy to see you there. Bugs, feature requests, or additional questions can also be posted to our [GitHub Issue Queue.](https://github.com/neuralmagic/deepsparse/issues) You can get the latest news, webinar and event invites, research papers, and other ML Performance tidbits by [subscribing](https://neuralmagic.com/subscribe/) to the Neural Magic community. diff --git a/docs/neural-magic-workflow.png b/docs/neural-magic-workflow.png new file mode 100644 index 0000000000..f870b4e97c Binary files /dev/null and b/docs/neural-magic-workflow.png differ diff --git a/docs/_static/css/nm-theme-adjustment.css b/docs/old/_static/css/nm-theme-adjustment.css similarity index 100% rename from docs/_static/css/nm-theme-adjustment.css rename to docs/old/_static/css/nm-theme-adjustment.css diff --git a/docs/_templates/versions.html b/docs/old/_templates/versions.html similarity index 100% rename from docs/_templates/versions.html rename to docs/old/_templates/versions.html diff --git a/docs/api/.gitkeep b/docs/old/api/.gitkeep similarity index 100% rename from docs/api/.gitkeep rename to docs/old/api/.gitkeep diff --git a/docs/api/deepsparse.rst b/docs/old/api/deepsparse.rst similarity index 100% rename from docs/api/deepsparse.rst rename to docs/old/api/deepsparse.rst diff --git a/docs/api/deepsparse.transformers.rst b/docs/old/api/deepsparse.transformers.rst similarity index 100% rename from docs/api/deepsparse.transformers.rst rename to docs/old/api/deepsparse.transformers.rst diff --git a/docs/api/deepsparse.utils.rst b/docs/old/api/deepsparse.utils.rst similarity index 100% rename from docs/api/deepsparse.utils.rst rename to docs/old/api/deepsparse.utils.rst diff --git a/docs/api/modules.rst b/docs/old/api/modules.rst similarity index 100% rename from docs/api/modules.rst rename to docs/old/api/modules.rst diff --git a/docs/conf.py b/docs/old/conf.py similarity index 100% rename from docs/conf.py rename to docs/old/conf.py diff --git a/docs/debugging-optimizing/diagnostics-debugging.md b/docs/old/debugging-optimizing/diagnostics-debugging.md similarity index 100% rename from docs/debugging-optimizing/diagnostics-debugging.md rename to docs/old/debugging-optimizing/diagnostics-debugging.md diff --git a/docs/debugging-optimizing/example-log.md b/docs/old/debugging-optimizing/example-log.md similarity index 100% rename from docs/debugging-optimizing/example-log.md rename to docs/old/debugging-optimizing/example-log.md diff --git a/docs/debugging-optimizing/index.rst b/docs/old/debugging-optimizing/index.rst similarity index 100% rename from docs/debugging-optimizing/index.rst rename to docs/old/debugging-optimizing/index.rst diff --git a/docs/debugging-optimizing/numactl-utility.md b/docs/old/debugging-optimizing/numactl-utility.md similarity index 100% rename from docs/debugging-optimizing/numactl-utility.md rename to docs/old/debugging-optimizing/numactl-utility.md diff --git a/docs/favicon.ico b/docs/old/favicon.ico similarity index 100% rename from docs/favicon.ico rename to docs/old/favicon.ico diff --git a/docs/index.rst b/docs/old/index.rst similarity index 100% rename from docs/index.rst rename to docs/old/index.rst diff --git a/docs/source/c++api-overview.md 
b/docs/old/source/c++api-overview.md similarity index 100% rename from docs/source/c++api-overview.md rename to docs/old/source/c++api-overview.md diff --git a/docs/source/hardware.md b/docs/old/source/hardware.md similarity index 100% rename from docs/source/hardware.md rename to docs/old/source/hardware.md diff --git a/docs/source/icon-deepsparse.png b/docs/old/source/icon-deepsparse.png similarity index 100% rename from docs/source/icon-deepsparse.png rename to docs/old/source/icon-deepsparse.png diff --git a/docs/source/multi-stream.png b/docs/old/source/multi-stream.png similarity index 100% rename from docs/source/multi-stream.png rename to docs/old/source/multi-stream.png diff --git a/docs/source/scheduler.md b/docs/old/source/scheduler.md similarity index 100% rename from docs/source/scheduler.md rename to docs/old/source/scheduler.md diff --git a/docs/source/single-stream.png b/docs/old/source/single-stream.png similarity index 100% rename from docs/source/single-stream.png rename to docs/old/source/single-stream.png diff --git a/docs/use-cases/README.md b/docs/use-cases/README.md new file mode 100644 index 0000000000..8d7532d398 --- /dev/null +++ b/docs/use-cases/README.md @@ -0,0 +1,90 @@ + + +# Use Cases + +There are three interfaces for interacting with DeepSparse: + +- **Engine** is the lowest-level API that enables you to compile a model and run inference on raw input tensors. + +- **Pipeline** is the default DeepSparse API. Similar to Hugging Face Pipelines, it wraps Engine with task-specific pre-processing and post-processing steps, allowing you to make requests on raw data and receive post-processed predictions. + +- **Server** is a REST API wrapper around Pipelines built on FastAPI and Uvicorn. It enables you to start a model serving endpoint running DeepSparse with a single CLI. + +This directory offers examples using each API in various supported tasks. 
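The Pipeline and Server examples below cover the two higher-level interfaces. For reference, here is a minimal Engine sketch as well; it mirrors the Engine example in the main README and reuses the same sentiment-analysis SparseZoo stub, so treat it as an illustration rather than a prescriptive recipe:

```python
from deepsparse import Engine
from deepsparse.utils import generate_random_inputs, model_to_path

# download the ONNX model identified by the SparseZoo stub and compile it
zoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none"
batch_size = 1
compiled_model = Engine(model=zoo_stub, batch_size=batch_size)

# Engine works on raw tensors: randomly generated inputs in, raw scores out
inputs = generate_random_inputs(model_to_path(zoo_stub), batch_size)
output = compiled_model(inputs)
print(output)  # raw scores, e.g. [array([[-0.34, 0.09]], dtype=float32)]
```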
+ +### Supported Tasks + +DeepSparse supports the following tasks out of the box: + +| NLP | CV | +|-----------------------|---------------------------| +| [Text Classification `"text-classification"`](nlp/text-classification.md) | [Image Classification `"image_classification"`](cv/image-classification.md) | +| [Token Classification `"token-classification"`](nlp/token-classification.md) | [Object Detection `"yolo"`](cv/object-detection-yolov5.md) | +| [Sentiment Analysis `"sentiment-analysis"`](nlp/sentiment-analysis.md) | [Instance Segmentation `"yolact"`](cv/image-segmentation-yolact.md) | +| [Question Answering `"question-answering"`](nlp/question-answering.md) | | +| [Zero-Shot Text Classification `"zero-shot-text-classification"`](nlp/zero-shot-text-classification.md) | | +| [Embedding Extraction `"transformers_embedding_extraction"`](nlp/transformers-embedding-extraction.md) | | + +### Examples + +**Pipeline Example** | Sentiment Analysis + +Here's an example of how a task is used to create a Pipeline: + +```python +from deepsparse import Pipeline + +pipeline = Pipeline.create( + task="sentiment_analysis", + model_path="zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none") + +print(pipeline("I love DeepSparse Pipelines!")) +# labels=['positive'] scores=[0.998009443283081] +``` + +**Server Example** | Sentiment Analysis + +Here's an example of how a task is used to create a Server: + +```bash +deepsparse.server \ + --task sentiment_analysis \ + --model_path zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none +``` + +Making a request: + +```python +import requests + +# Uvicorn is running on this port +url = 'http://0.0.0.0:5543/predict' + +# send the data +obj = {"sequences": "Sending requests to DeepSparse Server is fast and easy!"} +resp = requests.post(url=url, json=obj) + +# recieve the post-processed output +print(resp.text) +# >> {"labels":["positive"],"scores":[0.9330279231071472]} +``` + +### Additional Resources + +- [Custom Tasks](../user-guide/deepsparse-pipelines.md#custom-use-case) +- [Pipeline User Guide](../user-guide/deepsparse-pipelines.md) +- [Server User Guide](../user-guide/deepsparse-server.md) diff --git a/docs/use-cases/cv/embedding-extraction.md b/docs/use-cases/cv/embedding-extraction.md new file mode 100644 index 0000000000..ff7e9f7ad1 --- /dev/null +++ b/docs/use-cases/cv/embedding-extraction.md @@ -0,0 +1,130 @@ + + +# Deploying Embedding Extraction Models With DeepSparse +This page explains how to deploy an Embedding Extraction Pipeline with DeepSparse. + +## Installation Requirements +This use case requires the installation of [DeepSparse Server](../../user-guide/installation.md). + +Confirm your machine is compatible with our [hardware requirements](../../user-guide/hardware-support.md). + +## Model Format +The Embedding Extraction Pipeline enables you to generate embeddings in any domain, meaning you can use it with any ONNX model. It (optionally) removes the projection head from the model, such that you can re-use SparseZoo models and custom models you have trained in the embedding extraction scenario. + +There are two options for passing a model to the Embedding Extraction Pipeline: + +- Pass a Local ONNX File +- Pass a SparseZoo Stub (which identifies an ONNX model in the SparseZoo) + +## DeepSparse Pipelines +Pipeline is the default interface for interacting with DeepSparse. 
+ +Like Hugging Face Pipelines, DeepSparse Pipelines wrap pre- and post-processing around the inference performed by the Engine. This creates a clean API that allows you to pass raw text and images to DeepSparse and receive the post-processed predictions, making it easy to add DeepSparse to your application. + +We will use the `Pipeline.create()` constructor to create an instance of an embedding extraction Pipeline with a 95% pruned-quantized version of ResNet-50 trained on `imagenet`. We can then pass images the `Pipeline` and receive the embeddings. All of the pre-processing is handled by the `Pipeline`. + +The Embedding Extraction Pipeline handles some useful actions around inference: + +- First, on initialization, the Pipeline (optionally) removes a projection head from a model. You can use the `emb_extraction_layer` argument to specify which layer to return. If your ONNX model has no projection head, you can set `emb_extraction_layer=None` (the default) to skip this step. + +- Second, as with all DeepSparse Pipelines, it handles pre-processing such that you can pass raw input. You will notice that in addition to the typical task argument used in `Pipeline.create()`, the Embedding Extraction Pipeline includes a `base_task` argument. This argument tells the Pipeline the domain of the model, such that the Pipeline can figure out what pre-processing to do. + +Download an image to use with the Pipeline. +```bash +wget https://huggingface.co/spaces/neuralmagic/image-classification/resolve/main/lion.jpeg +``` + +This is an example of extracting the last layer from ResNet-50: + +```python +from deepsparse import Pipeline + +# this step removes the projection head before compiling the model +rn50_embedding_pipeline = Pipeline.create( + task="embedding-extraction", + base_task="image-classification", # tells the pipeline to expect images and normalize input with ImageNet means/stds + model_path="zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95_quant-none", + emb_extraction_layer=-3, # extracts last layer before projection head and softmax +) + +# this step runs pre-processing, inference and returns an embedding +embedding = rn50_embedding_pipeline(images="lion.jpeg") +print(len(embedding.embeddings[0][0])) +# 2048 << size of final layer>> +``` + +### Cross Use Case Functionality +Check out the [Pipeline User Guide](../../user-guide/deepsparse-pipelines.md) for more details on configuring the Pipeline. + +## DeepSparse Server +As an alternative to the Python API, DeepSparse Server allows you to serve an Embedding Extraction Pipeline over HTTP. Configuring the server uses the same parameters and schemas as the Pipelines. + +Once launched, a `/docs` endpoint is created with full endpoint descriptions and support for making sample requests. 
+ +This configuration file sets `emb_extraction_layer` to -3: +```yaml +# config.yaml +endpoints: + - task: embedding_extraction + model: zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95_quant-none + kwargs: + base_task: image_classification + emb_extraction_layer: -3 +``` +Spin up the server: +```bash +deepsparse.server --config_file config.yaml +``` + +Make requests to the server: +```python +import requests, json +url = "http://0.0.0.0:5543/predict/from_files" +paths = ["lion.jpeg"] +files = [("request", open(img, 'rb')) for img in paths] +resp = requests.post(url=url, files=files) +result = json.loads(resp.text) + +print(len(result["embeddings"][0][0])) + +# 2048 << size of final layer>> +``` +## Using a Custom ONNX File +Apart from using models from the SparseZoo, DeepSparse allows you to define custom ONNX files for embedding extraction. + +The first step is to obtain the ONNX model. You can obtain the file by converting your model to ONNX after training. +Click Download on the [ResNet-50 - ImageNet page](https://sparsezoo.neuralmagic.com/models/cv%2Fclassification%2Fresnet_v1-50%2Fpytorch%2Fsparseml%2Fimagenet%2Fpruned95_uniform_quant-none) to download a ONNX ResNet model for demonstration. + +Extract the downloaded file and use the ResNet-50 ONNX model for embedding extraction: +```python +from deepsparse import Pipeline + +# this step removes the projection head before compiling the model +rn50_embedding_pipeline = Pipeline.create( + task="embedding-extraction", + base_task="image-classification", # tells the pipeline to expect images and normalize input with ImageNet means/stds + model_path="resnet.onnx", + emb_extraction_layer=-3, # extracts last layer before projection head and softmax +) + +# this step runs pre-processing, inference and returns an embedding +embedding = rn50_embedding_pipeline(images="lion.jpeg") +print(len(embedding.embeddings[0][0])) +# 2048 +``` +### Cross Use Case Functionality +Check out the [Server User Guide](../../user-guide/deepsparse-server.md) for more details on configuring the Server. diff --git a/docs/use-cases/cv/image-classification.md b/docs/use-cases/cv/image-classification.md new file mode 100644 index 0000000000..6d99374dd2 --- /dev/null +++ b/docs/use-cases/cv/image-classification.md @@ -0,0 +1,283 @@ + + +# Deploying Image Classification Models with DeepSparse + +This page explains how to benchmark and deploy an image classification model with DeepSparse. + +There are three interfaces for interacting with DeepSparse: +- **Engine** is the lowest-level API that enables you to compile a model and run inference on raw input tensors. + +- **Pipeline** is the default DeepSparse API. Similar to Hugging Face Pipelines, it wraps Engine with pre-processing +and post-processing steps, allowing you to make requests on raw data and receive post-processed predictions. + +- **Server** is a REST API wrapper around Pipelines built on [FastAPI](https://fastapi.tiangolo.com/) and [Uvicorn](https://www.uvicorn.org/). It enables you to start a model serving +endpoint running DeepSparse with a single CLI. + +This example uses ResNet-50. For a full list of pre-sparsified image classification models, [check out the SparseZoo](https://sparsezoo.neuralmagic.com/?domain=cv&sub_domain=classification&page=1). + +## Installation Requirements + +This use case requires the installation of [DeepSparse Server](../../user-guide/installation.md). 
+ +Confirm your machine is compatible with our [hardware requirements](../../user-guide/hardware-support.md). + +## Benchmarking + +We can use the benchmarking utility to demonstrate the DeepSparse's performance. We ran the numbers below on an AWS `c6i.2xlarge` instance (4 cores). + +### ONNX Runtime Baseline + +As a baseline, let's check out ONNX Runtime's performance on ResNet-50. Make sure you have ORT installed (`pip install onnxruntime`). + +```bash +deepsparse.benchmark \ + zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/base-none \ + -b 64 -s sync -nstreams 1 \ + -e onnxruntime + +> Original Model Path: zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/base-none +> Batch Size: 64 +> Scenario: sync +> Throughput (items/sec): 71.83 +``` +ONNX Runtime achieves 72 items/second with batch 64. + +### DeepSparse Speedup + +Now, let's run DeepSparse on an inference-optimized sparse version of ResNet-50. This model has been 95% pruned, while retaining >99% accuracy of the dense baseline on the `imagenet` dataset. + +```bash +deepsparse.benchmark \ + zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95_quant-none \ + -b 64 -s sync -nstreams 1 \ + -e deepsparse + +> Original Model Path: zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95_quant-none +> Batch Size: 64 +> Scenario: sync +> Throughput (items/sec): 345.69 +``` + +DeepSparse achieves 346 items/second, an 4.8x speed-up over ONNX Runtime! + +## DeepSparse Engine +Engine is the lowest-level API for interacting with DeepSparse. As much as possible, we recommended using the Pipeline API but Engine is available if you want to handle pre- or post-processing yourself. + +With Engine, we can compile an ONNX file and run inference on raw tensors. + +Here's an example, using a 95% pruned-quantized ResNet-50 trained on `imagenet` from SparseZoo: +```python +from deepsparse import Engine +from deepsparse.utils import generate_random_inputs, model_to_path +import numpy as np + +# download onnx from sparsezoo and compile with batchsize 1 +sparsezoo_stub = "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95_quant-none" +batch_size = 1 +compiled_model = Engine( + model=sparsezoo_stub, # sparsezoo stub or path to local ONNX + batch_size=batch_size # defaults to batch size 1 +) + +# input is raw numpy tensors, output is raw scores for classes +inputs = generate_random_inputs(model_to_path(sparsezoo_stub), batch_size) +output = compiled_model(inputs) +print(output) + +# [array([[-7.73529887e-01, 1.67251182e+00, -1.68212160e-01, +# .... +# 1.26290070e-05, 2.30549040e-06, 2.97072188e-06, 1.90549777e-04]], dtype=float32)] +``` +## DeepSparse Pipelines +Pipeline is the default interface for interacting with DeepSparse. + +Like Hugging Face Pipelines, DeepSparse Pipelines wrap pre- and post-processing around the inference performed by the Engine. This creates a clean API that allows you to pass raw text and images to DeepSparse and receive the post-processed predictions, making it easy to add DeepSparse to your application. + +Let's start by downloading a sample image: +```bash +wget https://huggingface.co/spaces/neuralmagic/image-classification/resolve/main/lion.jpeg +``` + +We will use the `Pipeline.create()` constructor to create an instance of an image classification Pipeline with a 90% pruned-quantized version of ResNet-50. We can then pass images to the Pipeline and receive the predictions. 
All the pre-processing (such as resizing the images and normalizing the inputs) is handled by the `Pipeline`. + +Passing the image as a JPEG to the Pipeline: + +```python +from deepsparse import Pipeline + +# download onnx from sparsezoo and compile with batch size 1 +sparsezoo_stub = "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95_quant-none" +pipeline = Pipeline.create( + task="image_classification", + model_path=sparsezoo_stub, # sparsezoo stub or path to local ONNX +) + +# run inference on image file +prediction = pipeline(images=["lion.jpeg"]) +print(prediction.labels) +# [291] << class index of "lion" in imagenet +``` + +Passing the image as a numpy array to the Pipeline: + +```python +from deepsparse import Pipeline +from PIL import Image +import numpy as np + +# download onnx from sparsezoo and compile with batch size 1 +sparsezoo_stub = "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95_quant-none" +pipeline = Pipeline.create( + task="image_classification", + model_path=sparsezoo_stub, # sparsezoo stub or path to local ONNX +) + +im = np.array(Image.open("lion.jpeg")) + +# run inference on image file +prediction = pipeline(images=[im]) +print(prediction.labels) + +# [291] << class index of "lion" in imagenet +``` + +### Use Case Specific Arguments +The Image Classification Pipeline contains additional arguments for configuring a `Pipeline`. + +#### Top K + +The `top_k` argument specifies the number of classes to return in the prediction. + +```python +from deepsparse import Pipeline + +sparsezoo_stub = "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95_quant-none" +pipeline = Pipeline.create( + task="image_classification", + model_path=sparsezoo_stub, # sparsezoo stub or path to local ONNX + top_k=3, +) + +# run inference on image file +prediction = pipeline(images="lion.jpeg") +print(prediction.labels) +# labels=[291, 260, 244] +``` +#### Class Names + +The `class_names` argument defines a dictionary containing the desired class mappings. 
+ +```python +from deepsparse import Pipeline + +classes = {0: 'tench, Tinca tinca',1: 'goldfish, Carassius auratus',2: 'great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias',3: 'tiger shark, Galeocerdo cuvieri',4: 'hammerhead, hammerhead shark',5: 'electric ray, crampfish, numbfish, torpedo',6: 'stingray',7: 'cock', 8: 'hen', 9: 'ostrich, Struthio camelus', 10: 'brambling, Fringilla montifringilla', 11: 'goldfinch, Carduelis carduelis', 12: 'house finch, linnet, Carpodacus mexicanus', 13: 'junco, snowbird', 14: 'indigo bunting, indigo finch, indigo bird, Passerina cyanea', 15: 'robin, American robin, Turdus migratorius', 16: 'bulbul', 17: 'jay', 18: 'magpie', 19: 'chickadee', 20: 'water ouzel, dipper', 21: 'kite', 22: 'bald eagle, American eagle, Haliaeetus leucocephalus', 23: 'vulture', 24: 'great grey owl, great gray owl, Strix nebulosa', 25: 'European fire salamander, Salamandra salamandra', 26: 'common newt, Triturus vulgaris', 27: 'eft', 28: 'spotted salamander, Ambystoma maculatum', 29: 'axolotl, mud puppy, Ambystoma mexicanum', 30: 'bullfrog, Rana catesbeiana', 31: 'tree frog, tree-frog', 32: 'tailed frog, bell toad, ribbed toad, tailed toad, Ascaphus trui', 33: 'loggerhead, loggerhead turtle, Caretta caretta', 34: 'leatherback turtle, leatherback, leathery turtle, Dermochelys coriacea', 35: 'mud turtle', 36: 'terrapin', 37: 'box turtle, box tortoise', 38: 'banded gecko', 39: 'common iguana, iguana, Iguana iguana', 40: 'American chameleon, anole, Anolis carolinensis', 41: 'whiptail, whiptail lizard', 42: 'agama', 43: 'frilled lizard, Chlamydosaurus kingi', 44: 'alligator lizard', 45: 'Gila monster, Heloderma suspectum', 46: 'green lizard, Lacerta viridis', 47: 'African chameleon, Chamaeleo chamaeleon', 48: 'Komodo dragon, Komodo lizard, dragon lizard, giant lizard, Varanus komodoensis', 49: 'African crocodile, Nile crocodile, Crocodylus niloticus', 50: 'American alligator, Alligator mississipiensis', 51: 'triceratops', 52: 'thunder snake, worm snake, Carphophis amoenus', 53: 'ringneck snake, ring-necked snake, ring snake', 54: 'hognose snake, puff adder, sand viper', 55: 'green snake, grass snake', 56: 'king snake, kingsnake', 57: 'garter snake, grass snake', 58: 'water snake', 59: 'vine snake', 60: 'night snake, Hypsiglena torquata', 61: 'boa constrictor, Constrictor constrictor', 62: 'rock python, rock snake, Python sebae', 63: 'Indian cobra, Naja naja', 64: 'green mamba', 65: 'sea snake', 66: 'horned viper, cerastes, sand viper, horned asp, Cerastes cornutus', 67: 'diamondback, diamondback rattlesnake, Crotalus adamanteus', 68: 'sidewinder, horned rattlesnake, Crotalus cerastes', 69: 'trilobite', 70: 'harvestman, daddy longlegs, Phalangium opilio', 71: 'scorpion', 72: 'black and gold garden spider, Argiope aurantia', 73: 'barn spider, Araneus cavaticus', 74: 'garden spider, Aranea diademata', 75: 'black widow, Latrodectus mactans', 76: 'tarantula', 77: 'wolf spider, hunting spider', 78: 'tick', 79: 'centipede', 80: 'black grouse', 81: 'ptarmigan', 82: 'ruffed grouse, partridge, Bonasa umbellus', 83: 'prairie chicken, prairie grouse, prairie fowl', 84: 'peacock', 85: 'quail', 86: 'partridge', 87: 'African grey, African gray, Psittacus erithacus', 88: 'macaw', 89: 'sulphur-crested cockatoo, Kakatoe galerita, Cacatua galerita', 90: 'lorikeet', 91: 'coucal', 92: 'bee eater', 93: 'hornbill', 94: 'hummingbird', 95: 'jacamar', 96: 'toucan', 97: 'drake', 98: 'red-breasted merganser, Mergus serrator', 99: 'goose', 100: 'black swan, Cygnus atratus', 101: 
'tusker', 102: 'echidna, spiny anteater, anteater', 103: 'platypus, duckbill, duckbilled platypus, duck-billed platypus, Ornithorhynchus anatinus', 104: 'wallaby, brush kangaroo', 105: 'koala, koala bear, kangaroo bear, native bear, Phascolarctos cinereus', 106: 'wombat', 107: 'jellyfish', 108: 'sea anemone, anemone', 109: 'brain coral', 110: 'flatworm, platyhelminth', 111: 'nematode, nematode worm, roundworm', 112: 'conch', 113: 'snail', 114: 'slug', 115: 'sea slug, nudibranch', 116: 'chiton, coat-of-mail shell, sea cradle, polyplacophore', 117: 'chambered nautilus, pearly nautilus, nautilus', 118: 'Dungeness crab, Cancer magister', 119: 'rock crab, Cancer irroratus', 120: 'fiddler crab', 121: 'king crab, Alaska crab, Alaskan king crab, Alaska king crab, Paralithodes camtschatica', 122: 'American lobster, Northern lobster, Maine lobster, Homarus americanus', 123: 'spiny lobster, langouste, rock lobster, crawfish, crayfish, sea crawfish', 124: 'crayfish, crawfish, crawdad, crawdaddy', 125: 'hermit crab', 126: 'isopod', 127: 'white stork, Ciconia ciconia', 128: 'black stork, Ciconia nigra', 129: 'spoonbill', 130: 'flamingo', 131: 'little blue heron, Egretta caerulea', 132: 'American egret, great white heron, Egretta albus', 133: 'bittern', 134: 'crane', 135: 'limpkin, Aramus pictus', 136: 'European gallinule, Porphyrio porphyrio', 137: 'American coot, marsh hen, mud hen, water hen, Fulica americana', 138: 'bustard', 139: 'ruddy turnstone, Arenaria interpres', 140: 'red-backed sandpiper, dunlin, Erolia alpina', 141: 'redshank, Tringa totanus', 142: 'dowitcher', 143: 'oystercatcher, oyster catcher', 144: 'pelican', 145: 'king penguin, Aptenodytes patagonica', 146: 'albatross, mollymawk', 147: 'grey whale, gray whale, devilfish, Eschrichtius gibbosus, Eschrichtius robustus', 148: 'killer whale, killer, orca, grampus, sea wolf, Orcinus orca', 149: 'dugong, Dugong dugon', 150: 'sea lion', 151: 'Chihuahua', 152: 'Japanese spaniel', 153: 'Maltese dog, Maltese terrier, Maltese', 154: 'Pekinese, Pekingese, Peke', 155: 'Shih-Tzu', 156: 'Blenheim spaniel', 157: 'papillon', 158: 'toy terrier', 159: 'Rhodesian ridgeback', 160: 'Afghan hound, Afghan', 161: 'basset, basset hound', 162: 'beagle', 163: 'bloodhound, sleuthhound', 164: 'bluetick', 165: 'black-and-tan coonhound', 166: 'Walker hound, Walker foxhound', 167: 'English foxhound', 168: 'redbone', 169: 'borzoi, Russian wolfhound', 170: 'Irish wolfhound', 171: 'Italian greyhound', 172: 'whippet', 173: 'Ibizan hound, Ibizan Podenco', 174: 'Norwegian elkhound, elkhound', 175: 'otterhound, otter hound', 176: 'Saluki, gazelle hound', 177: 'Scottish deerhound, deerhound', 178: 'Weimaraner', 179: 'Staffordshire bullterrier, Staffordshire bull terrier', 180: 'American Staffordshire terrier, Staffordshire terrier, American pit bull terrier, pit bull terrier', 181: 'Bedlington terrier', 182: 'Border terrier', 183: 'Kerry blue terrier', 184: 'Irish terrier', 185: 'Norfolk terrier', 186: 'Norwich terrier', 187: 'Yorkshire terrier', 188: 'wire-haired fox terrier', 189: 'Lakeland terrier', 190: 'Sealyham terrier, Sealyham', 191: 'Airedale, Airedale terrier', 192: 'cairn, cairn terrier', 193: 'Australian terrier', 194: 'Dandie Dinmont, Dandie Dinmont terrier', 195: 'Boston bull, Boston terrier', 196: 'miniature schnauzer', 197: 'giant schnauzer', 198: 'standard schnauzer', 199: 'Scotch terrier, Scottish terrier, Scottie', 200: 'Tibetan terrier, chrysanthemum dog', 201: 'silky terrier, Sydney silky', 202: 'soft-coated wheaten terrier', 203: 'West Highland white 
terrier', 204: 'Lhasa, Lhasa apso', 205: 'flat-coated retriever', 206: 'curly-coated retriever', 207: 'golden retriever', 208: 'Labrador retriever', 209: 'Chesapeake Bay retriever', 210: 'German short-haired pointer', 211: 'vizsla, Hungarian pointer', 212: 'English setter', 213: 'Irish setter, red setter', 214: 'Gordon setter', 215: 'Brittany spaniel', 216: 'clumber, clumber spaniel', 217: 'English springer, English springer spaniel', 218: 'Welsh springer spaniel', 219: 'cocker spaniel, English cocker spaniel, cocker', 220: 'Sussex spaniel', 221: 'Irish water spaniel', 222: 'kuvasz', 223: 'schipperke', 224: 'groenendael', 225: 'malinois', 226: 'briard', 227: 'kelpie', 228: 'komondor', 229: 'Old English sheepdog, bobtail', 230: 'Shetland sheepdog, Shetland sheep dog, Shetland', 231: 'collie', 232: 'Border collie', 233: 'Bouvier des Flandres, Bouviers des Flandres', 234: 'Rottweiler', 235: 'German shepherd, German shepherd dog, German police dog, alsatian', 236: 'Doberman, Doberman pinscher', 237: 'miniature pinscher', 238: 'Greater Swiss Mountain dog', 239: 'Bernese mountain dog', 240: 'Appenzeller', 241: 'EntleBucher', 242: 'boxer', 243: 'bull mastiff', 244: 'Tibetan mastiff', 245: 'French bulldog', 246: 'Great Dane', 247: 'Saint Bernard, St Bernard', 248: 'Eskimo dog, husky', 249: 'malamute, malemute, Alaskan malamute', 250: 'Siberian husky', 251: 'dalmatian, coach dog, carriage dog', 252: 'affenpinscher, monkey pinscher, monkey dog', 253: 'basenji', 254: 'pug, pug-dog', 255: 'Leonberg', 256: 'Newfoundland, Newfoundland dog', 257: 'Great Pyrenees', 258: 'Samoyed, Samoyede', 259: 'Pomeranian', 260: 'chow, chow chow', 261: 'keeshond', 262: 'Brabancon griffon', 263: 'Pembroke, Pembroke Welsh corgi', 264: 'Cardigan, Cardigan Welsh corgi', 265: 'toy poodle', 266: 'miniature poodle', 267: 'standard poodle', 268: 'Mexican hairless', 269: 'timber wolf, grey wolf, gray wolf, Canis lupus', 270: 'white wolf, Arctic wolf, Canis lupus tundrarum', 271: 'red wolf, maned wolf, Canis rufus, Canis niger', 272: 'coyote, prairie wolf, brush wolf, Canis latrans', 273: 'dingo, warrigal, warragal, Canis dingo', 274: 'dhole, Cuon alpinus', 275: 'African hunting dog, hyena dog, Cape hunting dog, Lycaon pictus', 276: 'hyena, hyaena', 277: 'red fox, Vulpes vulpes', 278: 'kit fox, Vulpes macrotis', 279: 'Arctic fox, white fox, Alopex lagopus', 280: 'grey fox, gray fox, Urocyon cinereoargenteus', 281: 'tabby, tabby cat', 282: 'tiger cat', 283: 'Persian cat', 284: 'Siamese cat, Siamese', 285: 'Egyptian cat', 286: 'cougar, puma, catamount, mountain lion, painter, panther, Felis concolor', 287: 'lynx, catamount', 288: 'leopard, Panthera pardus', 289: 'snow leopard, ounce, Panthera uncia', 290: 'jaguar, panther, Panthera onca, Felis onca', 291: 'lion, king of beasts, Panthera leo', 292: 'tiger, Panthera tigris', 293: 'cheetah, chetah, Acinonyx jubatus', 294: 'brown bear, bruin, Ursus arctos', 295: 'American black bear, black bear, Ursus americanus, Euarctos americanus', 296: 'ice bear, polar bear, Ursus Maritimus, Thalarctos maritimus', 297: 'sloth bear, Melursus ursinus, Ursus ursinus', 298: 'mongoose', 299: 'meerkat, mierkat', 300: 'tiger beetle', 301: 'ladybug, ladybeetle, lady beetle, ladybird, ladybird beetle', 302: 'ground beetle, carabid beetle', 303: 'long-horned beetle, longicorn, longicorn beetle', 304: 'leaf beetle, chrysomelid', 305: 'dung beetle', 306: 'rhinoceros beetle', 307: 'weevil', 308: 'fly', 309: 'bee', 310: 'ant, emmet, pismire', 311: 'grasshopper, hopper', 312: 'cricket', 313: 'walking stick, 
walkingstick, stick insect', 314: 'cockroach, roach', 315: 'mantis, mantid', 316: 'cicada, cicala', 317: 'leafhopper', 318: 'lacewing, lacewing fly', 319: "dragonfly, darning needle, devil's darning needle, sewing needle, snake feeder, snake doctor, mosquito hawk, skeeter hawk", 320: 'damselfly', 321: 'admiral', 322: 'ringlet, ringlet butterfly', 323: 'monarch, monarch butterfly, milkweed butterfly, Danaus plexippus', 324: 'cabbage butterfly', 325: 'sulphur butterfly, sulfur butterfly', 326: 'lycaenid, lycaenid butterfly', 327: 'starfish, sea star', 328: 'sea urchin', 329: 'sea cucumber, holothurian', 330: 'wood rabbit, cottontail, cottontail rabbit', 331: 'hare', 332: 'Angora, Angora rabbit', 333: 'hamster', 334: 'porcupine, hedgehog', 335: 'fox squirrel, eastern fox squirrel, Sciurus niger', 336: 'marmot', 337: 'beaver', 338: 'guinea pig, Cavia cobaya', 339: 'sorrel', 340: 'zebra', 341: 'hog, pig, grunter, squealer, Sus scrofa', 342: 'wild boar, boar, Sus scrofa', 343: 'warthog', 344: 'hippopotamus, hippo, river horse, Hippopotamus amphibius', 345: 'ox', 346: 'water buffalo, water ox, Asiatic buffalo, Bubalus bubalis', 347: 'bison', 348: 'ram, tup', 349: 'bighorn, bighorn sheep, cimarron, Rocky Mountain bighorn, Rocky Mountain sheep, Ovis canadensis', 350: 'ibex, Capra ibex', 351: 'hartebeest', 352: 'impala, Aepyceros melampus', 353: 'gazelle', 354: 'Arabian camel, dromedary, Camelus dromedarius', 355: 'llama', 356: 'weasel', 357: 'mink', 358: 'polecat, fitch, foulmart, foumart, Mustela putorius', 359: 'black-footed ferret, ferret, Mustela nigripes', 360: 'otter', 361: 'skunk, polecat, wood pussy', 362: 'badger', 363: 'armadillo', 364: 'three-toed sloth, ai, Bradypus tridactylus', 365: 'orangutan, orang, orangutang, Pongo pygmaeus', 366: 'gorilla, Gorilla gorilla', 367: 'chimpanzee, chimp, Pan troglodytes', 368: 'gibbon, Hylobates lar', 369: 'siamang, Hylobates syndactylus, Symphalangus syndactylus', 370: 'guenon, guenon monkey', 371: 'patas, hussar monkey, Erythrocebus patas', 372: 'baboon', 373: 'macaque', 374: 'langur', 375: 'colobus, colobus monkey', 376: 'proboscis monkey, Nasalis larvatus', 377: 'marmoset', 378: 'capuchin, ringtail, Cebus capucinus', 379: 'howler monkey, howler', 380: 'titi, titi monkey', 381: 'spider monkey, Ateles geoffroyi', 382: 'squirrel monkey, Saimiri sciureus', 383: 'Madagascar cat, ring-tailed lemur, Lemur catta', 384: 'indri, indris, Indri indri, Indri brevicaudatus', 385: 'Indian elephant, Elephas maximus', 386: 'African elephant, Loxodonta africana', 387: 'lesser panda, red panda, panda, bear cat, cat bear, Ailurus fulgens', 388: 'giant panda, panda, panda bear, coon bear, Ailuropoda melanoleuca', 389: 'barracouta, snoek', 390: 'eel', 391: 'coho, cohoe, coho salmon, blue jack, silver salmon, Oncorhynchus kisutch', 392: 'rock beauty, Holocanthus tricolor', 393: 'anemone fish', 394: 'sturgeon', 395: 'gar, garfish, garpike, billfish, Lepisosteus osseus', 396: 'lionfish', 397: 'puffer, pufferfish, blowfish, globefish', 398: 'abacus', 399: 'abaya', 400: "academic gown, academic robe, judge's robe", 401: 'accordion, piano accordion, squeeze box', 402: 'acoustic guitar', 403: 'aircraft carrier, carrier, flattop, attack aircraft carrier', 404: 'airliner', 405: 'airship, dirigible', 406: 'altar', 407: 'ambulance', 408: 'amphibian, amphibious vehicle', 409: 'analog clock', 410: 'apiary, bee house', 411: 'apron', 412: 'ashcan, trash can, garbage can, wastebin, ash bin, ash-bin, ashbin, dustbin, trash barrel, trash bin', 413: 'assault rifle, assault gun', 414: 
'backpack, back pack, knapsack, packsack, rucksack, haversack', 415: 'bakery, bakeshop, bakehouse', 416: 'balance beam, beam', 417: 'balloon', 418: 'ballpoint, ballpoint pen, ballpen, Biro', 419: 'Band Aid', 420: 'banjo', 421: 'bannister, banister, balustrade, balusters, handrail', 422: 'barbell', 423: 'barber chair', 424: 'barbershop', 425: 'barn', 426: 'barometer', 427: 'barrel, cask', 428: 'barrow, garden cart, lawn cart, wheelbarrow', 429: 'baseball', 430: 'basketball', 431: 'bassinet', 432: 'bassoon', 433: 'bathing cap, swimming cap', 434: 'bath towel', 435: 'bathtub, bathing tub, bath, tub', 436: 'beach wagon, station wagon, wagon, estate car, beach waggon, station waggon, waggon', 437: 'beacon, lighthouse, beacon light, pharos', 438: 'beaker', 439: 'bearskin, busby, shako', 440: 'beer bottle', 441: 'beer glass', 442: 'bell cote, bell cot', 443: 'bib', 444: 'bicycle-built-for-two, tandem bicycle, tandem', 445: 'bikini, two-piece', 446: 'binder, ring-binder', 447: 'binoculars, field glasses, opera glasses', 448: 'birdhouse', 449: 'boathouse', 450: 'bobsled, bobsleigh, bob', 451: 'bolo tie, bolo, bola tie, bola', 452: 'bonnet, poke bonnet', 453: 'bookcase', 454: 'bookshop, bookstore, bookstall', 455: 'bottlecap', 456: 'bow', 457: 'bow tie, bow-tie, bowtie', 458: 'brass, memorial tablet, plaque', 459: 'brassiere, bra, bandeau', 460: 'breakwater, groin, groyne, mole, bulwark, seawall, jetty', 461: 'breastplate, aegis, egis', 462: 'broom', 463: 'bucket, pail', 464: 'buckle', 465: 'bulletproof vest', 466: 'bullet train, bullet', 467: 'butcher shop, meat market', 468: 'cab, hack, taxi, taxicab', 469: 'caldron, cauldron', 470: 'candle, taper, wax light', 471: 'cannon', 472: 'canoe', 473: 'can opener, tin opener', 474: 'cardigan', 475: 'car mirror', 476: 'carousel, carrousel, merry-go-round, roundabout, whirligig', 477: "carpenter's kit, tool kit", 478: 'carton', 479: 'car wheel', 480: 'cash machine, cash dispenser, automated teller machine, automatic teller machine, automated teller, automatic teller, ATM', 481: 'cassette', 482: 'cassette player', 483: 'castle', 484: 'catamaran', 485: 'CD player', 486: 'cello, violoncello', 487: 'cellular telephone, cellular phone, cellphone, cell, mobile phone', 488: 'chain', 489: 'chainlink fence', 490: 'chain mail, ring mail, mail, chain armor, chain armour, ring armor, ring armour', 491: 'chain saw, chainsaw', 492: 'chest', 493: 'chiffonier, commode', 494: 'chime, bell, gong', 495: 'china cabinet, china closet', 496: 'Christmas stocking', 497: 'church, church building', 498: 'cinema, movie theater, movie theatre, movie house, picture palace', 499: 'cleaver, meat cleaver, chopper', 500: 'cliff dwelling', 501: 'cloak', 502: 'clog, geta, patten, sabot', 503: 'cocktail shaker', 504: 'coffee mug', 505: 'coffeepot', 506: 'coil, spiral, volute, whorl, helix', 507: 'combination lock', 508: 'computer keyboard, keypad', 509: 'confectionery, confectionary, candy store', 510: 'container ship, containership, container vessel', 511: 'convertible', 512: 'corkscrew, bottle screw', 513: 'cornet, horn, trumpet, trump', 514: 'cowboy boot', 515: 'cowboy hat, ten-gallon hat', 516: 'cradle', 517: 'crane', 518: 'crash helmet', 519: 'crate', 520: 'crib, cot', 521: 'Crock Pot', 522: 'croquet ball', 523: 'crutch', 524: 'cuirass', 525: 'dam, dike, dyke', 526: 'desk', 527: 'desktop computer', 528: 'dial telephone, dial phone', 529: 'diaper, nappy, napkin', 530: 'digital clock', 531: 'digital watch', 532: 'dining table, board', 533: 'dishrag, dishcloth', 534: 'dishwasher, dish 
washer, dishwashing machine', 535: 'disk brake, disc brake', 536: 'dock, dockage, docking facility', 537: 'dogsled, dog sled, dog sleigh', 538: 'dome', 539: 'doormat, welcome mat', 540: 'drilling platform, offshore rig', 541: 'drum, membranophone, tympan', 542: 'drumstick', 543: 'dumbbell', 544: 'Dutch oven', 545: 'electric fan, blower', 546: 'electric guitar', 547: 'electric locomotive', 548: 'entertainment center', 549: 'envelope', 550: 'espresso maker', 551: 'face powder', 552: 'feather boa, boa', 553: 'file, file cabinet, filing cabinet', 554: 'fireboat', 555: 'fire engine, fire truck', 556: 'fire screen, fireguard', 557: 'flagpole, flagstaff', 558: 'flute, transverse flute', 559: 'folding chair', 560: 'football helmet', 561: 'forklift', 562: 'fountain', 563: 'fountain pen', 564: 'four-poster', 565: 'freight car', 566: 'French horn, horn', 567: 'frying pan, frypan, skillet', 568: 'fur coat', 569: 'garbage truck, dustcart', 570: 'gasmask, respirator, gas helmet', 571: 'gas pump, gasoline pump, petrol pump, island dispenser', 572: 'goblet', 573: 'go-kart', 574: 'golf ball', 575: 'golfcart, golf cart', 576: 'gondola', 577: 'gong, tam-tam', 578: 'gown', 579: 'grand piano, grand', 580: 'greenhouse, nursery, glasshouse', 581: 'grille, radiator grille', 582: 'grocery store, grocery, food market, market', 583: 'guillotine', 584: 'hair slide', 585: 'hair spray', 586: 'half track', 587: 'hammer', 588: 'hamper', 589: 'hand blower, blow dryer, blow drier, hair dryer, hair drier', 590: 'hand-held computer, hand-held microcomputer', 591: 'handkerchief, hankie, hanky, hankey', 592: 'hard disc, hard disk, fixed disk', 593: 'harmonica, mouth organ, harp, mouth harp', 594: 'harp', 595: 'harvester, reaper', 596: 'hatchet', 597: 'holster', 598: 'home theater, home theatre', 599: 'honeycomb', 600: 'hook, claw', 601: 'hoopskirt, crinoline', 602: 'horizontal bar, high bar', 603: 'horse cart, horse-cart', 604: 'hourglass', 605: 'iPod', 606: 'iron, smoothing iron', 607: "jack-o'-lantern", 608: 'jean, blue jean, denim', 609: 'jeep, landrover', 610: 'jersey, T-shirt, tee shirt', 611: 'jigsaw puzzle', 612: 'jinrikisha, ricksha, rickshaw', 613: 'joystick', 614: 'kimono', 615: 'knee pad', 616: 'knot', 617: 'lab coat, laboratory coat', 618: 'ladle', 619: 'lampshade, lamp shade', 620: 'laptop, laptop computer', 621: 'lawn mower, mower', 622: 'lens cap, lens cover', 623: 'letter opener, paper knife, paperknife', 624: 'library', 625: 'lifeboat', 626: 'lighter, light, igniter, ignitor', 627: 'limousine, limo', 628: 'liner, ocean liner', 629: 'lipstick, lip rouge', 630: 'Loafer', 631: 'lotion', 632: 'loudspeaker, speaker, speaker unit, loudspeaker system, speaker system', 633: "loupe, jeweler's loupe", 634: 'lumbermill, sawmill', 635: 'magnetic compass', 636: 'mailbag, postbag', 637: 'mailbox, letter box', 638: 'maillot', 639: 'maillot, tank suit', 640: 'manhole cover', 641: 'maraca', 642: 'marimba, xylophone', 643: 'mask', 644: 'matchstick', 645: 'maypole', 646: 'maze, labyrinth', 647: 'measuring cup', 648: 'medicine chest, medicine cabinet', 649: 'megalith, megalithic structure', 650: 'microphone, mike', 651: 'microwave, microwave oven', 652: 'military uniform', 653: 'milk can', 654: 'minibus', 655: 'miniskirt, mini', 656: 'minivan', 657: 'missile', 658: 'mitten', 659: 'mixing bowl', 660: 'mobile home, manufactured home', 661: 'Model T', 662: 'modem', 663: 'monastery', 664: 'monitor', 665: 'moped', 666: 'mortar', 667: 'mortarboard', 668: 'mosque', 669: 'mosquito net', 670: 'motor scooter, scooter', 671: 'mountain bike, 
all-terrain bike, off-roader', 672: 'mountain tent', 673: 'mouse, computer mouse', 674: 'mousetrap', 675: 'moving van', 676: 'muzzle', 677: 'nail', 678: 'neck brace', 679: 'necklace', 680: 'nipple', 681: 'notebook, notebook computer', 682: 'obelisk', 683: 'oboe, hautboy, hautbois', 684: 'ocarina, sweet potato', 685: 'odometer, hodometer, mileometer, milometer', 686: 'oil filter', 687: 'organ, pipe organ', 688: 'oscilloscope, scope, cathode-ray oscilloscope, CRO', 689: 'overskirt', 690: 'oxcart', 691: 'oxygen mask', 692: 'packet', 693: 'paddle, boat paddle', 694: 'paddlewheel, paddle wheel', 695: 'padlock', 696: 'paintbrush', 697: "pajama, pyjama, pj's, jammies", 698: 'palace', 699: 'panpipe, pandean pipe, syrinx', 700: 'paper towel', 701: 'parachute, chute', 702: 'parallel bars, bars', 703: 'park bench', 704: 'parking meter', 705: 'passenger car, coach, carriage', 706: 'patio, terrace', 707: 'pay-phone, pay-station', 708: 'pedestal, plinth, footstall', 709: 'pencil box, pencil case', 710: 'pencil sharpener', 711: 'perfume, essence', 712: 'Petri dish', 713: 'photocopier', 714: 'pick, plectrum, plectron', 715: 'pickelhaube', 716: 'picket fence, paling', 717: 'pickup, pickup truck', 718: 'pier', 719: 'piggy bank, penny bank', 720: 'pill bottle', 721: 'pillow', 722: 'ping-pong ball', 723: 'pinwheel', 724: 'pirate, pirate ship', 725: 'pitcher, ewer', 726: "plane, carpenter's plane, woodworking plane", 727: 'planetarium', 728: 'plastic bag', 729: 'plate rack', 730: 'plow, plough', 731: "plunger, plumber's helper", 732: 'Polaroid camera, Polaroid Land camera', 733: 'pole', 734: 'police van, police wagon, paddy wagon, patrol wagon, wagon, black Maria', 735: 'poncho', 736: 'pool table, billiard table, snooker table', 737: 'pop bottle, soda bottle', 738: 'pot, flowerpot', 739: "potter's wheel", 740: 'power drill', 741: 'prayer rug, prayer mat', 742: 'printer', 743: 'prison, prison house', 744: 'projectile, missile', 745: 'projector', 746: 'puck, hockey puck', 747: 'punching bag, punch bag, punching ball, punchball', 748: 'purse', 749: 'quill, quill pen', 750: 'quilt, comforter, comfort, puff', 751: 'racer, race car, racing car', 752: 'racket, racquet', 753: 'radiator', 754: 'radio, wireless', 755: 'radio telescope, radio reflector', 756: 'rain barrel', 757: 'recreational vehicle, RV, R.V.', 758: 'reel', 759: 'reflex camera', 760: 'refrigerator, icebox', 761: 'remote control, remote', 762: 'restaurant, eating house, eating place, eatery', 763: 'revolver, six-gun, six-shooter', 764: 'rifle', 765: 'rocking chair, rocker', 766: 'rotisserie', 767: 'rubber eraser, rubber, pencil eraser', 768: 'rugby ball', 769: 'rule, ruler', 770: 'running shoe', 771: 'safe', 772: 'safety pin', 773: 'saltshaker, salt shaker', 774: 'sandal', 775: 'sarong', 776: 'sax, saxophone', 777: 'scabbard', 778: 'scale, weighing machine', 779: 'school bus', 780: 'schooner', 781: 'scoreboard', 782: 'screen, CRT screen', 783: 'screw', 784: 'screwdriver', 785: 'seat belt, seatbelt', 786: 'sewing machine', 787: 'shield, buckler', 788: 'shoe shop, shoe-shop, shoe store', 789: 'shoji', 790: 'shopping basket', 791: 'shopping cart', 792: 'shovel', 793: 'shower cap', 794: 'shower curtain', 795: 'ski', 796: 'ski mask', 797: 'sleeping bag', 798: 'slide rule, slipstick', 799: 'sliding door', 800: 'slot, one-armed bandit', 801: 'snorkel', 802: 'snowmobile', 803: 'snowplow, snowplough', 804: 'soap dispenser', 805: 'soccer ball', 806: 'sock', 807: 'solar dish, solar collector, solar furnace', 808: 'sombrero', 809: 'soup bowl', 810: 'space bar', 811: 
'space heater', 812: 'space shuttle', 813: 'spatula', 814: 'speedboat', 815: "spider web, spider's web", 816: 'spindle', 817: 'sports car, sport car', 818: 'spotlight, spot', 819: 'stage', 820: 'steam locomotive', 821: 'steel arch bridge', 822: 'steel drum', 823: 'stethoscope', 824: 'stole', 825: 'stone wall', 826: 'stopwatch, stop watch', 827: 'stove', 828: 'strainer', 829: 'streetcar, tram, tramcar, trolley, trolley car', 830: 'stretcher', 831: 'studio couch, day bed', 832: 'stupa, tope', 833: 'submarine, pigboat, sub, U-boat', 834: 'suit, suit of clothes', 835: 'sundial', 836: 'sunglass', 837: 'sunglasses, dark glasses, shades', 838: 'sunscreen, sunblock, sun blocker', 839: 'suspension bridge', 840: 'swab, swob, mop', 841: 'sweatshirt', 842: 'swimming trunks, bathing trunks', 843: 'swing', 844: 'switch, electric switch, electrical switch', 845: 'syringe', 846: 'table lamp', 847: 'tank, army tank, armored combat vehicle, armoured combat vehicle', 848: 'tape player', 849: 'teapot', 850: 'teddy, teddy bear', 851: 'television, television system', 852: 'tennis ball', 853: 'thatch, thatched roof', 854: 'theater curtain, theatre curtain', 855: 'thimble', 856: 'thresher, thrasher, threshing machine', 857: 'throne', 858: 'tile roof', 859: 'toaster', 860: 'tobacco shop, tobacconist shop, tobacconist', 861: 'toilet seat', 862: 'torch', 863: 'totem pole', 864: 'tow truck, tow car, wrecker', 865: 'toyshop', 866: 'tractor', 867: 'trailer truck, tractor trailer, trucking rig, rig, articulated lorry, semi', 868: 'tray', 869: 'trench coat', 870: 'tricycle, trike, velocipede', 871: 'trimaran', 872: 'tripod', 873: 'triumphal arch', 874: 'trolleybus, trolley coach, trackless trolley', 875: 'trombone', 876: 'tub, vat', 877: 'turnstile', 878: 'typewriter keyboard', 879: 'umbrella', 880: 'unicycle, monocycle', 881: 'upright, upright piano', 882: 'vacuum, vacuum cleaner', 883: 'vase', 884: 'vault', 885: 'velvet', 886: 'vending machine', 887: 'vestment', 888: 'viaduct', 889: 'violin, fiddle', 890: 'volleyball', 891: 'waffle iron', 892: 'wall clock', 893: 'wallet, billfold, notecase, pocketbook', 894: 'wardrobe, closet, press', 895: 'warplane, military plane', 896: 'washbasin, handbasin, washbowl, lavabo, wash-hand basin', 897: 'washer, automatic washer, washing machine', 898: 'water bottle', 899: 'water jug', 900: 'water tower', 901: 'whiskey jug', 902: 'whistle', 903: 'wig', 904: 'window screen', 905: 'window shade', 906: 'Windsor tie', 907: 'wine bottle', 908: 'wing', 909: 'wok', 910: 'wooden spoon', 911: 'wool, woolen, woollen', 912: 'worm fence, snake fence, snake-rail fence, Virginia fence', 913: 'wreck', 914: 'yawl', 915: 'yurt', 916: 'web site, website, internet site, site', 917: 'comic book', 918: 'crossword puzzle, crossword', 919: 'street sign', 920: 'traffic light, traffic signal, stoplight', 921: 'book jacket, dust cover, dust jacket, dust wrapper', 922: 'menu', 923: 'plate', 924: 'guacamole', 925: 'consomme', 926: 'hot pot, hotpot', 927: 'trifle', 928: 'ice cream, icecream', 929: 'ice lolly, lolly, lollipop, popsicle', 930: 'French loaf', 931: 'bagel, beigel', 932: 'pretzel', 933: 'cheeseburger', 934: 'hotdog, hot dog, red hot', 935: 'mashed potato', 936: 'head cabbage', 937: 'broccoli', 938: 'cauliflower', 939: 'zucchini, courgette', 940: 'spaghetti squash', 941: 'acorn squash', 942: 'butternut squash', 943: 'cucumber, cuke', 944: 'artichoke, globe artichoke', 945: 'bell pepper', 946: 'cardoon', 947: 'mushroom', 948: 'Granny Smith', 949: 'strawberry', 950: 'orange', 951: 'lemon', 952: 'fig', 953: 
'pineapple, ananas', 954: 'banana', 955: 'jackfruit, jak, jack', 956: 'custard apple', 957: 'pomegranate', 958: 'hay', 959: 'carbonara', 960: 'chocolate sauce, chocolate syrup', 961: 'dough', 962: 'meat loaf, meatloaf', 963: 'pizza, pizza pie', 964: 'potpie', 965: 'burrito', 966: 'red wine', 967: 'espresso', 968: 'cup', 969: 'eggnog', 970: 'alp', 971: 'bubble', 972: 'cliff, drop, drop-off', 973: 'coral reef', 974: 'geyser', 975: 'lakeside, lakeshore', 976: 'promontory, headland, head, foreland', 977: 'sandbar, sand bar', 978: 'seashore, coast, seacoast, sea-coast', 979: 'valley, vale', 980: 'volcano', 981: 'ballplayer, baseball player', 982: 'groom, bridegroom', 983: 'scuba diver', 984: 'rapeseed', 985: 'daisy', 986: "yellow lady's slipper, yellow lady-slipper, Cypripedium calceolus, Cypripedium parviflorum", 987: 'corn', 988: 'acorn', 989: 'hip, rose hip, rosehip', 990: 'buckeye, horse chestnut, conker', 991: 'coral fungus', 992: 'agaric', 993: 'gyromitra', 994: 'stinkhorn, carrion fungus', 995: 'earthstar', 996: 'hen-of-the-woods, hen of the woods, Polyporus frondosus, Grifola frondosa', 997: 'bolete', 998: 'ear, spike, capitulum', 999: 'toilet tissue, toilet paper, bathroom tissue'} + +# download onnx from sparsezoo and compile with batch size 1 +sparsezoo_stub = "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95_quant-none" +pipeline = Pipeline.create( + task="image_classification", + model_path=sparsezoo_stub, # sparsezoo stub or path to local ONNX + class_names=classes, +) + +# run inference on image file +prediction = pipeline(images="lion.jpeg") +print(prediction.labels) +# labels=['lion, king of beasts, Panthera leo'] +``` +### Cross Use Case Functionality +Check out the [Pipeline User Guide](../../user-guide/deepsparse-pipelines.md) for more details on configuring a Pipeline. + +## DeepSparse Server +Built on the popular FastAPI and Uvicorn stack, DeepSparse Server enables you to set up a REST endpoint for serving inferences over HTTP. Since DeepSparse Server wraps the Pipeline API, it inherits all the utilities provided by Pipelines. + +The CLI command below launches an image classification pipeline with a 95% pruned ResNet model: + +```bash +deepsparse.server \ + --task image_classification \ + --model_path zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95_quant-none +``` +You should see Uvicorn report that it is running on http://0.0.0.0:5543. Once launched, a /docs path is created with full endpoint descriptions and support for making sample requests. + +Here is an example client request, using the Python requests library for formatting the HTTP: + +```python +import requests + +url = 'http://0.0.0.0:5543/predict/from_files' +path = ['lion.jpeg'] # just put the name of images in here +files = [('request', open(img, 'rb')) for img in path] +resp = requests.post(url=url, files=files) +print(resp.text) +# {"labels":[291],"scores":[24.185693740844727]} +``` +#### Use Case Specific Arguments + +To use a use-case specific argument, create a server configuration file for passing the argument via kwargs. 
+ +This configuration file sets `top_k` classes to 3: +```yaml +# image_classification-config.yaml +endpoints: + - task: image_classification + model: zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95_quant-none + kwargs: + top_k: 3 +``` + +Start the server: +```bash +deepsparse.server --config-file image_classification-config.yaml +``` + +Make a request over HTTP: + +```python +import requests + +url = 'http://0.0.0.0:5543/predict/from_files' +path = ['lion.jpeg'] # just put the name of images in here +files = [('request', open(img, 'rb')) for img in path] +resp = requests.post(url=url, files=files) +print(resp.text) +# {"labels":[291,260,244],"scores":[24.185693740844727,18.982254028320312,16.390701293945312]} +``` +## Using a Custom ONNX File +Apart from using models from the SparseZoo, DeepSparse allows you to define custom ONNX files when deploying a model. + +The first step is to obtain the ONNX model. You can obtain the file by converting your model to ONNX after training. +Click Download on the [ResNet-50 - ImageNet page](https://sparsezoo.neuralmagic.com/models/cv%2Fclassification%2Fresnet_v1-50%2Fpytorch%2Fsparseml%2Fimagenet%2Fpruned95_uniform_quant-none) to download a ONNX ResNet model for demonstration. + +Extract the downloaded file and use the ResNet-50 ONNX model for inference: +```python +from deepsparse import Pipeline + +# download onnx from sparsezoo and compile with batch size 1 +pipeline = Pipeline.create( + task="image_classification", + model_path="resnet.onnx", # sparsezoo stub or path to local ONNX +) + +# run inference on image file +prediction = pipeline(images=["lion.jpeg"]) +print(prediction.labels) +# [291] +``` +### Cross Use Case Functionality + +Check out the [Server User Guide](../../user-guide/deepsparse-server.md) for more details on configuring the Server. diff --git a/docs/use-cases/cv/image-segmentation-yolact.md b/docs/use-cases/cv/image-segmentation-yolact.md new file mode 100644 index 0000000000..cc7be1e044 --- /dev/null +++ b/docs/use-cases/cv/image-segmentation-yolact.md @@ -0,0 +1,251 @@ + + +# Deploying Image Segmentation Models with DeepSparse + +This page explains how to benchmark and deploy an image segmentation with DeepSparse. + +There are three interfaces for interacting with DeepSparse: +- **Engine** is the lowest-level API that enables you to compile a model and run inference on raw input tensors. + +- **Pipeline** is the default DeepSparse API. Similar to Hugging Face Pipelines, it wraps Engine with pre-processing +and post-processing steps, allowing you to make requests on raw data and receive post-processed predictions. + +- **Server** is a REST API wrapper around Pipelines built on [FastAPI](https://fastapi.tiangolo.com/) and [Uvicorn](https://www.uvicorn.org/). It enables you to start a model serving +endpoint running DeepSparse with a single CLI. + +We will walk through an example of each using YOLACT. + +## Installation Requirements + +This use case requires the installation of [DeepSparse Server](../../user-guide/installation.md). + +Confirm your machine is compatible with our [hardware requirements](../../user-guide/hardware-support.md) + +## Benchmarking + +We can use the benchmarking utility to demonstrate the DeepSparse's performance. The numbers below were run on a 4 core `c6i.2xlarge` instance in AWS. + +### ONNX Runtime Baseline + +As a baseline, let's check out ONNX Runtime's performance on YOLACT. Make sure you have ORT installed (`pip install onnxruntime`). 
+ +```bash +deepsparse.benchmark \ + zoo:cv/segmentation/yolact-darknet53/pytorch/dbolya/coco/base-none \ + -b 64 -s sync -nstreams 1 \ + -e onnxruntime + +> Original Model Path: zoo:cv/segmentation/yolact-darknet53/pytorch/dbolya/coco/base-none +> Batch Size: 64 +> Scenario: sync +> Throughput (items/sec): 3.5290 +``` + +ONNX Runtime achieves 3.5 items/second with batch 64. + +### DeepSparse Speedup +Now, let's run DeepSparse on an inference-optimized sparse version of YOLACT. This model has been 82.5% pruned and quantized to INT8, while retaining >99% accuracy of the dense baseline on the `coco` dataset. + +```bash +deepsparse.benchmark \ + zoo:cv/segmentation/yolact-darknet53/pytorch/dbolya/coco/pruned82_quant-none \ + -b 64 -s sync -nstreams 1 \ + -e deepsparse + +> Original Model Path: zoo:cv/segmentation/yolact-darknet53/pytorch/dbolya/coco/pruned82_quant-none +> Batch Size: 64 +> Scenario: sync +> Throughput (items/sec): 23.2061 +``` + +DeepSparse achieves 23 items/second, a 6.6x speed-up over ONNX Runtime! + +## DeepSparse Engine +Engine is the lowest-level API for interacting with DeepSparse. As much as possible, we recommended using the Pipeline API but Engine is available if you want to handle pre- or post-processing yourself. + +With Engine, we can compile an ONNX file and run inference on raw tensors. + +Here's an example, using a 82.5% pruned-quantized YOLACT model from SparseZoo: + +```python +from deepsparse import Engine +from deepsparse.utils import generate_random_inputs, model_to_path +import numpy as np + +# download onnx from sparsezoo and compile with batchsize 1 +sparsezoo_stub = "zoo:cv/segmentation/yolact-darknet53/pytorch/dbolya/coco/pruned82_quant-none" +batch_size = 1 +compiled_model = Engine( + model=sparsezoo_stub, # sparsezoo stub or path to local ONNX + batch_size=batch_size # defaults to batch size 1 +) + +# input is raw numpy tensors, output is raw data +inputs = generate_random_inputs(model_to_path(sparsezoo_stub), batch_size) +output = compiled_model(inputs) + +print(output[0].shape) +print(output) + +# (1, 19248, 4) + +# [array([[[ 0.444973 , -0.02015 , -1.3631972 , -0.9219434 ], +# ... +# 9.50585604e-02, 4.13608968e-01, 1.57236055e-01]]]], dtype=float32)] +``` + +## DeepSparse Pipelines +Pipeline is the default interface for interacting with DeepSparse. + +Like Hugging Face Pipelines, DeepSparse Pipelines wrap pre- and post-processing around the inference performed by the Engine. This creates a clean API that allows you to pass raw text and images to DeepSparse and receive the post-processed predictions, making it easy to add DeepSparse to your application. + +Let's start by downloading a sample image: +```bash +wget https://huggingface.co/spaces/neuralmagic/cv-yolact/resolve/main/thailand.jpeg +``` +We will use the `Pipeline.create()` constructor to create an instance of an image segmentation Pipeline with a 82% pruned-quantized version of YOLACT trained on `coco`. We can then pass images to the `Pipeline` and receive the predictions. All the pre-processing (such as resizing the images) is handled by the `Pipeline`. 
+ +```python +from deepsparse.pipeline import Pipeline + +model_stub = "zoo:cv/segmentation/yolact-darknet53/pytorch/dbolya/coco/pruned82_quant-none" +yolact_pipeline = Pipeline.create( + task="yolact", + model_path=model_stub, +) + +images = ["thailand.jpeg"] +predictions = yolact_pipeline(images=images) +# predictions has attributes `boxes`, `classes`, `masks` and `scores` +predictions.classes[0] +# [20,......, 5] +``` + +### Use Case Specific Arguments +The Image Segmentation Pipeline contains additional arguments for configuring a `Pipeline`. + +#### Classes +The `class_names` argument defines a dictionary containing the desired class mappings. + +```python +from deepsparse.pipeline import Pipeline + +model_stub = "zoo:cv/segmentation/yolact-darknet53/pytorch/dbolya/coco/pruned82_quant-none" + +yolact_pipeline = Pipeline.create( + task="yolact", + model_path=model_stub, + class_names="coco", +) + +images = ["thailand.jpeg"] +predictions = yolact_pipeline(images=images, confidence_threshold=0.2, nms_threshold=0.5) +# predictions has attributes `boxes`, `classes`, `masks` and `scores` +predictions.classes[0] +['elephant','elephant','person',...'zebra','stop sign','bus'] +``` + +### Annotate CLI +You can also use the annotate command to have the engine save an annotated photo on disk. +```bash +deepsparse.instance_segmentation.annotate --source thailand.jpeg #Try --source 0 to annotate your live webcam feed +``` +Running the above command will create an `annotation-results` folder and save the annotated image inside. + +If a `--model_filepath` arg isn't provided, then `zoo:cv/segmentation/yolact-darknet53/pytorch/dbolya/coco/pruned82_quant-none` will be used by default. + +![Annotation Results](images/result-0.jpg) + +### Cross Use Case Functionality +Check out the [Pipeline User Guide](../../user-guide/deepsparse-pipelines.md) for more details on configuring a Pipeline. + +## DeepSparse Server +Built on the popular FastAPI and Uvicorn stack, DeepSparse Server enables you to set up a REST endpoint for serving inferences over HTTP. Since DeepSparse Server wraps the Pipeline API, it inherits all the utilities provided by Pipelines. + +The CLI command below launches an image segmentation pipeline with a 82% pruned-quantized YOLACT model: + +```bash +deepsparse.server \ + --task yolact \ + --model_path "zoo:cv/segmentation/yolact-darknet53/pytorch/dbolya/coco/pruned82_quant-none" --port 5543 +``` +Run inference: +```python +import requests +import json + +url = 'http://0.0.0.0:5543/predict/from_files' +path = ['thailand.jpeg'] # list of images for inference +files = [('request', open(img, 'rb')) for img in path] +resp = requests.post(url=url, files=files) +annotations = json.loads(resp.text) # dictionary of annotation results +boxes, classes, masks, scores = annotations["boxes"], annotations["classes"], annotations["masks"], annotations["scores"] +``` +#### Use Case Specific Arguments +To use the `class_names` argument, create a Server configuration file for passing the argument via kwargs. 
+ +This configuration file sets `class_names` to `coco`: + +```yaml +# yolact-config.yaml +endpoints: + - task: yolact + model: zoo:cv/segmentation/yolact-darknet53/pytorch/dbolya/coco/pruned82_quant-none + kwargs: + class_names: coco +``` +Start the server: +```bash +deepsparse.server --config-file yolact-config.yaml +``` +Run inference: +```python +import requests +import json + +url = 'http://0.0.0.0:5543/predict/from_files' +path = ['thailand.jpeg'] # list of images for inference +files = [('request', open(img, 'rb')) for img in path] +resp = requests.post(url=url, files=files) +annotations = json.loads(resp.text) # dictionary of annotation results +boxes, classes, masks, scores = annotations["boxes"], annotations["classes"], annotations["masks"], annotations["scores"] +``` + +## Using a Custom ONNX File +Apart from using models from the SparseZoo, DeepSparse allows you to define custom ONNX files when deploying a model. + +The first step is to obtain the ONNX model. You can obtain the file by converting your model to ONNX after training. +Click Download on the [YOLCAT page](https://sparsezoo.neuralmagic.com/models/cv%2Fsegmentation%2Fyolact-darknet53%2Fpytorch%2Fdbolya%2Fcoco%2Fpruned82_quant-none) to download a ONNX YOLACT model for demonstration. + +Extract the downloaded file and use the YOLACT ONNX model for inference: +```python +from deepsparse.pipeline import Pipeline + +yolact_pipeline = Pipeline.create( + task="yolact", + model_path="yolact.onnx", +) + +images = ["thailand.jpeg"] +predictions = yolact_pipeline(images=images) +# predictions has attributes `boxes`, `classes`, `masks` and `scores` +predictions.classes[0] +# [20,20, .......0, 0,24] +``` +### Cross Use Case Functionality + +Check out the [Server User Guide](../../user-guide/deepsparse-server.md) for more details on configuring the Server. diff --git a/docs/use-cases/cv/images/result-0.jpg b/docs/use-cases/cv/images/result-0.jpg new file mode 100644 index 0000000000..f485f70677 Binary files /dev/null and b/docs/use-cases/cv/images/result-0.jpg differ diff --git a/docs/use-cases/cv/images/result.jpg b/docs/use-cases/cv/images/result.jpg new file mode 100644 index 0000000000..3a6df579ca Binary files /dev/null and b/docs/use-cases/cv/images/result.jpg differ diff --git a/docs/use-cases/cv/object-detection-yolov5.md b/docs/use-cases/cv/object-detection-yolov5.md new file mode 100644 index 0000000000..1843a4d6ee --- /dev/null +++ b/docs/use-cases/cv/object-detection-yolov5.md @@ -0,0 +1,323 @@ + + +# Deploying YOLOv5 Object Detection Models with DeepSparse + +This page explains how to benchmark and deploy a YOLOv5 object detection model with DeepSparse. + +There are three interfaces for interacting with DeepSparse: +- **Engine** is the lowest-level API that enables you to compile a model and run inference on raw input tensors. + +- **Pipeline** is the default DeepSparse API. Similar to Hugging Face Pipelines, it wraps Engine with pre-processing and post-processing steps, allowing you to make requests on raw data and receive post-processed predictions. + +- **Server** is a REST API wrapper around Pipelines built on [FastAPI](https://fastapi.tiangolo.com/) and [Uvicorn](https://www.uvicorn.org/). It enables you to start a model serving endpoint running DeepSparse with a single CLI. + +This example uses YOLOv5s. For a full list of pre-sparsified object detection models, [check out the SparseZoo](https://sparsezoo.neuralmagic.com/?domain=cv&sub_domain=detection&page=1). 
+ +## Installation Requirements + +This use case requires the installation of [DeepSparse Server and YOLO](../../user-guide/installation.md). + +Confirm your machine is compatible with our [hardware requirements](../../user-guide/hardware-support.md) + +## Benchmarking + +We can use the benchmarking utility to demonstrate the DeepSparse's performance. The numbers below were run on a 4 core `c6i.2xlarge` instance in AWS. + +### ONNX Runtime Baseline + +As a baseline, let's check out ONNX Runtime's performance on YOLOv5s. Make sure you have ORT installed (`pip install onnxruntime`). + +```bash +deepsparse.benchmark \ + zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none \ + -b 64 -s sync -nstreams 1 \ + -e onnxruntime + +> Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/base-none +> Batch Size: 64 +> Scenario: sync +> Throughput (items/sec): 12.2369 +``` +ONNX Runtime achieves 12 items/second with batch 64. + +### DeepSparse Speedup +Now, let's run DeepSparse on an inference-optimized sparse version of YOLOv5s. This model has been 85% pruned and quantized. + +```bash +deepsparse.benchmark \ + zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned85_quant-none \ + -b 64 -s sync -nstreams 1 + +> Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned85_quant-none +> Batch Size: 64 +> Scenario: sync +> Throughput (items/sec): 72.55 +``` +DeepSparse achieves 73 items/second, a 5.5x speed-up over ONNX Runtime! + +## DeepSparse Engine +Engine is the lowest-level API for interacting with DeepSparse. As much as possible, we recommended using the Pipeline API but Engine is available if you want to handle pre- or post-processing yourself. + +With Engine, we can compile an ONNX file and run inference on raw tensors. + +Here's an example, using a 85% pruned-quantized YOLOv5s model from SparseZoo: + +```python +from deepsparse import Engine +from deepsparse.utils import generate_random_inputs, model_to_path +import numpy as np + +# download onnx from sparsezoo and compile with batchsize 1 +sparsezoo_stub = "zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned85_quant-none" +batch_size = 1 +compiled_model = Engine( + model=sparsezoo_stub, # sparsezoo stub or path to local ONNX + batch_size=batch_size # defaults to batch size 1 +) +# input is raw numpy tensors, output is raw scores for classes +inputs = generate_random_inputs(model_to_path(sparsezoo_stub), batch_size) +output = compiled_model(inputs) + +print(output[0].shape) +print(output[0]) + +# (1,25200, 85) +# [array([[[5.54789925e+00, 4.28643513e+00, 9.98156166e+00, ..., +# ... +# -6.13238716e+00, -6.80812788e+00, -5.50403357e+00]]]]], dtype=float32)] +``` + +## DeepSparse Pipeline +Pipeline is the default interface for interacting with DeepSparse. + +Like Hugging Face Pipelines, DeepSparse Pipelines wrap pre- and post-processing around the inference performed by the Engine. This creates a clean API that allows you to pass raw text and images to DeepSparse and receive the post-processed predictions, making it easy to add DeepSparse to your application. + +Let's start by downloading a sample image: +```bash +wget -O basilica.jpg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/yolo/sample_images/basilica.jpg +``` + +We will use the `Pipeline.create()` constructor to create an instance of an object detection Pipeline with a 85% pruned version of YOLOv5s trained on `coco`. We can then pass images to the `Pipeline` and receive the predictions. 
All the pre-processing (such as resizing the images and running NMS) is handled by the `Pipeline`. + +```python +from deepsparse import Pipeline + +# download onnx from sparsezoo and compile with batch size 1 +sparsezoo_stub = "zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned85_quant-none" +yolo_pipeline = Pipeline.create( + task="yolo", + model_path=sparsezoo_stub, # sparsezoo stub or path to local ONNX +) +images = ["basilica.jpg"] + +# run inference on image file +pipeline_outputs = yolo_pipeline(images=images) +print(pipeline_outputs.boxes) +print(pipeline_outputs.labels) + +# [[[262.56866455078125, 483.48693108558655, 514.8401184082031, 611.7606239318848], [542.7222747802734, 385.72591066360474, 591.0432586669922, 412.0340189933777], [728.4929351806641, 403.6355793476105, 769.6295471191406, 493.7961976528168], [466.83229064941406, 383.6878204345703, 530.7117462158203, 408.8705735206604], [309.2399597167969, 396.0068359375, 362.10223388671875, 435.58393812179565], [56.86535453796387, 409.39830899238586, 99.50672149658203, 497.8857614994049], [318.8877868652344, 388.9980583190918, 449.08460998535156, 587.5987024307251], [793.9356079101562, 390.5112290382385, 861.0441284179688, 489.4586777687073], [449.93934631347656, 441.90707445144653, 574.4951934814453, 539.5000758171082], [99.09783554077148, 381.93165946006775, 135.13665390014648, 458.19711089134216], [154.37461853027344, 386.8395175933838, 188.95138549804688, 469.1738815307617], [14.558289527893066, 396.7127945423126, 54.229820251464844, 487.2396695613861], [704.1891632080078, 398.2202727794647, 739.6305999755859, 471.5654203891754], [731.9091796875, 380.60836935043335, 761.627197265625, 414.56129932403564]]] << list of bounding boxes >> + +# [['3.0', '2.0', '0.0', '2.0', '2.0', '0.0', '0.0', '0.0', '3.0', '0.0', '0.0', '0.0', '0.0', '0.0']] << list of label ids >> +``` + +### Use Case Specific Arguments +The YOLOv5 pipeline contains additional arguments for configuring a Pipeline. + +#### Image Shape + +DeepSparse runs with static shapes. By default, YOLOv5 inferences run with images of shape 640x640. The Pipeline accepts images of any size and scales the images to image shape specified by the ONNX graph. + +We can override the image shape used by DeepSparse with the `image_size` argument. In the example below, we run the inferences at 320x320. + +```python +from deepsparse import Pipeline + +# download onnx from sparsezoo and compile with batch size 1 +sparsezoo_stub = "zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned85_quant-none" +yolo_pipeline = Pipeline.create( + task="yolo", + model_path=sparsezoo_stub, # sparsezoo stub or path to local ONNX + image_size=(320,320) +) +images = ["basilica.jpg"] + +# run inference on image file +pipeline_outputs = yolo_pipeline(images=images) +print(pipeline_outputs.boxes) +print(pipeline_outputs.labels) +``` + +#### Class Names +We can specify class names for the labels by passing a dictionary. In the example below, we just use +the first 4 classes from COCO for the sake of a quick example. 
+ +```python +from deepsparse import Pipeline + +# download onnx from sparsezoo and compile with batch size 1 +sparsezoo_stub = "zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned85_quant-none" +yolo_pipeline = Pipeline.create( + task="yolo", + model_path=sparsezoo_stub, # sparsezoo stub or path to local ONNX + class_names={"0":"person", "1":"bicycle", "2":"car", "3":"motorcycle"} + +) +images = ["basilica.jpg"] + +# run inference on image file +pipeline_outputs = yolo_pipeline(images=images) +print(pipeline_outputs.labels) +# [['motorcycle', 'car', 'person', 'car', 'car', 'person', 'person', 'person', 'motorcycle', 'person', 'person', 'person', 'person', 'person']] +``` + +#### IOU and Conf Threshold +We can also configure the thresholds for making detections in YOLO. + +```python +from deepsparse import Pipeline + +# download onnx from sparsezoo and compile with batch size 1 +sparsezoo_stub = "zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned85_quant-none" +yolo_pipeline = Pipeline.create( + task="yolo", + model_path=sparsezoo_stub +) + +images = ["basilica.jpg"] + +# low threshold inference +pipeline_outputs_low_conf = yolo_pipeline(images=images, iou_thres=0.3, conf_thres=0.1) +print(len(pipeline_outputs_low_conf.boxes[0])) +# 37 <> + +# high threshold inference +pipeline_outputs_high_conf = yolo_pipeline(images=images, iou_thres=0.5, conf_thres=0.8) +print(len(pipeline_outputs_high_conf.boxes[0])) +# 1 <> +``` + +### Cross Use Case Functionality + +Check out the [Pipeline User Guide](../../user-guide/deepsparse-pipelines.md) for more details on configuring a Pipeline. + +## DeepSparse Server + +Built on the popular FastAPI and Uvicorn stack, DeepSparse Server enables you to set up a REST endpoint for serving inferences over HTTP. Since DeepSparse Server wraps the Pipeline API, it inherits all the utilities provided by Pipelines. + +Spin up the server: +```bash +deepsparse.server \ + --task yolo \ + --model_path zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned85_quant-none +``` + +Making a request. +```python +import requests +import json + +url = 'http://0.0.0.0:5543/predict/from_files' +path = ['basilica.jpg'] # list of images for inference +files = [('request', open(img, 'rb')) for img in path] +resp = requests.post(url=url, files=files) +annotations = json.loads(resp.text) # dictionary of annotation results +bounding_boxes = annotations["boxes"] +labels = annotations["labels"] +print(labels) + +# [['3.0', '2.0', '2.0', '0.0', '0.0', '2.0', '0.0', '0.0', '0.0', '3.0', '0.0', '0.0', '0.0', '0.0', '3.0', '9.0', '0.0', '2.0', '0.0', '0.0']] +``` + +#### Use Case Specific Arguments + +To use the `image_size` or `class_names` argument, create a Server configuration file for passing the arguments via kwargs. 
+ +This configuration file sets `class_names` to `coco`: + +```yaml +# yolo-config.yaml +endpoints: + - task: yolo + model: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned85_quant-none + kwargs: + class_names: + '0': person + '1': bicycle + '2': car + '3': motorcycle + image_size: 320 +``` + +Start the server: +```bash +deepsparse.server --config-file yolo-config.yaml +``` + +Making a request: +```python +import requests, json + +url = 'http://0.0.0.0:5543/predict/from_files' +path = ['basilica.jpg'] # list of images for inference +files = [('request', open(img, 'rb')) for img in path] +resp = requests.post(url=url, files=files) +annotations = json.loads(resp.text) +bounding_boxes = annotations["boxes"] +labels = annotations["labels"] +print(labels) +# [['person', 'person', 'car', 'person', 'motorcycle', 'person', 'person', 'person', 'motorcycle', 'person']] +``` +## Using a Custom ONNX File +Apart from using models from the SparseZoo, DeepSparse allows you to define custom ONNX files when deploying a model. + +The first step is to obtain the YOLOv5 ONNX model. This could be a YOLOv5 model you have trained and converted to ONNX. +In this case, let's demonstrate by converting a YOLOv5 model to ONNX using the `ultralytics` package: +```python +from ultralytics import YOLO + +# Load a model +model = YOLO("yolov5nu.pt") # load a pretrained model +success = model.export(format="onnx") # export the model to ONNX format +``` +Download a sample image for detection: +```bash +wget -O basilica.jpg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/yolo/sample_images/basilica.jpg + +``` +Next, run the DeepSparse object detection pipeline with the custom ONNX file: + +```python +from deepsparse import Pipeline + +# download onnx from sparsezoo and compile with batch size 1 +yolo_pipeline = Pipeline.create( + task="yolo", + model_path="yolov5nu.onnx", # sparsezoo stub or path to local ONNX +) +images = ["basilica.jpg"] + +# run inference on image file +pipeline_outputs = yolo_pipeline(images=images) +print(pipeline_outputs.boxes) +print(pipeline_outputs.labels) +# [[[-0.8809833526611328, 5.1244752407073975, 27.885415077209473, 57.20366072654724], [-9.014896631240845, -2.4366320967674255, 21.488688468933105, 37.2245477437973], [14.241515636444092, 11.096746131777763, 30.164274215698242, 22.02291651070118], [7.107024908065796, 5.017698150128126, 15.09239387512207, 10.45704211294651]]] +# [['8367.0', '1274.0', '8192.0', '6344.0']] +``` + +### Cross Use Case Functionality + +Check out the [Server User Guide](../../user-guide/deepsparse-server.md) for more details on configuring a Server. diff --git a/docs/use-cases/nlp/question-answering.md b/docs/use-cases/nlp/question-answering.md new file mode 100644 index 0000000000..4f49cf5cf9 --- /dev/null +++ b/docs/use-cases/nlp/question-answering.md @@ -0,0 +1,316 @@ + + +# Deploying Question Answering Models with DeepSparse + +This page explains how to benchmark and deploy a question answering model with DeepSparse. + +There are three interfaces for interacting with DeepSparse: +- **Engine** is the lowest-level API. It enables you to compile a model and run inference on raw input tensors. + +- **Pipeline** is the default DeepSparse API. Similar to Hugging Face Pipelines, it wraps Engine with pre-processing +and post-processing steps, allowing you to make requests on raw data and receive post-processed predictions. 
+
+- **Server** is a REST API wrapper around Pipelines built on [FastAPI](https://fastapi.tiangolo.com/) and [Uvicorn](https://www.uvicorn.org/). It enables you to start a model serving
+endpoint running DeepSparse with a single CLI.
+
+We will walk through an example of each.
+
+## Installation Requirements
+
+This use case requires the installation of [DeepSparse Server](../../user-guide/installation.md).
+
+Confirm your machine is compatible with our [hardware requirements](../../user-guide/hardware-support.md).
+
+## Benchmarking
+
+We can use the benchmarking utility to demonstrate DeepSparse's performance. We ran the numbers below on a 4 core AWS `c6i.2xlarge` instance.
+
+### ONNX Runtime Baseline
+
+As a baseline, let's check out ONNX Runtime's performance on BERT. Make sure you have ORT installed (`pip install onnxruntime`).
+
+```bash
+deepsparse.benchmark \
+  zoo:nlp/question_answering/obert-base/pytorch/huggingface/squad/base-none \
+  -b 64 -s sync -nstreams 1 -i [64,384] \
+  -e onnxruntime
+
+> Original Model Path: zoo:nlp/question_answering/obert-base/pytorch/huggingface/squad/base-none
+> Batch Size: 64
+> Scenario: sync
+> Throughput (items/sec): 5.5482
+```
+
+ONNX Runtime achieves 5.5 items/second with batch 64 and sequence length 384.
+
+### DeepSparse Speedup
+Now, let's run DeepSparse on an inference-optimized sparse version of BERT. This model has been 90% pruned and quantized, while retaining >99% accuracy of the dense baseline on the [SQuAD](https://huggingface.co/datasets/squad) dataset.
+```bash
+deepsparse.benchmark \
+  zoo:nlp/question_answering/obert-base/pytorch/huggingface/squad/pruned90_quant-none \
+  -b 64 -s sync -nstreams 1 -i [64,384] \
+  -e deepsparse
+
+> Original Model Path: zoo:nlp/question_answering/obert-base/pytorch/huggingface/squad/pruned90_quant-none
+> Batch Size: 64
+> Scenario: sync
+> Throughput (items/sec): 31.6372
+```
+DeepSparse achieves 31.6 items/second, a 5.8x speed-up over ONNX Runtime!
+
+## DeepSparse Engine
+
+Engine is the lowest-level API for interacting with DeepSparse. As much as possible, we recommend using the Pipeline API, but Engine is available if you want to handle pre- or post-processing yourself.
+
+With Engine, we can compile an ONNX file and run inference on raw tensors.
+
+Here's an example, using a 90% pruned-quantized BERT trained on SQuAD from SparseZoo:
+```python
+from deepsparse import Engine
+from deepsparse.utils import generate_random_inputs, model_to_path
+import numpy as np
+
+# download onnx from sparsezoo and compile with batchsize 1
+sparsezoo_stub = "zoo:nlp/question_answering/obert-base/pytorch/huggingface/squad/pruned90_quant-none"
+batch_size = 1
+compiled_model = Engine(
+    model=sparsezoo_stub,   # sparsezoo stub or path to local ONNX
+    batch_size=batch_size   # defaults to batch size 1
+)
+
+# input is raw numpy tensors, output is raw logits
+inputs = generate_random_inputs(model_to_path(sparsezoo_stub), batch_size)
+output = compiled_model(inputs)
+print(output)
+
+# [array([[-6.904723 , -7.2960553, -6.903628 , -6.930577 , -6.899986 ,
+# .....
+# -6.555915 , -6.6454444, -6.4477777, -6.8030496]], dtype=float32)]
+```
+
+## DeepSparse Pipelines
+Pipeline is the default interface for interacting with DeepSparse.
+
+Like Hugging Face Pipelines, DeepSparse Pipelines wrap pre- and post-processing around the inference performed by the Engine.
This creates a clean API that allows you to pass raw text and images to DeepSparse and receive the post-processed predictions, making it easy to add DeepSparse to your application. + +We will use the `Pipeline.create()` constructor to create an instance of a question answering Pipeline with a 90% pruned-quantized version of BERT trained on SQuAD. We can then pass raw text to the `Pipeline` and receive the predictions. All of the pre-processing (such as tokenizing the input) is handled by the `Pipeline`. +```python +from deepsparse import Pipeline +task = "question-answering" +qa_pipeline = Pipeline.create( + task=task, + model_path="zoo:nlp/question_answering/obert-base/pytorch/huggingface/squad/pruned90_quant-none", + ) + +q_context = "DeepSparse is sparsity-aware inference runtime offering GPU-class performance on CPUs and APIs to integrate ML into your application" +question = "What is DeepSparse?" +output = qa_pipeline(question=question, context=q_context) +print(output.answer) +# sparsity-aware inference runtime +``` + +### Use Case Specific Arguments +The Question Answering Pipeline contains additional arguments for configuring a `Pipeline`. + +#### Sequence Length, Question Length + +The `sequence_length` and `max_question_length` arguments adjusts the ONNX graph to handle a specific sequence length. In the DeepSparse Pipelines, the tokenizers pad the input. As such, using shorter sequence lengths will have better performance. + +The example below compiles the model and runs inference with sequence length 64 and truncates any question longer than 32 tokens. + +```python +from deepsparse import Pipeline + +# download onnx from sparsezoo and compile with batch size 1 +sparsezoo_stub = "zoo:nlp/question_answering/obert-base/pytorch/huggingface/squad/pruned90_quant-none" +qa_pipeline = Pipeline.create( + task="question-answering", + model_path=sparsezoo_stub, # sparsezoo stub or path to local ONNX + sequence_length=64, + max_question_length=32, +) + +# run inference +q_context = "DeepSparse is sparsity-aware inference runtime offering GPU-class performance on CPUs and APIs to integrate ML into your application" +question = "What is DeepSparse?" +output = qa_pipeline(question=question, context=q_context) +print(output.answer) + +# sparsity-aware inference runtime + +``` +Alternatively, you can pass a list of sequence lengths, creating a "bucketable" pipeline. Under the hood, the DeepSparse Pipeline will compile multiple versions of the engine (utilizing a shared scheduler) and direct inputs towards the smallest bucket into which it fits. + +The example below creates a bucket for smaller input lengths (16 tokens) and for larger input lengths (128 tokens). +```python +from deepsparse import Pipeline, Context + +# download onnx from sparsezoo and compile with batch size 1 +sparsezoo_stub = "zoo:nlp/question_answering/obert-base/pytorch/huggingface/squad/pruned90_quant-none" +task = "question-answering" + +qa_pipeline = Pipeline.create( + task=task, + model_path=sparsezoo_stub, # sparsezoo stub or path to local ONNX + sequence_length=[64, 128], # creates bucketed pipeline + max_question_length=32, + context=Context(num_streams=1) +) + +# run inference +q_context = "DeepSparse is sparsity-aware inference runtime offering GPU-class performance on CPUs and APIs to integrate ML into your application" +question = "What is DeepSparse?" 
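+# note: with sequence_length=[64, 128], this short question and context pair should be
+# routed to the 64-token bucket; longer inputs would fall through to the 128-token bucket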
+output = qa_pipeline(question=question, context=q_context)
+print(output.answer)
+# sparsity-aware inference runtime
+```
+
+#### Document Stride
+
+If the context is too long to fit in the max sequence length of the model, the DeepSparse Pipeline splits the context into several overlapping chunks and runs the inference on each chunk. The `doc_stride` argument controls the number of token overlaps between the chunks.
+
+```python
+from deepsparse import Pipeline, Context
+
+# download onnx from sparsezoo and compile with batch size 1
+sparsezoo_stub = "zoo:nlp/question_answering/obert-base/pytorch/huggingface/squad/pruned90_quant-none"
+task = "question-answering"
+
+qa_pipeline = Pipeline.create(
+    task=task,
+    model_path=sparsezoo_stub,  # sparsezoo stub or path to local ONNX
+    sequence_length=24,         # uses sequence length 24
+    max_question_length=8,
+    doc_stride=4
+)
+
+# run inference
+q_context = "I have been trying to accelerate my inference workloads. DeepSparse is a CPU runtime that helps me."
+question = "What is DeepSparse?"
+output = qa_pipeline(question=question, context=q_context)
+print(output.answer)
+# CPU runtime
+```
+
+### Cross Use Case Functionality
+Check out the [Pipeline User Guide](../../user-guide/deepsparse-pipelines.md) for more details on configuring a Pipeline.
+
+## DeepSparse Server
+
+DeepSparse Server is built on top of FastAPI and Uvicorn, enabling you to set up a REST endpoint for serving inferences over HTTP. Since DeepSparse Server wraps the Pipeline API, it inherits all the utilities provided by Pipelines.
+
+The CLI command below launches a question answering pipeline with a 90% pruned-quantized BERT model:
+
+```bash
+deepsparse.server \
+  --task question-answering \
+  --model_path zoo:nlp/question_answering/obert-base/pytorch/huggingface/squad/pruned90_quant-none # or path/to/onnx
+```
+You should see Uvicorn report that it is running on http://0.0.0.0:5543. Once launched, a /docs path is created with full endpoint descriptions and support for making sample requests.
+
+Here is an example client request, using the Python requests library for formatting the HTTP:
+```python
+import requests
+
+# Uvicorn is running on this port
+url = 'http://0.0.0.0:5543/predict'
+
+# send the data
+obj = {
+    "question": "What is DeepSparse?",
+    "context": "DeepSparse is sparsity-aware inference runtime offering GPU-class performance on CPUs and APIs to integrate ML into your application",
+}
+
+resp = requests.post(url=url, json=obj)
+
+# receive the post-processed output
+print(resp.text)
+# {"score":23.620140075683594,"answer":"sparsity-aware inference runtime","start":14,"end":46}
+```
+
+#### Use Case Specific Arguments
+To use task-specific arguments, create a server configuration file for passing the arguments via `kwargs`.
+
+This configuration file sets sequence length to 24, max question length to 8, and document stride to 4:
+```yaml
+# question-answering-config.yaml
+endpoints:
+  - task: question-answering
+    model: zoo:nlp/question_answering/obert-base/pytorch/huggingface/squad/pruned90_quant-none
+    kwargs:
+      sequence_length: 24     # uses sequence length 24
+      max_question_length: 8
+      doc_stride: 4
+```
+Spin up the server:
+
+```bash
+deepsparse.server --config-file question-answering-config.yaml
+```
+Making a request:
+```python
+import requests

+# Uvicorn is running on this port
+url = "http://localhost:5543/predict"
+
+# send the data
+obj = {
+    "question": "What is DeepSparse?",
+    "context": "I have been trying to accelerate my inference workloads. DeepSparse is a CPU runtime that helps me."
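+    # note: with sequence_length=24 and doc_stride=4 set in the config above, this longer
+    # context is split into overlapping chunks on the server, as described under Document Stride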
+}
+
+resp = requests.post(url, json=obj)
+# receive the post-processed output
+print(resp.text)
+# {"score":19.74649429321289,"answer":"CPU runtime","start":73,"end":84}
+```
+## Using a Custom ONNX File
+Apart from using models from the SparseZoo, DeepSparse allows you to deploy question answering pipelines with custom ONNX files.
+
+The first step is to obtain the ONNX model. You can obtain the file by converting your model to ONNX after training.
+Click Download on the [DistilBERT - SQuAD page](https://sparsezoo.neuralmagic.com/models/nlp%2Fquestion_answering%2Fdistilbert-none%2Fpytorch%2Fhuggingface%2Fsquad%2Fpruned80_quant-none-vnni) to download an ONNX DistilBERT model for demonstration.
+
+Extract the downloaded file and create a folder containing the following required files:
+- `config.json`
+- `tokenizer.json`
+- `model.onnx`
+
+Use the folder as the model path to the question answering pipeline:
+```python
+from deepsparse import Pipeline
+from sparsezoo import Model
+
+task = "question-answering"
+stub = "zoo:nlp/question_answering/distilbert-none/pytorch/huggingface/squad/pruned80_quant-none-vnni"
+model = Model(stub)
+# the deployment folder contains the required model.onnx, config.json, and tokenizer.json files
+model_path = f"{model.path}/deployment"
+qa_pipeline = Pipeline.create(
+    task=task,
+    model_path=model_path,
+)
+
+q_context = "DeepSparse is sparsity-aware inference runtime offering GPU-class performance on CPUs and APIs to integrate ML into your application"
+question = "What is DeepSparse?"
+output = qa_pipeline(question=question, context=q_context)
+print(output.answer)
+# sparsity-aware
+```
+### Cross Use Case Functionality
+
+Check out the [Server User Guide](../../user-guide/deepsparse-server.md) for more details on configuring the Server.
diff --git a/docs/use-cases/nlp/sentiment-analysis.md b/docs/use-cases/nlp/sentiment-analysis.md
new file mode 100644
index 0000000000..45422a21cf
--- /dev/null
+++ b/docs/use-cases/nlp/sentiment-analysis.md
@@ -0,0 +1,342 @@
+
+
+# Deploying Sentiment Analysis Models with DeepSparse
+
+This page explains how to benchmark and deploy a sentiment analysis model with DeepSparse.
+
+There are three interfaces for interacting with DeepSparse:
+- **Engine** is the lowest-level API. It enables you to compile a model and run inference on raw input tensors.
+
+- **Pipeline** is the default DeepSparse API. Similar in concept to Hugging Face Pipelines, it wraps Engine with pre-processing and post-processing, allowing you to make requests on raw data and receive post-processed predictions.
+
+- **Server** is a REST API wrapper around Pipelines built on FastAPI and Uvicorn. It enables you to stand up a model serving endpoint running DeepSparse with a single CLI.
+
+## Installation Requirements
+
+This use case requires the installation of [DeepSparse Server](../../user-guide/installation.md).
+
+Confirm your machine is compatible with our [hardware requirements](../../user-guide/hardware-support.md).
+
+## Benchmarking
+
+We can use the benchmarking utility to demonstrate DeepSparse's performance. We ran the numbers below on a 4 core AWS `c6i.2xlarge` instance.
+
+### ONNX Runtime Baseline
+
+As a baseline, let's check out ONNX Runtime's performance on BERT. Make sure you have ORT installed (`pip install onnxruntime`).
+
+```bash
+deepsparse.benchmark \
+  zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/base-none \
+  -b 64 -s sync -nstreams 1 \
+  -e onnxruntime
+
+> Original Model Path: zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/base-none
+> Batch Size: 64
+> Scenario: sync
+> Throughput (items/sec): 19.61
+```
+
+ONNX Runtime achieves 20 items/second with batch 64 and sequence length 128.
+
+### DeepSparse Speedup
+
+Now, let's run DeepSparse on an inference-optimized sparse version of BERT. This model has been 90% pruned and quantized, while
+retaining >99% accuracy of the dense baseline on the SST2 dataset.
+
+```bash
+deepsparse.benchmark \
+  zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none \
+  -b 64 -s sync -nstreams 1 \
+  -e deepsparse
+
+> Original Model Path: zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none
+> Batch Size: 64
+> Scenario: sync
+> Throughput (items/sec): 125.80
+```
+
+DeepSparse achieves 126 items/second, a 6.4x speed-up over ONNX Runtime!
+
+## DeepSparse Engine
+
+Engine is the lowest-level API for interacting with DeepSparse. As much as possible, we recommend you use the Pipeline API, but Engine is available as needed if you want to handle pre- or post-processing yourself.
+
+With Engine, we can compile an ONNX file and run inference on raw tensors.
+
+Here's an example, using a 90% pruned-quantized BERT trained on SST2 from SparseZoo:
+
+```python
+from deepsparse import Engine
+from deepsparse.utils import generate_random_inputs, model_to_path
+import numpy as np
+
+# download onnx from sparsezoo and compile with batchsize 1
+sparsezoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none"
+batch_size = 1
+compiled_model = Engine(
+    model=sparsezoo_stub,   # sparsezoo stub or path to local ONNX
+    batch_size=batch_size   # defaults to batch size 1
+)
+
+# input is raw numpy tensors, output is raw scores for classes
+inputs = generate_random_inputs(model_to_path(sparsezoo_stub), batch_size)
+output = compiled_model(inputs)
+print(output)
+
+# >> [array([[-0.3380675 ,  0.09602544]], dtype=float32)]
+```
+
+## DeepSparse Pipelines
+
+Pipeline is the default interface for interacting with DeepSparse.
+
+Just like Hugging Face Pipelines, DeepSparse Pipelines wrap pre- and post-processing around the inference performed by the Engine.
+This creates a clean API that allows you to pass raw text to DeepSparse and receive back the post-processed prediction,
+making it easy to add DeepSparse to your application.
+
+We will use the `Pipeline.create()` constructor to create an instance of a sentiment analysis Pipeline
+with a 90% pruned-quantized version of BERT trained on SST2. We can then pass the Pipeline raw text and receive the predictions.
+All of the pre-processing (such as tokenizing the input) is handled by the Pipeline.
+
+```python
+from deepsparse import Pipeline
+
+# download onnx from sparsezoo and compile with batch size 1
+sparsezoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none"
+batch_size = 1
+sa_pipeline = Pipeline.create(
+    task="sentiment-analysis",
+    model_path=sparsezoo_stub,  # sparsezoo stub or path to local ONNX
+    batch_size=1                # default batch size is 1
+)
+
+# run inference
+prediction = sa_pipeline("The sentiment analysis pipeline is fast and easy to use")
+print(prediction)
+
+# >>> labels=['positive'] scores=[0.9955807328224182]
+```
+
+### Use Case Specific Arguments
+
+The Sentiment Analysis Pipeline contains additional arguments for configuring a Pipeline.
+
+#### Sequence Length
+
+DeepSparse uses static input shapes. We can use the `sequence_length` argument to adjust the ONNX graph to handle a specific sequence length. Inside the DeepSparse pipelines, the tokenizers pad the input. As such, using shorter sequence lengths will have better performance.
+
+The example below compiles the model and runs inference with sequence length 64.
+
+```python
+from deepsparse import Pipeline
+
+# download onnx from sparsezoo and compile with batch size 1
+sparsezoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none"
+sa_pipeline = Pipeline.create(
+    task="sentiment-analysis",
+    model_path=sparsezoo_stub,  # sparsezoo stub or path to local ONNX
+    batch_size=1,               # default batch size is 1
+    sequence_length=64          # default sequence length is 128
+)
+
+# run inference
+prediction = sa_pipeline("The sentiment analysis pipeline is fast and easy to use")
+print(prediction)
+
+# >>> labels=['positive'] scores=[0.9955807328224182]
+```
+
+If your input data has a variable distribution of sequence lengths, you can simulate dynamic shape inference by passing a list of sequence lengths to DeepSparse, which creates a "bucketable" pipeline. Under the hood, the DeepSparse Pipeline compiles multiple versions of the model at each sequence length (utilizing a shared scheduler) and directs inputs towards the smallest bucket into which each fits.
+
+The example below creates a bucket for smaller input lengths (16 tokens) and for larger input lengths (128 tokens).
+
+```python
+from deepsparse import Pipeline, Context
+
+# download onnx from sparsezoo and compile with batch size 1
+sparsezoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none"
+sa_pipeline = Pipeline.create(
+    task="sentiment-analysis",
+    model_path=sparsezoo_stub,      # sparsezoo stub or path to local ONNX
+    batch_size=1,                   # default batch size is 1
+    sequence_length=[16, 128],      # creates bucketed pipeline
+    context=Context(num_streams=1)  # creates scheduler with one stream
+)
+
+# run inference on short sequence
+prediction = sa_pipeline("I love short sentences!")
+print(prediction)
+
+# run inference on long sequence
+prediction = sa_pipeline("Normal sized sequences take a lot longer to run but are I still like them a lot because of the speedup from DeepSparse")
+print(prediction)
+
+# >>> labels=['positive'] scores=[0.9988369941711426]
+# >>> labels=['positive'] scores=[0.9587154388427734]
+```
+
+#### Return All Scores
+
+The `return_all_scores` argument allows you to specify whether to return the prediction as the argmax of class predictions or
+to return all scores as a list for each result in the batch.
+
+Here is an example with batch size 1 and batch size 2:
+
+```python
+from deepsparse import Pipeline
+
+# download onnx from sparsezoo and compile with batch size 1
+sparsezoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none"
+sa_pipeline_b1 = Pipeline.create(
+    task="sentiment-analysis",
+    model_path=sparsezoo_stub,  # sparsezoo stub or path to local ONNX
+    batch_size=1,               # default batch size is 1
+    return_all_scores=True      # default is false
+)
+
+# download onnx from sparsezoo and compile with batch size 2
+batch_size = 2
+sa_pipeline_b2 = Pipeline.create(
+    task="sentiment-analysis",
+    model_path=sparsezoo_stub,  # sparsezoo stub or path to local ONNX
+    batch_size=batch_size,      # default batch size is 1
+    return_all_scores=True      # default is false
+)
+
+# run inference with b1
+sequences_b1 = ["Returning all scores is a cool configuration option"]
+prediction_b1 = sa_pipeline_b1(sequences_b1)
+print(prediction_b1)
+
+# run inference with b2
+sequences_b2 = sequences_b1 * batch_size
+prediction_b2 = sa_pipeline_b2(sequences_b2)
+print(prediction_b2)
+
+# >>> labels=[['positive', 'negative']] scores=[[0.9845395088195801, 0.015460520051419735]]
+# >>> labels=[['positive', 'negative'], ['positive', 'negative']] scores=[[0.9845395088195801, 0.015460520051419735], [0.9845395088195801, 0.015460520051419735]]
+```
+
+### Cross Use Case Functionality
+
+Check out the [Pipeline User Guide](../../user-guide/deepsparse-pipelines.md) for more details on configuring a Pipeline.
+
+## DeepSparse Server
+
+Built on the popular FastAPI and Uvicorn stack, DeepSparse Server enables you to set up a REST endpoint
+for serving inferences over HTTP. Since DeepSparse Server wraps the Pipeline API, it
+inherits all of the utilities provided by Pipelines.
+
+The CLI command below launches a sentiment analysis pipeline with a 90% pruned-quantized
+BERT model identified by its SparseZoo stub:
+
+```bash
+deepsparse.server \
+  --task sentiment-analysis \
+  --model_path "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none" # or path/to/onnx
+```
+
+You should see Uvicorn report that it is running on `http://0.0.0.0:5543`. Once launched, a `/docs` path is
+created with full endpoint descriptions and support for making sample requests.
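+
+As a quick check (a minimal sketch, assuming the default host and port shown above), you can hit the FastAPI-generated `/docs` page to confirm the endpoint is live:
+```python
+import requests
+
+# the auto-generated docs page should return HTTP 200 once the server is ready
+resp = requests.get("http://0.0.0.0:5543/docs")
+print(resp.status_code)
+```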
+ +Here is an example client request, using the Python `requests` library for formatting the HTTP: +```python +import requests + +# Uvicorn is running on this port +url = 'http://0.0.0.0:5543/predict' + +# send the data +obj = {"sequences": "Sending requests to DeepSparse Server is fast and easy!"} +resp = requests.post(url=url, json=obj) + +# recieve the post-processed output +print(resp.text) +# >> {"labels":["positive"],"scores":[0.9330279231071472]} +``` + +### Use Case Specific Arguments + +To use the `sequence_length` and `return_all_scores` arguments, we can a Server configuration file, passing the arguments via `kwargs` + +This configuration file sets sequence length to 64 and returns all scores: + +```yaml +# sentiment-analysis-config.yaml +endpoints: + - task: sentiment-analysis + model: zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none + kwargs: + sequence_length: 64 # uses sequence length 64 + return_all_scores: True # returns all scores +``` + +Spinning up: +```bash +deepsparse.server \ + --config-file sentiment-analysis-config.yaml +``` + +Making a request: +```python +import requests + +# Uvicorn is running on this port +url = 'http://0.0.0.0:5543/predict' + +# send the data +obj = {"sequences": "Sending requests to DeepSparse Server is fast and easy!"} +resp = requests.post(url=url, json=obj) + +# recieve the post-processed output +print(resp.text) +# >> {"labels":[["positive","negative"]],"scores":[[0.9330279231071472,0.06697207689285278]]} +``` +## Using a Custom ONNX File +Apart from using models from the SparseZoo, DeepSparse allows you to deploy sentiment analysis pipelines with custom ONNX files. + +The first step is to obtain the ONNX model. You can obtain the file by converting your model to ONNX after training. +Click Download on the [oBERT base uncased - sst2 page](https://sparsezoo.neuralmagic.com/models/nlp%2Fsentiment_analysis%2Fobert-base%2Fpytorch%2Fhuggingface%2Fsst2%2Fpruned90_quant-none) +to download an ONNX oBERT base uncased model for demonstration. + +Extract the downloaded file and create a folder containing the following required files: +- `config.json` +- `tokenizer.json` +- `model.onnx` + +Use the folder as the model path to the sentiment analysis pipeline: +```python +from deepsparse import Pipeline + +# download onnx from sparsezoo and compile with batch size 1 +batch_size = 1 +sa_pipeline = Pipeline.create( + task="sentiment-analysis", + model_path="sentiment-analysis", # sparsezoo stub or path to local ONNX + batch_size=1 # default batch size is 1 +) + +# run inference +prediction = sa_pipeline("The sentiment analysis pipeline is fast and easy to use") +print(prediction) +# labels=['positive'] scores=[0.9955807328224182] + +``` +### Cross Use Case Functionality + +Check out the [Server User Guide](../../user-guide/deepsparse-server.md) for more details on configuring the Server. diff --git a/docs/use-cases/nlp/text-classification.md b/docs/use-cases/nlp/text-classification.md new file mode 100644 index 0000000000..26d8ecfe83 --- /dev/null +++ b/docs/use-cases/nlp/text-classification.md @@ -0,0 +1,438 @@ + + +# Deploying Text Classification Models with DeepSparse + +This page explains how to benchmark and deploy a text classification model with DeepSparse. + +There are three interfaces for interacting with DeepSparse: +- **Engine** is the lowest-level API. It enables you to compile a model and run inference on raw input tensors. + +- **Pipeline** is the default DeepSparse API. 
Similar to Hugging Face Pipelines, it wraps Engine with pre-processing +and post-processing steps, allowing you to make requests on raw data and receive post-processed predictions. + +- **Server** is a REST API wrapper around Pipelines built on [FastAPI](https://fastapi.tiangolo.com/) and [Uvicorn](https://www.uvicorn.org/). It enables you to start a model serving +endpoint running DeepSparse with a single CLI. + +## Installation Requirements + +This use case requires the installation of [DeepSparse Server](../../user-guide/installation.md). + +Confirm your machine is compatible with our [hardware requirements](../../user-guide/hardware-support.md). + +## Benchmarking + +We can use the benchmarking utility to demonstrate the DeepSparse's performance. We ran the numbers below on a 4 core AWS `c6i.2xlarge` instance. + +### ONNX Runtime Baseline + +As a baseline, let's check out ONNX Runtime's performance on oBERT. Make sure you have ORT installed (`pip install onnxruntime`). + +```bash +deepsparse.benchmark \ + zoo:nlp/text_classification/obert-base/pytorch/huggingface/mnli/base-none \ + -b 64 -s sync -nstreams 1 \ + -e onnxruntime + +> Original Model Path: zoo:nlp/text_classification/obert-base/pytorch/huggingface/mnli/base-none +> Batch Size: 64 +> Scenario: sync +> Throughput (items/sec): 19.3496 +``` + +ONNX Runtime achieves 19 items/second with batch 64 and sequence length 128. + +### DeepSparse Speedup +Now, let's run DeepSparse on an inference-optimized sparse version of oBERT. This model has been 90% pruned and quantized, while retaining >99% accuracy of the dense baseline on the MNLI dataset. +```bash +deepsparse.benchmark \ + zoo:nlp/text_classification/obert-base/pytorch/huggingface/mnli/pruned90_quant-none \ + -b 64 -s sync -nstreams 1 \ + -e deepsparse + +> Original Model Path: zoo:nlp/text_classification/obert-base/pytorch/huggingface/mnli/pruned90_quant-none +> Batch Size: 64 +> Scenario: sync +> Throughput (items/sec): 124.0120 +``` +DeepSparse achieves 124 items/second, an 6.5x speed-up over ONNX Runtime! + +## DeepSparse Engine +Engine is the lowest-level API for interacting with DeepSparse. As much as possible, we recommended you use the Pipeline API but Engine is available as needed if you want to handle pre- or post-processing yourself. + +With Engine, we can compile an ONNX file and run inference on raw tensors. + +Here's an example, using a 90% pruned-quantized oBERT trained on MNLI from SparseZoo: +```python +from deepsparse import Engine +from deepsparse.utils import generate_random_inputs, model_to_path +import numpy as np + +# download onnx from sparsezoo and compile with batchsize 1 +sparsezoo_stub = "zoo:nlp/text_classification/obert-base/pytorch/huggingface/mnli/pruned90_quant-none" +batch_size = 1 +compiled_model = Engine( + model=sparsezoo_stub, # sparsezoo stub or path to local ONNX + batch_size=batch_size # defaults to batch size 1 +) + +# input is raw numpy tensors, output is raw scores for classes +inputs = generate_random_inputs(model_to_path(sparsezoo_stub), batch_size) +output = compiled_model(inputs) +print(output) +# [array([[-0.9264987, -1.6990623, 2.3935342]], dtype=float32)] + +``` +## DeepSparse Pipelines +Pipeline is the default interface for interacting with DeepSparse. + +Like Hugging Face Pipelines, DeepSparse Pipelines wrap pre- and post-processing around the inference performed by the Engine. 
This creates a clean API that allows you to pass raw text and images to DeepSparse and receive the post-processed predictions, making it easy to add DeepSparse to your application. + +We will use the `Pipeline.create()` constructor to create an instance of a text classification Pipeline with a 90% pruned-quantized version of oBERT. We can then pass raw text to the `Pipeline` and receive the predictions. All of the pre-processing (such as tokenizing the input) is handled by the `Pipeline`. + +The Text Classification Pipeline can handle multi-input and single-input as well as single-label and multi-label classification. + +#### Single-Input Single-Label Example (SST2) + +Here's an example with a single input and single label prediction with a model trained on SST2: + +```python +from deepsparse import Pipeline + +# download onnx from sparsezoo and compile with batch size 1 +sparsezoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none" +pipeline = Pipeline.create( + task="text-classification", + model_path=sparsezoo_stub, # sparsezoo stub or path to local ONNX + batch_size=1 # default batch size is 1 +) + +# run inference +sequences = ["I think DeepSparse Pipelines are awesome!"] +prediction = pipeline(sequences) +print(prediction) +# labels=['positive'] scores=[0.9986492991447449] + +``` + +#### Multi-Input Single-Label Example (MNLI) + +Here's an example with a single input and single label prediction with a model trained on MNLI: + +```python +from deepsparse import Pipeline + +# download onnx from sparsezoo and compile with batch size 1 +sparsezoo_stub = "zoo:nlp/text_classification/obert-base/pytorch/huggingface/mnli/pruned90_quant-none" +pipeline = Pipeline.create( + task="text-classification", + model_path=sparsezoo_stub, # sparsezoo stub or path to local ONNX + batch_size=1 # default batch size is 1 +) + +# run inference +sequences = [[ + "The text classification pipeline is fast and easy to use!", + "The pipeline for text classification makes it simple to get started" +]] +prediction = pipeline(sequences) +print(prediction) + +# labels=['entailment'] scores=[0.6885718107223511] +``` + +#### Multi-Input Single-Label Example (QQP) + +Here's an example with a single input and single label prediction with a model trained on QQP: + +```python +from deepsparse import Pipeline + +# download onnx from sparsezoo and compile with batch size 1 +sparsezoo_stub = "zoo:nlp/text_classification/obert-base/pytorch/huggingface/qqp/pruned90_quant-none" +pipeline = Pipeline.create( + task="text-classification", + model_path=sparsezoo_stub, # sparsezoo stub or path to local ONNX + batch_size=1 # default batch size is 1 +) + +# run inference +sequences = [[ + "Which is the best gaming laptop under 40k?", + "Which is the best gaming laptop under 40,000 rs?", +]] +prediction = pipeline(sequences) +print(prediction) + +# labels=['duplicate'] scores=[0.9978139996528625] +``` + +#### Single-Input Multi-Label Example (GoEmotions) + +Here's an example with a single input and multi label prediction with a model trained on GoEmotions: + +```python +from deepsparse import Pipeline + +# download onnx from sparsezoo and compile with batch size 1 +sparsezoo_stub = "zoo:nlp/multilabel_text_classification/obert-base/pytorch/huggingface/goemotions/pruned90_quant-none" +pipeline = Pipeline.create( + task="text-classification", + model_path=sparsezoo_stub, # sparsezoo stub or path to local ONNX + batch_size=1 # default batch size is 1 +) + +# run inference +prediction = pipeline(["I 
am so glad you came today"])
+print(prediction)
+
+# labels=['joy'] scores=[0.9472543597221375]
+```
+
+### Use Case Specific Arguments
+The Text Classification Pipeline contains additional arguments for configuring a `Pipeline`.
+
+#### Sequence Length
+The `sequence_length` argument adjusts the ONNX graph to handle a specific sequence length. In the DeepSparse Pipelines, the tokenizers pad the input. As such, using shorter sequence lengths will have better performance. The default sequence length for text classification is 128.
+
+The example below runs document classification using a model trained on IMDB at sequence length 512.
+
+```python
+from deepsparse import Pipeline
+
+# download onnx from sparsezoo and compile with batch size 1
+sparsezoo_stub = "zoo:nlp/document_classification/obert-base/pytorch/huggingface/imdb/pruned90_quant-none"
+pipeline = Pipeline.create(
+    task="text_classification",
+    model_path=sparsezoo_stub,
+    sequence_length=512,
+    batch_size=1,
+)
+
+# run inference
+sequences = [[
+    "I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered 'controversial' I really had to see this for myself.

The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.

What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, even then it's not shot like some cheaply made porno. While my countrymen mind find it shocking, in reality sex and nudity are a major staple in Swedish cinema. Even Ingmar Bergman, arguably their answer to good old boy John Ford, had sex scenes in his films.

I do commend the filmmakers for the fact that any sex shown in the film is shown for artistic purposes rather than just to shock people and make money to be shown in pornographic theaters in America. I AM CURIOUS-YELLOW is a good film for anyone wanting to study the meat and potatoes (no pun intended) of Swedish cinema. But really, this film doesn't have much of a plot."
+]]
+prediction = pipeline(sequences)
+print(prediction)
+# labels=['0'] scores=[0.9984526634216309] (negative prediction)
+```
+
+Alternatively, you can pass a list of sequence lengths, creating a "bucketable" pipeline. Under the hood, the DeepSparse Pipeline will compile multiple versions of the model (utilizing a shared scheduler) and direct inputs towards the smallest bucket into which an input fits.
+
+The example below creates a bucket for smaller input lengths (32 tokens) and for larger input lengths (128 tokens).
+```python
+from deepsparse import Pipeline, Context
+
+# download onnx from sparsezoo and compile with batch size 1
+sparsezoo_stub = "zoo:nlp/text_classification/obert-base/pytorch/huggingface/mnli/pruned90_quant-none"
+pipeline = Pipeline.create(
+    task="text-classification",
+    model_path=sparsezoo_stub,
+    batch_size=1,
+    sequence_length=[32, 128],
+    context=Context(num_streams=1)
+)
+# run inference
+prediction = pipeline([[
+    "Timely access to information is in the best interests of the agencies",
+    "It is everyone's best interest to get info in a timely manner",
+]])
+print(prediction)
+
+# run inference
+prediction = pipeline([[
+    "Timely access to information is in the best interests of both GAO and the agencies. Let's make information more accessible",
+    "It is in everyone's best interest to have access to information in a timely manner. Information should be made more accessible.",
+]])
+print(prediction)
+
+# labels=['entailment'] scores=[0.9688315987586975]
+# labels=['entailment'] scores=[0.985545814037323]
+```
+
+#### Return All Scores
+The `return_all_scores` argument allows you to specify whether to return the prediction as the `argmax` of class predictions or to return all scores as a list for each result in the batch.
+ +Here is an example with batch size 1 and batch size 2: +```python +from deepsparse import Pipeline + +# download onnx from sparsezoo and compile with batch size 1 +sparsezoo_stub = "zoo:nlp/text_classification/obert-base/pytorch/huggingface/mnli/pruned90_quant-none" +pipeline_b1 = Pipeline.create( + task="text-classification", + model_path=sparsezoo_stub, # sparsezoo stub or path to local ONNX + batch_size=1, # default batch size is 1 + return_all_scores=True # default is false +) + +# download onnx from sparsezoo and compile with batch size 2 +pipeline_b2 = Pipeline.create( + task="text-classification", + model_path=sparsezoo_stub, # sparsezoo stub or path to local ONNX + batch_size=2, # default batch size is 1 + return_all_scores=True # default is false +) + +# run inference with b1 +sequences_b1 = [[ + "Timely access to information is in the best interests of both GAO and the agencies", + "It is in everyone's best interest to have access to information in a timely manner", + ]] +prediction_b1 = pipeline_b1(sequences_b1) +print(prediction_b1) + +# run inference with b2 +sequences_b2 = sequences_b1 * 2 +prediction_b2 = pipeline_b2(sequences_b2) +print(prediction_b2) +# labels=[['entailment', 'neutral', 'contradiction']] scores=[[0.9688315987586975, 0.030656637623906136, 0.0005117706023156643]] +# labels=[['entailment', 'neutral', 'contradiction'], ['entailment', 'neutral', 'contradiction']] scores=[[0.9688315987586975, 0.030656637623906136, 0.0005117706023156643], [0.9688315987586975, 0.030656637623906136, 0.0005117706023156643]] +``` + +### Cross Use Case Functionality +Check out the [Pipeline User Guide](../../user-guide/deepsparse-pipelines.md) for more details on configuring a Pipeline. + +## DeepSparse Server +Built on the popular FastAPI and Uvicorn stack, DeepSparse Server enables you to set up a REST endpoint for serving inferences over HTTP. DeepSparse Server wraps the Pipeline API, so it inherits all the utilities provided by Pipelines. 
+
+#### Single Input Usage
+
+The CLI command below launches a single-input text classification pipeline with a 90% pruned-quantized oBERT model trained on SST2:
+
+```bash
+deepsparse.server \
+  --task text-classification \
+  --model_path "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none" # or path/to/onnx
+```
+
+Making a request:
+
+```python
+import requests
+
+# Uvicorn is running on this port
+url = 'http://0.0.0.0:5543/predict'
+
+# send the data
+obj = {"sequences": "Sending requests to DeepSparse Server is fast and easy!"}
+resp = requests.post(url=url, json=obj)
+
+# receive the post-processed output
+print(resp.text)
+# {"labels":["positive"],"scores":[0.9330279231071472]}
+```
+
+#### Multi-Input Usage
+
+The CLI command below launches a multi-input text classification pipeline with a 90% pruned-quantized oBERT model trained on MNLI:
+
+```bash
+deepsparse.server \
+  --task text-classification \
+  --model_path "zoo:nlp/text_classification/obert-base/pytorch/huggingface/mnli/pruned90_quant-none" # or path/to/onnx
+```
+
+Making a request:
+```python
+import requests
+
+# Uvicorn is running on this port
+url = 'http://0.0.0.0:5543/predict'
+
+# send the data
+obj = {
+    "sequences": [[
+        "The text classification pipeline is fast and easy to use!",
+        "The pipeline for text classification makes it simple to get started"
+    ]]}
+resp = requests.post(url=url, json=obj)
+
+# receive the post-processed output
+print(resp.text)
+# {"labels":["entailment"],"scores":[0.6885718107223511]}
+```
+
+### Use Case Specific Arguments
+To use the `sequence_length` and `return_all_scores` arguments, create a Server configuration file for passing the arguments via kwargs.
+
+This configuration file sets sequence length to 64 and returns all scores:
+```yaml
+# text-classification-config.yaml
+endpoints:
+  - task: text-classification
+    model: zoo:nlp/document_classification/obert-base/pytorch/huggingface/imdb/pruned90_quant-none
+    kwargs:
+      sequence_length: 64       # uses sequence length 64
+      return_all_scores: True   # returns all scores
+```
+
+Spin up the server:
+```bash
+deepsparse.server --config-file text-classification-config.yaml
+```
+
+Making a request:
+```python
+import requests
+
+# Uvicorn is running on this port
+url = 'http://0.0.0.0:5543/predict'
+
+# send the data
+obj = {"sequences": "Sending requests to DeepSparse Server is fast and easy!"}
+resp = requests.post(url=url, json=obj)
+
+# receive the post-processed output
+print(resp.text)
+# {"labels":[["1","0"]],"scores":[[0.9941965341567993,0.005803497973829508]]}
+```
+
+## Using a Custom ONNX File
+Apart from using models from the SparseZoo, DeepSparse allows you to deploy text classification pipelines with custom ONNX files.
+
+The first step is to obtain the ONNX model. You can obtain the file by converting your model to ONNX after training.
+Click Download on the [BERT base uncased page](https://sparsezoo.neuralmagic.com/models/nlp%2Ftext_classification%2Fbert-base%2Fpytorch%2Fhuggingface%2Fsst2%2Fbase-none)
+to download an ONNX BERT base uncased model for demonstration.
+ +Extract the downloaded file and create a folder containing the following required files: +- `config.json` +- `tokenizer.json` +- `model.onnx` + +Use the folder as the model path to the text classification pipeline: +```python +from deepsparse import Pipeline + +from sparsezoo import Model + +# download onnx from sparsezoo and compile with batch size 1 +pipeline = Pipeline.create( + task="text-classification", + model_path="text-classification", # sparsezoo stub or path to local ONNX + batch_size=1 # default batch size is 1 +) + +# run inference +sequences = ["I think DeepSparse Pipelines are awesome!"] +prediction = pipeline(sequences) +print(prediction) +# labels=['LABEL_1'] scores=[0.9996163845062256] +``` +### Cross Use Case Functionality + +Check out the [Server User Guide](../../user-guide/deepsparse-server.md) for more details on configuring the Server. diff --git a/docs/use-cases/nlp/token-classification.md b/docs/use-cases/nlp/token-classification.md new file mode 100644 index 0000000000..a6a76c572a --- /dev/null +++ b/docs/use-cases/nlp/token-classification.md @@ -0,0 +1,297 @@ + + +# Deploying Token Classification Models with DeepSparse + +This page explains how to benchmark and deploy a token classification model with DeepSparse. + +There are three interfaces for interacting with DeepSparse: +- **Engine** is the lowest-level API that enables you to compile a model and run inference on raw input tensors. + +- **Pipeline** is the default DeepSparse API. Similar to Hugging Face Pipelines, it wraps Engine with pre-processing +and post-processing steps, allowing you to make requests on raw data and receive post-processed predictions. + +- **Server** is a REST API wrapper around Pipelines built on [FastAPI](https://fastapi.tiangolo.com/) and [Uvicorn](https://www.uvicorn.org/). It enables you to start a model serving +endpoint running DeepSparse with a single CLI. + +## Installation Requirements + +This use case requires the installation of [DeepSparse Server](../../user-guide/installation.md). + +Confirm your machine is compatible with our [hardware requirements](../../user-guide/hardware-support.md). + +## Benchmarking + +We can use the benchmarking utility to demonstrate the DeepSparse's performance. We ran the numbers below on a 4 core AWS `c6i.2xlarge` instance. + +### ONNX Runtime Baseline +As a baseline, let's check out ONNX Runtime's performance on BERT. Make sure you have ORT installed (`pip install onnxruntime`). + +```bash +deepsparse.benchmark \ + zoo:nlp/token_classification/obert-base/pytorch/huggingface/conll2003/base-none \ + -b 64 -s sync -nstreams 1 -i [64,128] \ + -e onnxruntime + +> Original Model Path: zoo:nlp/token_classification/obert-base/pytorch/huggingface/conll2003/base-none +> Batch Size: 64 +> Scenario: sync +> Throughput (items/sec): 19.96 +``` +ONNX Runtime achieves 20 items/second with batch 64 and sequence length 128. + +## DeepSparse Speedup + +Now, let's run DeepSparse on an inference-optimized sparse version of BERT. This model has been 90% pruned and quantized, while retaining >99% accuracy of the dense baseline on the conll dataset. 
+
+```bash
+deepsparse.benchmark \
+  zoo:nlp/token_classification/obert-base/pytorch/huggingface/conll2003/pruned90_quant-none \
+  -b 64 -s sync -nstreams 1 -i [64,128] \
+  -e deepsparse
+
+> Original Model Path: zoo:nlp/token_classification/obert-base/pytorch/huggingface/conll2003/pruned90_quant-none
+> Batch Size: 64
+> Scenario: sync
+> Throughput (items/sec): 126.5129
+```
+
+DeepSparse achieves 127 items/second, a 6.4x speed-up over ONNX Runtime!
+
+## DeepSparse Engine
+
+Engine is the lowest-level API for interacting with DeepSparse. As much as possible, we recommend using the Pipeline API, but Engine is available if you want to handle pre- or post-processing yourself.
+
+With Engine, we can compile an ONNX file and run inference on raw tensors.
+
+Here's an example, using a 90% pruned-quantized BERT trained on conll2003 from SparseZoo:
+```python
+from deepsparse import Engine
+from deepsparse.utils import generate_random_inputs, model_to_path
+import numpy as np
+
+# download onnx from sparsezoo and compile with batch size 1
+sparsezoo_stub = "zoo:nlp/token_classification/obert-base/pytorch/huggingface/conll2003/pruned90_quant-none"
+batch_size = 1
+compiled_model = Engine(
+    model=sparsezoo_stub,   # sparsezoo stub or path to local ONNX
+    batch_size=batch_size   # defaults to batch size 1
+)
+
+# input is raw numpy tensors, output is raw scores for classes
+inputs = generate_random_inputs(model_to_path(sparsezoo_stub), batch_size)
+output = compiled_model(inputs)
+print(output)
+# [array([[[ 2.0983224 ,  1.2409506 , -1.7314302 , ..., -0.07210742,
+# ...
+#   -2.0502508 , -2.956191  ]]], dtype=float32)]
+```
+
+## DeepSparse Pipelines
+Pipeline is the default interface for interacting with DeepSparse.
+
+Like Hugging Face Pipelines, DeepSparse Pipelines wrap pre- and post-processing around the inference performed by the Engine. This creates a clean API that allows you to pass raw text and images to DeepSparse and receive the post-processed predictions, making it easy to add DeepSparse to your application.
+
+We will use the `Pipeline.create()` constructor to create an instance of a token classification Pipeline with a 90% pruned-quantized version of BERT trained on conll2003. We can then pass raw text to the `Pipeline` and receive the predictions. All of the pre-processing (such as tokenizing the input) is handled by the `Pipeline`.
+
+```python
+from deepsparse import Pipeline
+
+model_path = "zoo:nlp/token_classification/obert-base/pytorch/huggingface/conll2003/pruned90_quant-none"
+pipeline = Pipeline.create(
+    task="token_classification",
+    model_path=model_path,
+)
+output = pipeline("Mary is flying from Nairobi to New York")
+print(output.predictions)
+# [[TokenClassificationResult(entity='B-PER', score=0.9971914291381836, word='mary', start=0, end=4, index=1, is_grouped=False),
+# TokenClassificationResult(entity='B-LOC', score=0.9993892312049866, word='nairobi', start=20, end=27, index=5, is_grouped=False),
+# TokenClassificationResult(entity='B-LOC', score=0.9993736147880554, word='new', start=31, end=34, index=7, is_grouped=False),
+# TokenClassificationResult(entity='I-LOC', score=0.997299075126648, word='york', start=35, end=39, index=8, is_grouped=False)]]
+```
+
+### Use Case Specific Arguments
+The Token Classification Pipeline contains additional arguments for configuring a `Pipeline`.
+
+#### Sequence Length
+The `sequence_length` argument adjusts the ONNX graph to handle a specific sequence length. In the DeepSparse Pipelines, the tokenizers pad the input. As such, using shorter sequence lengths will have better performance.
+
+The example below compiles the model and runs inference with a sequence length of 64.
+```python
+from deepsparse import Pipeline
+
+model_path = "zoo:nlp/token_classification/obert-base/pytorch/huggingface/conll2003/pruned90_quant-none"
+pipeline = Pipeline.create(
+    task="token_classification",
+    model_path=model_path,
+    sequence_length=64
+)
+output = pipeline("Mary is flying from Nairobi to New York")
+print(output.predictions)
+# [[TokenClassificationResult(entity='B-PER', score=0.9971914291381836, word='mary', start=0, end=4, index=1, is_grouped=False),
+# TokenClassificationResult(entity='B-LOC', score=0.9993892312049866, word='nairobi', start=20, end=27, index=5, is_grouped=False),
+# TokenClassificationResult(entity='B-LOC', score=0.9993736147880554, word='new', start=31, end=34, index=7, is_grouped=False),
+# TokenClassificationResult(entity='I-LOC', score=0.997299075126648, word='york', start=35, end=39, index=8, is_grouped=False)]]
+```
+
+Alternatively, you can pass a list of sequence lengths, creating a "bucketable" pipeline. Under the hood, the DeepSparse Pipeline will compile multiple versions of the model (utilizing a shared scheduler) and direct inputs towards the smallest bucket into which it fits.
+
+The example below creates a bucket for smaller input lengths (64 tokens) and for larger input lengths (128 tokens).
+```python
+from deepsparse import Pipeline, Context
+
+model_path = "zoo:nlp/token_classification/obert-base/pytorch/huggingface/conll2003/pruned90_quant-none"
+pipeline = Pipeline.create(
+    task="token_classification",
+    model_path=model_path,
+    sequence_length=[64, 128],
+    context=Context(num_streams=1)
+)
+output = pipeline("Mary is flying from Nairobi to New York to attend a conference")
+print(output.predictions)
+# [[TokenClassificationResult(entity='B-PER', score=0.9971914291381836, word='mary', start=0, end=4, index=1, is_grouped=False),
+# TokenClassificationResult(entity='B-LOC', score=0.9993892312049866, word='nairobi', start=20, end=27, index=5, is_grouped=False),
+# TokenClassificationResult(entity='B-LOC', score=0.9993736147880554, word='new', start=31, end=34, index=7, is_grouped=False),
+# TokenClassificationResult(entity='I-LOC', score=0.997299075126648, word='york', start=35, end=39, index=8, is_grouped=False)]]
+```
+
+#### Aggregation Strategy
+
+`aggregation_strategy` specifies how to aggregate tokens in post-processing in the case where a single word is split into multiple tokens by the tokenizer. The default is `none`, which means that no aggregation is performed.
+
+Here is an example using the `simple` aggregation strategy.
+ +```python +from deepsparse import Pipeline +model_path = "zoo:nlp/token_classification/obert-base/pytorch/huggingface/conll2003/pruned90_quant-none" +pipeline = Pipeline.create( + task="token_classification", + model_path=model_path, + aggregation_strategy="simple" +) + +output = pipeline("The Uzbekistani striker scored a goal in the final minute to defeat the Italian national team") +print(output.predictions) + +# [[TokenClassificationResult(entity='MISC', score=0.9935868382453918, word='uzbekistani', start=4, end=15, index=None, is_grouped=True), +# TokenClassificationResult(entity='MISC', score=0.9991180896759033, word='italian', start=72, end=79, index=None, is_grouped=True)]] +``` + +In comparison, here is the standard output with no aggregation: + +```python +from deepsparse import Pipeline +model_path = "zoo:nlp/token_classification/obert-base/pytorch/huggingface/conll2003/pruned90_quant-none" +pipeline = Pipeline.create( + task="token_classification", + model_path=model_path, + aggregation_strategy="none" +) + +output = pipeline("The Uzbekistani striker scored a goal in the final minute to defeat the Italian national team") +print(output.predictions) + +# [[[TokenClassificationResult(entity='B-MISC', score=0.9973152279853821, word='uzbekistan', start=4, end=14, index=2, is_grouped=False), +# TokenClassificationResult(entity='I-MISC', score=0.9898584485054016, word='##i', start=14, end=15, index=3, is_grouped=False), +# TokenClassificationResult(entity='B-MISC', score=0.9991180896759033, word='italian', start=72, end=79, index=15, is_grouped=False)]] +``` + +### Cross Use Case Functionality +Check out the [Pipeline User Guide](../../user-guide/deepsparse-pipelines.md) for more details on configuring a Pipeline. + +## DeepSparse Server + +DeepSparse Server is built on top of FastAPI and Uvicorn, enabling you to set up a REST endpoint for serving inferences over HTTP. Since DeepSparse Server wraps the Pipeline API, it inherits all the utilities provided by Pipelines. + +The CLI command below launches a token classification pipeline with a 90% pruned-quantized BERT model trained on Conll2003: + +```bash +deepsparse.server \ + --task token_classification \ + --model_path "zoo:nlp/token_classification/obert-base/pytorch/huggingface/conll2003/pruned90_quant-none" # or path/to/onnx +``` +You should see Uvicorn report that it is running on http://0.0.0.0:5543. Once launched, a /docs path is created with full endpoint descriptions and support for making sample requests. + +Here is an example client request, using the Python requests library for formatting the HTTP: +```python +import requests + +# Uvicorn is running on this port +url = 'http://0.0.0.0:5543/predict' +# send the data +obj = {"inputs": "Mary is flying from Nairobi to New York to attend a conference"} +resp = requests.post(url=url, json=obj) +# receive the post-processed output +print(resp.text) +# {"predictions":[[{"entity":"B-PER","score":0.9966245293617249,"word":"mary","start":0,"end":4,"index":1,"is_grouped":false},{"entity":"B-LOC","score":0.999544084072113,"word":"nairobi","start":20,"end":27,"index":5,"is_grouped":false},{"entity":"B-LOC","score":0.9993794560432434,"word":"new","start":31,"end":34,"index":7,"is_grouped":false},{"entity":"I-LOC","score":0.9970214366912842,"word":"york","start":35,"end":39,"index":8,"is_grouped":false}]]} +``` + +#### Use Case Specific Arguments +To use the `sequence_length` and `aggregation_strategy` arguments, create a server configuration file for passing the arguments via `kwargs`. 
+ +This configuration file sets sequence length to 64 with `simple` aggregation strategy: +```yaml +# ner-config.yaml +endpoints: + - task: token_classification + model: zoo:nlp/token_classification/obert-base/pytorch/huggingface/conll2003/pruned90_quant-none + kwargs: + sequence_length: 64 # uses sequence length 64 + aggregation_strategy: simple +``` +Spin up the server: + +```bash +deepsparse.server \ + --config-file ner-config.yaml +``` +Making a request: +```python +import requests + +# Uvicorn is running on this port +url = 'http://0.0.0.0:5543/predict' + +# send the data +obj = {"inputs": "Mary is flying from Nairobi to New York to attend a conference",} +resp = requests.post(url=url, json=obj) + +# recieve the post-processed output +print(resp.text) +# {"predictions":[[{"entity":"PER","score":0.9966245293617249,"word":"mary","start":0,"end":4,"index":null,"is_grouped":true},{"entity":"LOC","score":0.999544084072113,"word":"nairobi","start":20,"end":27,"index":null,"is_grouped":true},{"entity":"LOC","score":0.9982004165649414,"word":"new york","start":31,"end":39,"index":null,"is_grouped":true}]]} +``` +## Using a Custom ONNX File +Apart from using models from the SparseZoo, DeepSparse allows you to deploy token classification pipelines with custom ONNX files. + +The first step is to obtain the ONNX model. You can obtain the file by converting your model to ONNX after training. +Click Download on the [oBERT page](https://sparsezoo.neuralmagic.com/models/nlp%2Ftoken_classification%2Fobert-base%2Fpytorch%2Fhuggingface%2Fconll2003%2Fpruned90_quant-none) +to download an ONNX oBERT model for demonstration. + +Extract the downloaded file and create a folder containing the following required files: +- `config.json` +- `tokenizer.json` +- `model.onnx` + +Use the folder as the model path to the token classification pipeline: +```python +from deepsparse import Pipeline +pipeline = Pipeline.create( + task="token_classification", + model_path="token_classification", + ) +output = pipeline("Mary is flying from Nairobi to New York") +print(output.predictions) +# [[TokenClassificationResult(entity='B-PER', score=0.9971914291381836, word='mary', start=0, end=4, index=1, is_grouped=False), TokenClassificationResult(entity='B-LOC', score=0.9993892312049866, word='nairobi', start=20, end=27, index=5, is_grouped=False), TokenClassificationResult(entity='B-LOC', score=0.9993736147880554, word='new', start=31, end=34, index=7, is_grouped=False), TokenClassificationResult(entity='I-LOC', score=0.997299075126648, word='york', start=35, end=39, index=8, is_grouped=False)]] +``` +### Cross Use Case Functionality + +Check out the [Server User Guide](../../user-guide/deepsparse-server.md) for more details on configuring the Server. diff --git a/docs/use-cases/nlp/transformers-embedding-extraction.md b/docs/use-cases/nlp/transformers-embedding-extraction.md new file mode 100644 index 0000000000..f0cd974859 --- /dev/null +++ b/docs/use-cases/nlp/transformers-embedding-extraction.md @@ -0,0 +1,238 @@ + + +# Deploying Transformers Embedding Extraction Models with DeepSparse + +This page explains how to deploy a transformers embedding extraction Pipeline with DeepSparse. + +There are three interfaces for interacting with DeepSparse: +- **Engine** is the lowest-level API that enables you to compile a model and run inference on raw input tensors. + +- **Pipeline** is the default DeepSparse API. 
Similar to Hugging Face Pipelines, it wraps Engine with pre-processing and post-processing steps, allowing you to make requests on raw data and receive post-processed predictions.
+
+- **Server** is a REST API wrapper around Pipelines built on [FastAPI](https://fastapi.tiangolo.com/) and [Uvicorn](https://www.uvicorn.org/). It enables you to start a model serving
+endpoint running DeepSparse with a single CLI.
+
+For the embedding extraction case, we will walk through an example of Pipeline and Server.
+
+## Installation Requirements
+
+This use case requires the installation of [DeepSparse Server](../../user-guide/installation.md).
+
+Confirm your machine is compatible with our [hardware requirements](../../user-guide/hardware-support.md).
+
+## DeepSparse Pipelines
+
+Pipeline is the default interface for interacting with DeepSparse.
+
+Like Hugging Face Pipelines, DeepSparse Pipelines wrap pre- and post-processing around the inference performed by the Engine. This creates a clean API that allows you to pass raw text and images to DeepSparse and receive the post-processed predictions, making it easy to add DeepSparse to your application.
+
+We will use the `Pipeline.create()` constructor to create an instance of an embedding extraction Pipeline with an 80% pruned-quantized version of BERT trained on `wikipedia_bookcorpus`. We can then pass raw text to the `Pipeline` and receive the predictions. All of the pre-processing (such as tokenizing the input) is handled by the `Pipeline`.
+
+With Transformers, you can use `task=transformers_embedding_extraction` for some extra utilities associated with embedding extraction.
+
+The first utility is automatic embedding layer detection. If you set `emb_extraction_layer=-1` (the default), the Pipeline automatically detects the final transformer layer before the projection head and removes the projection head for you.
+
+The second utility is automatic dimensionality reduction. You can use the `extraction_strategy` argument to perform a reduction over the sequence dimension rather than returning an embedding for each token. The options are:
+
+- `per_token`: returns the embedding for each token in the sequence (default)
+- `reduce_mean`: returns the average of the token embeddings
+- `reduce_max`: returns the max of the token embeddings
+- `cls_token`: returns the embedding of the CLS token
+
+An example using automatic embedding layer detection looks like this:
+
+```python
+from deepsparse import Pipeline
+
+bert_emb_pipeline = Pipeline.create(
+    task="transformers_embedding_extraction",
+    model_path="zoo:nlp/masked_language_modeling/obert-base/pytorch/huggingface/wikipedia_bookcorpus/pruned80_quant-none-vnni",
+#   emb_extraction_layer=-1,         # (default: detect last layer)
+#   extraction_strategy="per_token"  # (default: concat embedding for each token)
+)
+
+input_sequence = "The generalized embedding extraction Pipeline is the best!"
+embedding = bert_emb_pipeline(input_sequence) +print(len(embedding.embeddings[0])) +# 98304 << = 768*128 = hidden_dim * sequence_length>> +``` + +An example returning the average embeddings of the tokens looks like this: +```python +from deepsparse import Pipeline + +bert_emb_pipeline = Pipeline.create( + task="transformers_embedding_extraction", + model_path="zoo:nlp/masked_language_modeling/obert-base/pytorch/huggingface/wikipedia_bookcorpus/pruned80_quant-none-vnni", +# emb_extraction_layer=-1, # (default: detect last layer) + extraction_strategy="reduce_mean" +) + +input_sequence = "The generalized embedding extraction Pipeline is the best!" +embedding = bert_emb_pipeline(input_sequence) +print(len(embedding.embeddings[0])) +# 768 <<=hidden dim>> +``` + +### Use Case Specific Arguments +The Transformers Embedding Extraction Pipeline contains additional arguments for configuring a `Pipeline`. + +#### Sequence Length +The `sequence_length` argument adjusts the ONNX graph to handle a specific sequence length. In the DeepSparse Pipelines, the tokenizers pad the input. As such, using shorter sequence lengths will have better performance. + +The example below compiles the model and runs inference with sequence length of 64. +```python +from deepsparse import Pipeline + +bert_emb_pipeline = Pipeline.create( + task="transformers_embedding_extraction", + model_path="zoo:nlp/masked_language_modeling/obert-base/pytorch/huggingface/wikipedia_bookcorpus/pruned80_quant-none-vnni", +# emb_extraction_layer=-1, # (default: detect last layer) + extraction_strategy="reduce_mean", + sequence_length = 64 +) + +input_sequence = "The transformers embedding extraction Pipeline is the best!" +embedding = bert_emb_pipeline(input_sequence) +print(len(embedding.embeddings[0])) +# 768 +``` + +Alternatively, you can pass a list of sequence lengths, creating a "bucketable" pipeline. Under the hood, the DeepSparse Pipeline will compile multiple versions of the engine (utilizing a shared scheduler) and direct inputs towards the smallest bucket into which it fits. + +The example below creates a bucket for smaller input lengths (16 tokens) and for larger input lengths (128 tokens). + +```python +from deepsparse import Pipeline + +bert_emb_pipeline = Pipeline.create( + task="transformers_embedding_extraction", + model_path="zoo:nlp/masked_language_modeling/obert-base/pytorch/huggingface/wikipedia_bookcorpus/pruned80_quant-none-vnni", +# emb_extraction_layer=-1, # (default: detect last layer) + extraction_strategy="reduce_mean", + sequence_length = [16, 128] +) + +input_sequence = "The transformers embedding extraction Pipeline is the best!" +embedding = bert_emb_pipeline(input_sequence) +print(len(embedding.embeddings[0])) +# 768 +``` +### Cross Use Case Functionality +Check out the [Pipeline User Guide](../../user-guide/deepsparse-pipelines.md) for more details on configuring a Pipeline. + +## DeepSparse Server +Built on the popular FastAPI and Uvicorn stack, DeepSparse Server enables you to set-up a REST endpoint for serving inferences over HTTP. Since DeepSparse Server wraps the Pipeline API, it inherits all of the utilities provided by Pipelines. 
+
+The CLI command below launches an embedding extraction pipeline with an 80% pruned-quantized BERT model identified by its SparseZoo stub:
+
+```bash
+deepsparse.server \
+  --task transformers_embedding_extraction \
+  --model_path "zoo:nlp/masked_language_modeling/obert-base/pytorch/huggingface/wikipedia_bookcorpus/pruned80_quant-none-vnni" # or path/to/onnx
+```
+
+You should see Uvicorn report that it is running on `http://0.0.0.0:5543`. Once launched, a `/docs` path is
+created with full endpoint descriptions and support for making sample requests.
+
+Here is an example client request, using the Python `requests` library for formatting the HTTP:
+```python
+import requests
+
+# Uvicorn is running on this port
+url = 'http://0.0.0.0:5543/predict'
+
+# send the data
+obj = {"inputs": "The transformers embedding extraction Pipeline is the best!"}
+resp = requests.post(url=url, json=obj)
+
+# receive the post-processed output
+print(resp.text)
+# >> {[[0.022315271198749542,0.02142658829689026, ... ,0.01950429379940033]]}
+```
+
+### Use Case Specific Arguments
+
+To use the `sequence_length` and `extraction_strategy` arguments, we can use a Server configuration file, passing the arguments via `kwargs`.
+
+This configuration file sets the sequence length to 64 and the extraction strategy to `reduce_mean`:
+
+```yaml
+# config.yaml
+endpoints:
+  - task: transformers_embedding_extraction
+    model: zoo:nlp/masked_language_modeling/obert-base/pytorch/huggingface/wikipedia_bookcorpus/pruned80_quant-none-vnni
+    kwargs:
+      sequence_length: 64               # uses sequence length 64
+      extraction_strategy: reduce_mean  # returns the average of the token embeddings
+```
+
+Spin up the server:
+```bash
+deepsparse.server --config-file config.yaml
+```
+Making requests:
+
+```python
+import requests, json
+
+# Uvicorn is running on this port
+url = 'http://0.0.0.0:5543/predict'
+
+# send the data
+obj = {"inputs": "The transformers embedding extraction Pipeline is the best!"}
+resp = requests.post(url=url, json=obj)
+
+# receive the post-processed output
+print(resp.text)
+# >> {[[0.022315271198749542,0.02142658829689026, ... ,0.01950429379940033]]}
+
+# parse the response and check the embedding size
+result = json.loads(resp.text)
+print(len(result["embeddings"][0]))
+# 768
+```
+## Using a Custom ONNX File
+Apart from using models from the SparseZoo, DeepSparse allows you to deploy transformer embedding extraction pipelines with custom ONNX files.
+
+The first step is to obtain the ONNX model. You can obtain the file by converting your model to ONNX after training.
+Click Download on the [DistilBERT - wikipedia_bookcorpus page](https://sparsezoo.neuralmagic.com/models/nlp%2Fmasked_language_modeling%2Fdistilbert-none%2Fpytorch%2Fhuggingface%2Fwikipedia_bookcorpus%2Fpruned80_quant-none-vnni)
+to download an ONNX DistilBERT model for demonstration.
+
+Extract the downloaded file and create a folder containing the following required files:
+- `config.json`
+- `tokenizer.json`
+- `model.onnx`
+
+Use the folder as the model path to the transformer embedding extraction pipeline:
+```python
+from deepsparse import Pipeline
+
+bert_emb_pipeline = Pipeline.create(
+    task="transformers_embedding_extraction",
+    model_path="transformers_embedding_extraction",
+    emb_extraction_layer=-1,         # (default: detect last layer)
+    extraction_strategy="per_token"  # (default: concat embedding for each token)
+)
+
+input_sequence = "The generalized embedding extraction Pipeline is the best!"
+embedding = bert_emb_pipeline(input_sequence) +print(len(embedding.embeddings[0])) +# 98304 +``` +### Cross Use Case Functionality + +Check out the [Server User Guide](../../user-guide/deepsparse-server.md) for more details on configuring the Server. diff --git a/docs/use-cases/nlp/zero-shot-text-classification.md b/docs/use-cases/nlp/zero-shot-text-classification.md new file mode 100644 index 0000000000..88ea6a94cb --- /dev/null +++ b/docs/use-cases/nlp/zero-shot-text-classification.md @@ -0,0 +1,283 @@ + + +# Deploying Zero Shot Text Classification Models + +This page explains how to benchmark and deploy a zero-shot text classification model with DeepSparse. + + +There are three interfaces for interacting with DeepSparse: +- **Engine** is the lowest-level API. It enables you to compile a model and run inference on raw input tensors. + +- **Pipeline** is the default DeepSparse API. Similar to Hugging Face Pipelines, it wraps Engine with pre-processing +and post-processing steps, allowing you to make requests on raw data and receive post-processed predictions. + +- **Server** is a REST API wrapper around Pipelines built on [FastAPI](https://fastapi.tiangolo.com/) and [Uvicorn](https://www.uvicorn.org/). It enables you to start a model serving +endpoint running DeepSparse with a single CLI. + +We will walk through an example of each. + +## Installation Requirements + +This use case requires the installation of [DeepSparse Server](../../user-guide/installation.md). + +Confirm your machine is compatible with our [hardware requirements](../../user-guide/hardware-support.md). + +## Task Overview + +Zero-shot text classification allows us to perform text classification over any set of potential labels without training a text classification model on those specific labels. + +We can accomplish this goal via two steps: +- Train a model to predict whether a given pair of sentences is an `entailment`, `neutral`, or `contradiction` (on a dataset like MNLI) +- For each sentence `S` and set of labels `L`, predict label `L_i` which has the highest entailment score between `S` and a hypothesis of the form `This text is related to {L_i}` as predicted by the model. + +## DeepSparse Pipelines + +Pipeline is the default interface for interacting with DeepSparse. + +Like Hugging Face Pipelines, DeepSparse Pipelines wrap pre- and post-processing around the inference performed by the Engine. This creates a clean API that allows you to pass raw text and images to DeepSparse and receive the post-processed predictions, making it easy to add DeepSparse to your application. + +We will use the `Pipeline.create()` constructor to create an instance of a zero-shot text classification Pipeline with a 90% pruned-quantized version of oBERT trained on MNLI. We can then pass raw text to the `Pipeline` and receive the predictions. All of the pre-processing (such as tokenizing the input and formatting the hypothesis) is handled by the `Pipeline`. 
+
+Here's an example with a single input and single label prediction with a model trained on MNLI:
+
+```python
+from deepsparse import Pipeline
+
+# download onnx from sparsezoo and compile with batch size 1
+sparsezoo_stub = "zoo:nlp/text_classification/obert-base/pytorch/huggingface/mnli/pruned90_quant-none"
+pipeline = Pipeline.create(
+    task="zero_shot_text_classification",
+    model_path=sparsezoo_stub,  # sparsezoo stub or path to local ONNX
+    batch_size=1,               # default batch size is 1
+    labels=["politics", "public health", "sports"]
+)
+
+# run inference
+prediction = pipeline("Who are you voting for in the upcoming election")
+print(prediction)
+
+# sequences='Who are you voting for in the upcoming election' labels=['politics', 'sports', 'public health'] scores=[0.5765101909637451, 0.23050746321678162, 0.19298239052295685]
+```
+
+We can also pass the labels at inference time:
+
+```python
+from deepsparse import Pipeline
+
+# download onnx from sparsezoo and compile with batch size 1
+sparsezoo_stub = "zoo:nlp/text_classification/obert-base/pytorch/huggingface/mnli/pruned90_quant-none"
+pipeline = Pipeline.create(
+    task="zero_shot_text_classification",
+    model_path=sparsezoo_stub,  # sparsezoo stub or path to local ONNX
+    batch_size=1,               # default batch size is 1
+)
+
+# run inference
+prediction = pipeline(
+    sequences="My favorite sports team is the Boston Red Sox",
+    labels=["sports", "politics", "public health"]
+)
+print(prediction)
+
+# sequences='My favorite sports team is the Boston Red Sox' labels=['sports', 'politics', 'public health'] scores=[0.9349604249000549, 0.048094600439071655, 0.016944952309131622]
+```
+
+### Use Case Specific Arguments
+The Zero-Shot Text Classification Pipeline contains additional arguments for configuring a `Pipeline`.
+
+#### Sequence Length
+The `sequence_length` argument adjusts the ONNX graph to handle a specific sequence length. In the DeepSparse Pipelines, the tokenizers pad the input. As such, using shorter sequence lengths will have better performance. The default sequence length for text classification is 128.
+
+The example below runs zero-shot text classification at sequence length 64.
+
+```python
+from deepsparse import Pipeline
+
+# download onnx from sparsezoo and compile with batch size 1
+sparsezoo_stub = "zoo:nlp/text_classification/obert-base/pytorch/huggingface/mnli/pruned90_quant-none"
+pipeline = Pipeline.create(
+    task="zero_shot_text_classification",
+    model_path=sparsezoo_stub,  # sparsezoo stub or path to local ONNX
+    batch_size=1,               # default batch size is 1
+    sequence_length=64
+)
+
+# run inference
+prediction = pipeline(
+    sequences="My favorite sports team is the Boston Red Sox",
+    labels=["sports", "politics", "public health"]
+)
+print(prediction)
+
+# sequences='My favorite sports team is the Boston Red Sox' labels=['sports', 'politics', 'public health'] scores=[0.9349604249000549, 0.048094600439071655, 0.016944952309131622]
+```
+
+Alternatively, you can pass a list of sequence lengths, creating a "bucketable" pipeline. Under the hood, the DeepSparse Pipeline will compile multiple versions of the model (utilizing a shared scheduler) and direct inputs towards the smallest bucket into which an input fits.
+
+The example below creates a bucket for smaller input lengths (32 tokens) and for larger input lengths (128 tokens).
+```python +from deepsparse import Pipeline, Context + +# download onnx from sparsezoo and compile with batch size 1 +sparsezoo_stub = "zoo:nlp/text_classification/obert-base/pytorch/huggingface/mnli/pruned90_quant-none" +pipeline = Pipeline.create( + task="zero_shot_text_classification", + model_path=sparsezoo_stub, # sparsezoo stub or path to local ONNX + batch_size=1, # default batch size is 1 + sequence_length=[32,128], + context=Context(num_streams=1) +) + +# run inference +prediction = pipeline( + sequences="My favorite sports team is the Boston Red Sox", + labels=["sports", "politics", "public health"] +) +print(prediction) + +# sequences='My favorite sports team is the Boston Red Sox' labels=['sports', 'politics', 'public health'] scores=[0.9349604249000549, 0.048094600439071655, 0.016944952309131622] +``` + +### Model Config + +Additionally, we can pass a `model_config` to specify the form of the hypothesis passed to DeepSparse as part of the zero shot text classification scheme. + +For instance, rather than running the comparison with `"This text is related to {}"`, we can instead use `"This text is similiar to {}"` with the following: + +```python +from deepsparse import Pipeline + +# download onnx from sparsezoo and compile with batch size 1 +sparsezoo_stub = "zoo:nlp/text_classification/obert-base/pytorch/huggingface/mnli/pruned90_quant-none" +pipeline = Pipeline.create( + task="zero_shot_text_classification", + model_path=sparsezoo_stub, # sparsezoo stub or path to local ONNX + batch_size=1, # default batch size is 1 + model_config={"hypothesis_template": "This text is similar to {}"} +) + +# run inference +prediction = pipeline( + sequences="My favorite sports team is the Boston Red Sox", + labels=["sports", "politics", "public health"] +) +print(prediction) + +# sequences='My favorite sports team is the Boston Red Sox' labels=['sports', 'politics', 'public health'] scores=[0.5861895680427551, 0.32133620977401733, 0.0924743041396141] +``` + +### Cross Use Case Functionality +Check out the [Pipeline User Guide](../../user-guide/deepsparse-pipelines.md) for more details on configuring a Pipeline. + +## DeepSparse Server +Built on the popular FastAPI and Uvicorn stack, DeepSparse Server enables you to set up a REST endpoint for serving inferences over HTTP. DeepSparse Server wraps the Pipeline API, so it inherits all the utilities provided by Pipelines. + +The CLI command below launches a zero shot text classification pipeline with a 90% pruned-quantized oBERT model trained on MNLI: + +```bash +deepsparse.server \ + --task zero_shot_text_classification \ + --model_path "zoo:nlp/text_classification/obert-base/pytorch/huggingface/mnli/pruned90_quant-none" # or path/to/onnx +``` + +Making a request: +```python +import requests + +# Uvicorn is running on this port +url = 'http://0.0.0.0:5543/predict' + +# send the data +obj = { + "sequences": ["The Boston Red Sox are my favorite baseball team!"], + "labels": ["sports", "politics", "public health"] +} +resp = requests.post(url=url, json=obj) + +# recieve the post-processed output +print(resp.text) +# {"sequences":["The Boston Red Sox are my favorite baseball team!"],"labels":[["sports","politics","public health"]],"scores":[[0.9649990200996399,0.028026442974805832,0.006974523887038231]]} +``` + +### Use Case Specific Arguments +To use the `labels` and `model_config` arguments in the server constructor, create a Server configuration file for passing the arguments via kwargs. 
+ +This configuration file sets the labels to `sports`, `politics` and `public health` and creates hypotheses of the form `"This sentence is similiar to {}"`. + +```yaml +# config.yaml +endpoints: + - task: zero_shot_text_classification + model: zoo:nlp/text_classification/obert-base/pytorch/huggingface/mnli/pruned90_quant-none + kwargs: + labels: ["sports", "politics", "public health"] + model_config: {"hypothesis_template": "This text is similar to {}"} +``` + +Spin up the server: +```bash +deepsparse.server --config-file config.yaml +``` + +Making a request: +```python +import requests + +# Uvicorn is running on this port +url = 'http://0.0.0.0:5543/predict' + +# send the data +obj = {"sequences": ["The Boston Red Sox are my favorite baseball team!"]} +resp = requests.post(url=url, json=obj) + +# recieve the post-processed output +print(resp.text) +# {"sequences":["The Boston Red Sox are my favorite baseball team!"],"labels":[["sports","politics","public health"]],"scores":[[0.7818478941917419,0.17189143598079681,0.04626065865159035]]} + +``` +## Using a Custom ONNX File +Apart from using models from the SparseZoo, DeepSparse allows you to deploy zero-shot text classification pipelines with custom ONNX files. + +The first step is to obtain the ONNX model. You can obtain the file by converting your model to ONNX after training. +Click Download on the [BERT base uncased page](https://sparsezoo.neuralmagic.com/models/nlp%2Ftext_classification%2Fbert-base%2Fpytorch%2Fhuggingface%2Fsst2%2Fbase-none) +to download an ONNX BERT base uncased model for demonstration. + +Extract the downloaded file and create a folder containing the following required files: +- `config.json` +- `tokenizer.json` +- `model.onnx` + +Use the folder as the model path to the zero-shot text classification pipeline: +```python +from deepsparse import Pipeline + +# download onnx from sparsezoo and compile with batch size 1 +pipeline = Pipeline.create( + task="zero_shot_text_classification", + model_path="text-classification", # sparsezoo stub or path to local ONNX + batch_size=1, # default batch size is 1 + labels=["poltics", "public health", "sports"] +) + +# run inference +prediction = pipeline("Who are you voting for in the upcoming election") +print(prediction) +# sequences='Who are you voting for in the upcoming election' labels=['sports', 'poltics', 'public health'] scores=[0.35093653202056885, 0.3335352838039398, 0.31552815437316895] +``` +### Cross Use Case Functionality + +Check out the [Server User Guide](../../user-guide/deepsparse-server.md) for more details on configuring the Server. diff --git a/docs/user-guide/README.md b/docs/user-guide/README.md new file mode 100644 index 0000000000..4ea16d42b1 --- /dev/null +++ b/docs/user-guide/README.md @@ -0,0 +1,215 @@ + + +# User Guide + +This directory demonstrates usage of DeepSparse's key API, including: +- [Benchmarking CLI](#performance-benchmarking) +- [Engine API](#engine) +- [Pipeline API](#pipeline) +- [Server API](#server) + +## Installation Requirements + +This page requires [DeepSparse Server installation](installation.md). + +## Performance Benchmarking + +DeepSparse's key feature is its performance on commodity CPUs. For dense unoptimized models, DeepSparse is competitive with other CPU runtimes like ONNX Runtime. However, when optimization techniques like pruning and quantization are applied to a model, DeepSparse can achieve an order-of-magnitude speedup. 
+ +As an example, let's compare DeepSparse and ORT's performance on BERT using a [90% pruned-quantized version](https://sparsezoo.neuralmagic.com/models/nlp%2Fsentiment_analysis%2Fobert-base%2Fpytorch%2Fhuggingface%2Fsst2%2Fpruned90_quant-none) in SparseZoo on an AWS `c6i.16xlarge` instance (32 cores). + +ORT achieves 18.5 items/second running BERT (make sure you have ORT installed `pip install onnxruntime`): +```bash +deepsparse.benchmark zoo:nlp/text_classification/obert-base/pytorch/huggingface/mnli/base-none -b 64 -s sync -nstreams 1 -i [64,384] -e onnxruntime + +>> Original Model Path: zoo:nlp/text_classification/obert-base/pytorch/huggingface/mnli/base-none +>> Batch Size: 64 +>> Scenario: sync +>> Throughput (items/sec): 18.5742 +``` + +DeepSparse achieves 226 items/second running the pruned-quantized version of BERT: + +```bash +deepsparse.benchmark zoo:nlp/text_classification/obert-base/pytorch/huggingface/mnli/pruned90_quant-none -b 64 -s sync -nstreams 1 -i [64,384] + +>> Original Model Path: zoo:nlp/text_classification/obert-base/pytorch/huggingface/mnli/pruned90_quant-none +>> Batch Size: 64 +>> Scenario: sync +>> Throughput (items/sec): 226.6340 +``` + +DeepSparse achieves a ***12x speedup*** over ORT! + +**Pro-Tip:** In place of a [SparseZoo](https://sparsezoo.neuralmagic.com/) stubs, you can pass a local ONNX file to test your model. + +Checkout the [Performance Benchmarking guide](deepsparse-benchmarking.md) for more details. + +## Deployment APIs + +Now that we have seen DeepSparse's performance gains, we can add DeepSparse to an application. + +DeepSparse includes three deployment APIs: +- Engine is the lowest-level API. With Engine, you pass tensors and receive the raw logits. +- Pipeline wraps the Engine with pre- and post-processing. With Pipeline, you pass raw data and +receive the prediction. +- Server wraps Pipelines with a REST API using FastAPI. With Server, you send raw data over HTTP +and receive the prediction. + +The following are simple examples of each API to get a sense of how it is used. For the example, we will use +the sentiment analysis use-case with a 90% pruned-quantized version of BERT. + +### Engine + +Engine is the lowest-level API, allowing you to run inference directly on input tensors. + +The example below downloads a 90% pruned-quantized BERT model for sentiment analysis +in ONNX format from SparseZoo, compiles the model, and runs inference on randomly generated input. + +```python +from deepsparse import compile_model +from deepsparse.utils import generate_random_inputs, model_to_path + +# download onnx, compile model +zoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none" +batch_size = 1 +bert_model = compile_model( + model=zoo_stub, # sparsezoo stub or path/to/local/model.onnx + batch_size=batch_size # default is batch 1 +) + +# run inference (input is raw numpy tensors, output is raw scores) +inputs = generate_random_inputs(model_to_path(zoo_stub), batch_size) +output = bert_model(inputs) +print(output) + +# > [array([[-0.3380675 , 0.09602544]], dtype=float32)] << raw scores +``` + +#### Model Format + +DeepSparse can accept ONNX models from two sources: + +- **SparseZoo Stubs**: SparseZoo is Neural Magic's open-source repository of sparse models. You can pass a SparseZoo stub, a unique identifier for +each model to DeepSparse, which downloads the necessary ONNX files from the remote repository. + +- **Custom ONNX**: DeepSparse allows you to use your own model in ONNX format. 
Checkout the SparseML user guide for more details on exporting +your sparse models to ONNX format. Here's a quick example using a custom ONNX file from the ONNX model zoo: + +```bash +wget https://github.com/onnx/models/raw/main/vision/classification/mobilenet/model/mobilenetv2-7.onnx +> Saving to: ‘mobilenetv2-7.onnx’ +``` + +```python +from deepsparse import compile_model +from deepsparse.utils import generate_random_inputs +onnx_filepath = "mobilenetv2-7.onnx" +batch_size = 1 + +# Generate random sample input +inputs = generate_random_inputs(onnx_filepath, batch_size) + +# Compile and run +engine = compile_model(onnx_filepath, batch_size) +outputs = engine.run(inputs) +``` + +### Pipeline + +Pipeline is the default API for interacting with DeepSparse. Similar to Hugging Face Pipelines, +DeepSparse Pipelines wrap Engine with pre- and post-processing (as well as other utilities), +enabling you to send raw data to DeepSparse and receive the post-processed prediction. + +The example below downloads a 90% pruned-quantized BERT model for sentiment analysis +in ONNX format from SparseZoo, sets up a pipeline, and runs inference on sample data. + +```python +from deepsparse import Pipeline + +# download onnx, set up pipeline +zoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none" +batch_size = 1 +sentiment_analysis_pipeline = Pipeline.create( + task="sentiment-analysis", # name of the task + model_path=zoo_stub, # zoo stub or path to local onnx file + batch_size=batch_size # default is batch 1 +) + +# run inference (input is a sentence, output is the prediction) +prediction = sentiment_analysis_pipeline("I love using DeepSparse Pipelines") +print(prediction) +# > labels=['positive'] scores=[0.9954759478569031] +``` + +Checkout the [DeepSparse Pipeline guide](deepsparse-pipelines.md) for more details. + +### Server + +Server wraps Pipelines with REST APIs, that make it easy to stand up a model serving endpoint +running DeepSparse. This enables you to send raw data to DeepSparse over HTTP and receive the post-processed +predictions. + +DeepSparse Server is launched from the command line, configured via arguments or a server configuration file. + +The following downloads a 90% pruned-quantized BERT model for sentiment analysis in ONNX format +from SparseZoo and launches a sentiment analysis endpoint: + +```bash +deepsparse.server \ + --task sentiment-analysis \ + --model_path zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none +``` + +Alternatively, the following configuration file can launch the Server. + +```yaml +# config.yaml +endpoints: + - task: sentiment-analysis + route: /predict + model: zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none +``` + +Spinning up: +```bash +deepsparse.server \ + --config-file config.yaml +``` + +You should see Uvicorn report that it is running on port 5543. Navigating to the `/docs` endpoint will +show the exposed routes as well as sample requests. + +We can then send a request over HTTP. In this example, we will use the Python requests package +to format the HTTP. + +```python +import requests + +url = "http://localhost:5543/predict" # Server's port default to 5543 +obj = {"sequences": "Snorlax loves my Tesla!"} + +response = requests.post(url, json=obj) +print(response.text) +# {"labels":["positive"],"scores":[0.9965094327926636]} +``` + +Checkout the [DeepSparse Server guide](deepsparse-server.md) for more details. 
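As a quick command-line alternative to the Python `requests` example above, the same endpoint can be queried with `curl` (a sketch; it assumes the sentiment analysis server launched above is still running locally on port 5543):

```bash
# POST raw text to the running sentiment analysis endpoint
curl -X POST http://localhost:5543/predict \
  -H "Content-Type: application/json" \
  -d '{"sequences": "Snorlax loves my Tesla!"}'
# returns the same JSON payload as the requests example above
```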
+ +## Supported Tasks + +DeepSparse supports many CV and NLP use cases out of the box. Check out the [Use Cases page](../use-cases) for details on the task-specific APIs. diff --git a/docs/user-guide/deepsparse-benchmarking.md b/docs/user-guide/deepsparse-benchmarking.md new file mode 100644 index 0000000000..c4ab15b57b --- /dev/null +++ b/docs/user-guide/deepsparse-benchmarking.md @@ -0,0 +1,195 @@ + + +# DeepSparse Benchmarking + +This page explains how to use DeepSparse's CLI utilties for benchmarking performance in a variety of scenarios. + +## Installation Requirements + +Install DeepSparse with `pip`: + +```bash +pip install deepsparse[onnxruntime] +``` + +The benchmarking numbers were achieved on an AWS `c6i.16xlarge` (32 core) instance. + +## Quickstart + +Let's compare DeepSparse's performance with dense and sparse models. + +Run the following to benchmark DeepSparse with a [dense, unoptimized BERT ONNX model](https://sparsezoo.neuralmagic.com/models/nlp%2Fsentiment_analysis%2Fobert-base%2Fpytorch%2Fhuggingface%2Fsst2%2Fbase-none): + +```bash +deepsparse.benchmark zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/base-none --batch_size 64 + +>> INFO:deepsparse.benchmark.benchmark_model:Starting 'singlestream' performance measurements for 10 seconds +>> Original Model Path: zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/base-none +>> Batch Size: 64 +>> Scenario: sync +>> Throughput (items/sec): 102.1683 +``` + +Run the following to benchmark DeepSparse with a [90% pruned and quantized BERT ONNX model](https://sparsezoo.neuralmagic.com/models/nlp%2Fsentiment_analysis%2Fobert-base%2Fpytorch%2Fhuggingface%2Fsst2%2Fpruned90_quant-none): + +```bash +deepsparse.benchmark zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none --batch_size 64 + +>> INFO:deepsparse.benchmark.benchmark_model:Starting 'singlestream' performance measurements for 10 seconds +>> Original Model Path: zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none +>> Batch Size: 64 +>> Scenario: sync +>> Throughput (items/sec): 889.1262 +``` + +Running the sparse model, DeepSparse achieves 889 items/second vs 102 items/second with the dense model. **This is an 8.7x speedup!** + +### Comparing to ONNX Runtime + +The benchmarking utility also allows you to use ONNX Runtime as the inference runtime by passing `--engine onnxruntime`. 
+ +Run the following to benchmark ORT with the same dense, unoptimized BERT ONNX model as above: +```bash +deepsparse.benchmark zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/base-none --batch_size 64 --engine onnxruntime + +>> INFO:deepsparse.benchmark.benchmark_model:Starting 'singlestream' performance measurements for 10 seconds +>> Original Model Path: zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/base-none +>> Batch Size: 64 +>> Scenario: sync +>> Throughput (items/sec): 64.3392 +``` + +Run the following to benchmark ORT with the same 90% pruned and quantized BERT ONNX model as above: +```bash +deepsparse.benchmark zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none --batch_size 64 --engine onnxruntime + +>> INFO:deepsparse.benchmark.benchmark_model:Starting 'singlestream' performance measurements for 10 seconds +>> Original Model Path: zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/base-none +>> Batch Size: 64 +>> Scenario: sync +>> Throughput (items/sec): 55.1905 +``` + +We can see that ORT does not gain additional performance from sparsity like DeepSparse. Additionally, DeepSparse runs the dense model +faster than ORT at high batch sizes. All in all, in this example **DeepSparse is 13.8x faster than ONNX Runtime**! + +## Usage + +Run `deepsparse.benchmark -h` to see full command line arguments. + +These are a few examples of common functionality. + +### Pass Your Local ONNX Model + +Beyond passing SparseZoo stubs, you can also pass a local path to an ONNX file to DeepSparse. As an example, download an ONNX file from SparseZoo using the CLI to a local directory called `./yolov5-download`. + +```bash +sparsezoo.download zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned85_quant-none --save-dir yolov5-download +``` + +We can pass a local ONNX file as follows: +```bash +deepsparse.benchmark yolov5-download/model.onnx +>> Original Model Path: yolov5-download/model.onnx +>> Batch Size: 1 +>> Scenario: sync +>> Throughput (items/sec): 219.7396 +``` + +### Batch Sizes + +We can adjust the batch size of the inference with `-b` or `--batch_size`. + +The following runs a 95% pruned-quantized version of ResNet-50 at batch size 1: +```bash +deepsparse.benchmark zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95_quant-none --batch_size 1 + +>> Original Model Path: zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95_quant-none +>> Batch Size: 1 +>> Scenario: sync +>> Throughput (items/sec): 852.7742 +``` + +The following runs a 95% pruned-quantized version of ResNet-50 at batch size 64: +```bash +deepsparse.benchmark zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95_quant-none --batch_size 64 + +>> Original Model Path: zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95_quant-none +>> Batch Size: 64 +>> Scenario: sync +>> Throughput (items/sec): 2456.9958 +``` + +In general, DeepSparse is able to achieve better performance at higher batch sizes, especially on many core machines as it is better able to utilitize the underlying hardware and saturate all of the cores at high batch sizes. + +### Custom Input Shape + +We can adjust the input share of the inference with `-i` or `--input_shape`. This is generally useful for changing the size of input images or sequence length for NLP. 
+ +Here's an example doing a BERT inference with sequence length 384 (vs 128 as above): + +```bash +deepsparse.benchmark zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none --input_shape [1,384] + +>> Original Model Path: zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none +>> Batch Size: 1 +>> Scenario: sync +>> Throughput (items/sec): 121.7578 +``` + +Here's an example doing a YOLOv5s inference with a 320x320 image (rather than 640x640 as above): + +```bash +deepsparse.benchmark zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned85_quant-none -i [1,3,320,320] + +>> Original Model Path: zoo:cv/detection/yolov5-s/pytorch/ultralytics/coco/pruned85_quant-none +>> Batch Size: 1 +>> Scenario: sync +>> Throughput (items/sec): 615.7185 +``` + +### Inference Scenarios + +The default scenrio is synchronous inference. Set by the `--scenario sync` argument, the goal metric is latency per batch (ms/batch). This scenario submits a single inference request at a time to the engine, recording the time taken for a request to return an output. This mimics an edge deployment scenario. + +Additionally, DeepSparse offers asynchronous inference, where DeepSparse will allocate resources to handle multiple inferences at once. Set by the `--scenario async` argument. This scenario submits `--num_streams` concurrent inference requests to the engine. This mimics a model server deployment scenario. + +Here's an example handling 8 concurrent batch 1 inferences: + +```bash +deepsparse.benchmark zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none --scenario async --num_streams 8 + +>> Original Model Path: zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none +>> Batch Size: 1 +>> Scenario: async +>> Throughput (items/sec): 807.3410 +>> Latency Mean (ms/batch): 9.8906 +``` + +Here's an example handling one batch 1 inference at a time with the same model: + +```bash +deepsparse.benchmark zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none --scenario sync + +>> Original Model Path: zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none +>> Batch Size: 1 +>> Scenario: sync +>> Throughput (items/sec): 269.6001 +>> Latency Mean (ms/batch): 3.7041 +``` + +We can see that the async scenario achieves higher throughput, while the synchronous scenario achieves lower latency. Especially for very high core counts, using the asynchronous scheduler is a great way to improve performance if running at low batch sizes. diff --git a/docs/user-guide/deepsparse-pipelines.md b/docs/user-guide/deepsparse-pipelines.md new file mode 100644 index 0000000000..5d1e7fa92b --- /dev/null +++ b/docs/user-guide/deepsparse-pipelines.md @@ -0,0 +1,333 @@ + + +# DeepSparse Pipelines + +Pipelines are the default API for deploying a model with DeepSparse. + +Similar to Hugging Face Pipelines, DeepSparse Pipelines wrap inference with task-specific +pre- and post-processing, enabling you to pass raw data and receive the predictions. + +## Quickstart + +Let us try a quick example of the Pipeline API. All we have to do is pass a task and model to the +the `Pipeline.create` function, and then we can run inference on raw data using DeepSparse! + +This example creates a sentiment analysis Pipeline with a 90% pruned-quantized verion of BERT +from the SparseZoo. 
+ +```python +from deepsparse import Pipeline + +# download and compile onnx, create pipeline +zoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none" +sentiment_analysis_pipeline = Pipeline.create( + task="sentiment-analysis", # name of the task + model_path=zoo_stub, # zoo stub or path to local onnx file +) + +# run inference +print(sentiment_analysis_pipeline("I love using DeepSparse Pipelines")) +# >>> labels=['positive'] scores=[0.9954759478569031] +``` + +In this case we passed a SparseZoo stub as the model, which instructs DeepSparse to download the +relevant ONNX file from the SparseZoo. To deploy your own model, pass a path to a `model.onnx` file or to a +folder containing `model.onnx` and supporting files (e.g., the Hugging Face `tokenizer.json` and `config.json`). + +## Supported Use Cases + +Pipelines support many CV and NLP use cases out of the box. [Check out the Use Cases page for details on task-specific APIs](../use-cases). + +## Custom Use Case + +Beyond officially supported tasks, Pipelines can be extended to additional tasks via the `CustomTaskPipeline`. + +`CustomTaskPipelines` are passed the following arguments: +- `model_path` - a SparseZoo stub or path to a local ONNX file +- `process_inputs_fn` - an optional function that handles pre-processing of input into a list +of numpy arrays that can be passed directly to the inference forward pass +- `process_outputs_fn` - an optional function that handles post-processing of the list of numpy arrays +that are the output of the engine forward pass + +To replicate the functionality of the image classification +pipeline as a custom Pipeline, an example is provided. + +Download an image and ONNX file (a 95% pruned-quantized ResNet-50) for the demo: +``` +wget https://raw.githubusercontent.com/neuralmagic/docs/main/files-for-examples/use-cases/embedding-extraction/goldfish.jpg +sparsezoo.download zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95_quant-none --save-dir ./resnet-50-pruned-quant +``` + +We can create a custom image classification Pipeline which returns the raw logits and class probabilities +for the 1000 ImageNet classes with the following: + +For the purposes of this quick example, make sure you have `torch` `torchvision` and `Pillow` installed. +```python +from deepsparse.pipelines.custom_pipeline import CustomTaskPipeline +from torchvision import transforms +from PIL import Image +import numpy as np +import torch + +IMAGENET_RGB_MEANS = [0.485, 0.456, 0.406] +IMAGENET_RGB_STDS = [0.229, 0.224, 0.225] +preprocess_transforms = transforms.Compose([ + transforms.Resize(256), + transforms.CenterCrop(224), + transforms.ToTensor(), + transforms.Normalize(mean=IMAGENET_RGB_MEANS, std=IMAGENET_RGB_STDS), +]) + +def preprocess(img_file): + with open(img_file, "rb") as img_file: + img = Image.open(img_file) + img = img.convert("RGB") + img = preprocess_transforms(img) + batch = torch.stack([img]) + return [batch.numpy()] + +custom_pipeline = CustomTaskPipeline( + model_path="./resnet-50-pruned-quant/model.onnx", + process_inputs_fn=preprocess, +) + +scores, probs = custom_pipeline("goldfish.jpg") + +print(scores.shape) +print(probs.shape) +print(np.sum(probs)) +print(np.argmax(probs)) + +# >> (1,1000) +# >> (1,1000) +# >> ~1.00000 +# >> 1 << index of the goldfish class in ImageNet +``` + +## Pipeline Utilities + +Beyond supporting pre- and post-processing, Pipelines also offer additional utilities that simplify +the deployment process. 
+ +### Batch Size + +The `batch_size` argument configures the batch size of the Pipeline, modifying the underlying ONNX graph for you. +The default is batch size 1, and but we can override to batch size 3 with the following: + +```python +from deepsparse import Pipeline + +zoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none" +batch_size = 3 +sentiment_analysis_pipeline = Pipeline.create( + task="sentiment-analysis", # name of the task + model_path=zoo_stub, # zoo stub or path to local onnx file + batch_size=batch_size # default is batch 1 +) + +sentences = [ + "I love DeepSparse Pipelines", + "I hated changing the batch size with my prior Deep Learning framework", + "DeepSparse makes it very easy to adjust the batch size" +] +output = sentiment_analysis_pipeline(sentences) +print(output) + +# >>> labels=['positive', 'negative', 'positive'] scores=[0.9969560503959656, 0.9964107871055603, 0.7127435207366943] +``` + +### Number of Cores + +The `num_cores` argument configures the number of physical cores used by DeepSparse. The default is None, which +instructs DeepSparse to use all physical cores available on the system. We can override to use only +one core with the following: + +```python +from deepsparse import Pipeline + +zoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none" +sentiment_analysis_pipeline = Pipeline.create( + task="sentiment-analysis", # name of the task + model_path=zoo_stub, # zoo stub or path to local onnx file + num_cores=1 +) + +sentences = "I love how DeepSparse makes it easy to configure the number of cores used" + +output = sentiment_analysis_pipeline(sentences) +print(output) + +# >> labels=['positive'] scores=[0.9951152801513672] << but runs slower than if using all cores +``` + +### Dynamic Batch Size + +We can utilize an the multi-stream capabilites of DeepSparse to make requests with dynamic batch sizes. + +Let us create an example with a single sentiment analysis Pipeline with dynamic batch sizes by +setting the `batch_size` argument to None. Under the hood, the pipeline will split the batch into +multiple asynchronous requests using the multi-stream scheduler. + +```python +from deepsparse import Pipeline + +zoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none" +sentiment_analysis_pipeline = Pipeline.create( + task="sentiment_analysis", + model_path=zoo_stub, + batch_size=None, # setting to None enables dynamic batch +) + +b1_request = ["This multi model concept is great!"] +b4_request = b1_request * 4 + +output_b1 = sentiment_analysis_pipeline(b1_request) +output_b4 = sentiment_analysis_pipeline(b4_request) + +print(output_b1) +# >> labels=['positive'] scores=[0.9995297789573669] + +print(output_b4) +# >> labels=['positive', 'positive', 'positive', 'positive'] scores=[0.9995297789573669, 0.9995297789573669, 0.9995297789573669, 0.9995297789573669] +``` + +### Deploy Multiple Models on the Same System + +Some deployment scenarios will require running multiple instances of DeepSparse on a single +machine. DeepSparse includes a concept called Context. Contexts can be used to run multiple +models with the same scheduler, enabling DeepSparse to manage the resources of the system effectively, +keeping engines that are running different models from fighting over resources. 
+ +To create an example with multiple sentiment analysis Pipelines, one with batch size 1 (for maximum latency) +and one with batch size 32 (for maximum throughput): + +```python +from concurrent.futures import ThreadPoolExecutor +from deepsparse.engine import Context +from deepsparse import Pipeline + +context = Context() +executor = ThreadPoolExecutor(max_workers=context.num_streams) + +zoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none" + +sentiment_analysis_pipeline_b1 = Pipeline.create( + task="sentiment_analysis", + model_path=zoo_stub, + batch_size=1, + context=context, + executor=executor +) + +sentiment_analysis_pipeline_b32 = Pipeline.create( + task="sentiment_analysis", + model_path=zoo_stub, + batch_size=32, + context=context, + executor=executor +) + +b1_request = ["This multi model concept is great!"] +b32_request = b1_request * 32 + +output_b1 = sentiment_analysis_pipeline_b1(b1_request) +output_b32 = sentiment_analysis_pipeline_b32(b32_request) + +print(output_b1) +print(output_b32) + +# >> labels=['positive'] scores=[0.9995297789573669] +# >> labels=['positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive'] scores=[0.9995297789573669, 0.9995297789573669, 0.9995297789573669, 0.9995297789573669, 0.9995297789573669, 0.9995297789573669, 0.9995297789573669, 0.9995297789573669, 0.9995297789573669, 0.9995297789573669, 0.9995297789573669, 0.9995297789573669, 0.9995297789573669, 0.9995297789573669, 0.9995297789573669, 0.9995297789573669, 0.9995297789573669, 0.9995297789573669, 0.9995297789573669, 0.9995297789573669, 0.9995297789573669, 0.9995297789573669, 0.9995297789573669, 0.9995297789573669, 0.9995297789573669, 0.9995297789573669, 0.9995297789573669, 0.9995297789573669, 0.9995297789573669, 0.9995297789573669, 0.9995297789573669, 0.9995297789573669] +``` + +If you are deploying multiple models on a same system, you may want to answer multiple +requests concurrently. We can enable this but setting the `num_streams` argument in the Context argument. 
+ +```python +from concurrent.futures import ThreadPoolExecutor +from deepsparse.engine import Context +from deepsparse.pipeline import Pipeline +import threading + +class ExecutorThread(threading.Thread): + def __init__(self, pipeline, input, iters=1): + super(ExecutorThread, self).__init__() + self.pipeline = pipeline + self.input = input + self.iters = iters + + def run(self): + for _ in range(self.iters): + output = self.pipeline(self.input) + print(output) + +num_concurrent_requests = 2 + +context = Context(num_streams=num_concurrent_requests) +executor = ThreadPoolExecutor(max_workers=context.num_streams) + +zoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none" + +sentiment_analysis_pipeline_b1 = Pipeline.create( + task="sentiment_analysis", + model_path=zoo_stub, + batch_size=1, + context=context, + executor=executor +) + +sentiment_analysis_pipeline_b32 = Pipeline.create( + task="sentiment_analysis", + model_path=zoo_stub, + batch_size=32, + context=context, + executor=executor +) + +b1_request = ["This multi model concept is great!"] +b64_request = b1_request * 32 + +threads = [ + ExecutorThread(sentiment_analysis_pipeline_b1, input=b1_request, iters=64), + ExecutorThread(sentiment_analysis_pipeline_b32, input=b64_request, iters=1), +] + +for thread in threads: + thread.start() + +for thread in threads: + thread.join() + +# mutiple b=1 queries print results before the b=32 query returns +``` + +Note that requests will be execute in a FIFO manner, with a maximum of `num_concurrent_requests` running at once. +As a result, high traffic on one of your Pipelines can impact performance on the other Pipeline. If you prefer to +isolate your Pipelines, we recommend using an orchestration framework such as Docker and Kubernetes with +one DeepSparse Pipeline running in each container for proper process isolation. + +### Multi-Stream Scheduling + +Stay tuned for documentation on enabling multi-stream scheduling with DeepSparse Pipelines. + +### Bucketing + +Stay tuned for documentation on enabling bucketing with DeepSparse Pipelines. + +### Logging + +Stay tuned for documentation on enabling logging with DeepSparse Pipelines. diff --git a/docs/user-guide/deepsparse-server.md b/docs/user-guide/deepsparse-server.md new file mode 100644 index 0000000000..63147a442e --- /dev/null +++ b/docs/user-guide/deepsparse-server.md @@ -0,0 +1,287 @@ + + +# DeepSparse Server + +DeepSparse Server wraps [Pipelines](deepsparse-pipelines.md) with a REST API, making it easy to stand up a inference +serving endpoint running DeepSparse. + +## Quickstart + +DeepSparse Server is launched from the CLI. Just like DeepSparse Pipelines, all we +have to do is pass a task and a model. + +Spin up sentiment analysis endpoint with a 90% pruned-quantized BERT model: +```bash +deepsparse.server \ + --task sentiment-analysis \ + --model_path zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none +``` + +In this case, we used a SparseZoo stub, which instructs the Server to download the relevant +ONNX file from the SparseZoo. To deploy your own model, pass a path to a `model.onnx` file or a +folder containing the `model.onnx` and supporting files (e.g., the Hugging Face `tokenizer.json` and `config.json`). + +Let's make a request over HTTP. 
Since the Server is a wrapper around Pipelines, +we can send raw data to the endpoint and receive the post-processed predictions: + +```python +import requests +url = "http://localhost:5543/predict" +obj = {"sequences": "I love querying DeepSparse over HTTP!"} +print(requests.post(url, json=obj).text) + +# >>> {"labels":["positive"],"scores":[0.9909943342208862]} +``` + +For full usage, run: +```bash +deepsparse.server --help +``` + +## Supported Use Cases + +DeepSparse Server supports all tasks available in Pipelines. [Check out the Use Cases page for more details on task-specific APIs](../use-cases). + +## Swagger UI + +FastAPI's Swagger UI enables you to view your Server's routes and to make sample requests. Navigate to the `/docs` +route (e.g., `http://localhost:5543/docs`) to try it out. + +
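The same information behind the UI is also available as a machine-readable schema. The sketch below assumes the standard FastAPI `/openapi.json` route, which DeepSparse Server inherits from FastAPI:

```python
import requests

# fetch the OpenAPI schema that powers the Swagger UI
schema = requests.get("http://localhost:5543/openapi.json").json()

# list the routes exposed by the running server
print(list(schema["paths"].keys()))
```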

+ +

+ +## Server Configuration + +You can configure DeepSparse Server via YAML files. + +### Basic Example + +Let us walk through a basic example of deploying via a configuration file. + +The following creates an endpoint running a 90% pruned-quantized version of +BERT trained on the SST2 dataset for the sentiment analysis task. + +```yaml +# config.yaml +endpoints: + - task: sentiment-analysis + model: zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none +``` + +We can then spin up with the `--config-file` argument: + +```bash +deepsparse.server \ + --config-file config.yaml +``` + +Sending a request: +```python +import requests +url = "http://localhost:5543/predict" +obj = {"sequences": "I love querying DeepSparse launched from a config file!"} +print(requests.post(url, json=obj).text) + +# >>> {"labels":["positive"],"scores":[0.9136188626289368]} +``` + +### Server Level Options + +At the server level, there are a few arguments that can be toggled. + +#### Physical Resources +`num_cores` specifies the number of cores that DeepSparse runs on. By default, +DeepSparse runs on all available cores. + +#### Scheduler +`num_workers` configures DeepSparse's scheduler. + +If `num_workers = 1` (the default), DeepSparse uses its "synchronous" scheduler, which allocates as many resources as possible +to each request. This format is optimizes per-request latency. By setting `num_workers > 1`, DeepSparse +utilizes its multi-stream scheduler, which processes multiple requests at the same time. +In deployment scenarios with low batch sizes and high core counts, using the "multi-stream" scheduler +can increase throughput by allowing DeepSparse to better saturate the cores. + +The following configuration creates a Server with DeepSparse running on two cores, with two input streams, +DeepSparse threads pinned to cores, and PyTorch provided with 2 threads. + +```yaml +# server-level-options-config.yaml +num_cores: 2 +num_workers: 2 + +endpoints: + - task: sentiment-analysis + model: zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none +``` + +We can also adjust the port by providing the `--port` argument. + +Spinning up: +```bash +deepsparse.server \ + --config-file server-level-options-config.yaml \ + --port 5555 +``` + +We can then query the Server with the same pattern, querying on port 5555: +```python +import requests +url = "http://localhost:5555/predict" +obj = {"sequences": "I love querying DeepSparse launched from a config file!"} +print(requests.post(url, json=obj).text) + +# >>> {"labels":["positive"],"scores":[0.9136188626289368]} +``` + +### Multiple Endpoints + +To serve multiple models from the same context, we can add an additional endpoint +to the server configuration file. + +Here is an example which stands up two sentiment analysis endpoints, one using a +dense unoptimized BERT and one using a 90% pruned-quantized BERT. 
+ +```yaml +# multiple-endpoint-config.yaml +endpoints: + - task: sentiment-analysis + model: zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none + route: /sparse/predict + name: sparse-sentiment-analysis + + - task: sentiment-analysis + model: zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/base-none + route: /dense/predict + name: dense-sentiment-analysis +``` + +Spinning up: +```bash +deepsparse.server \ + --config-file multiple-endpoint-config.yaml +``` + +Making a request: +```python +import requests + +obj = {"sequences": "I love querying the multi-model server!"} + +sparse_url = "http://localhost:5543/sparse/predict" +print(f"From the sparse model: {requests.post(sparse_url, json=obj).text}") + +dense_url = "http://localhost:5543/dense/predict" +print(f"From the dense model: {requests.post(dense_url, json=obj).text}") + +# >>> From the sparse model: {"labels":["positive"],"scores":[0.9942120313644409]} +# >>> From the dense model: {"labels":["positive"],"scores":[0.998753547668457]} +``` + +### Endpoint Level Configuration + +We can also configure the properties of each endpoint, including task-specific +arguments from within the YAML file. + +For instance, the following configuration file creates two endpoints. + +The first is a text classification endpoint, using a 90% pruned-quantized BERT model trained on +IMDB for document classification (which means the model is tuned to classify long +sequence lengths). We configure this endpoint with batch size 1 and sequence length +of 512. Since sequence length is a task-specific argument used only in Transformers Pipelines, +we will pass this in `kwargs` in the YAML file. + +The second is a sentiment analysis endpoint. We will use the default +sequence length (128) with batch size 3. + +```yaml +# advanced-endpoint-config.yaml + +endpoints: + - task: text-classification + model: zoo:nlp/document_classification/obert-base/pytorch/huggingface/imdb/pruned90_quant-none + route: /text-classification/predict + name: text-classification + batch_size: 1 + kwargs: + sequence_length: 512 # uses 512 sequence len (transformers pipeline specific) + top_k: 2 # returns top 2 scores (text-classification pipeline specific arg) + + - task: sentiment-analysis + model: zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none + route: /sentiment-analysis/predict + name: sentiment-analysis + batch_size: 3 +``` + +Spinning up: +```bash +deepsparse.server \ + --config-file advanced-endpoint-config.yaml +``` + +Making requests: +```python +import requests + +# batch 1 +document_obj = {"sequences": "I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. \ + I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, \ + stilted dialogues, CG that doesn't match the background, and painfully one-dimensional characters cannot be overcome with a 'sci-fi' setting. (I'm sure \ + there are those of you out there who think Babylon 5 is good sci-fi TV. It's not. It's clichéd and uninspiring.) While US viewers might like emotion and \ + character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. \ + It's really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. 
Their actions and reactions are wooden \ + and predictable, often painful to watch. The makers of Earth KNOW it's rubbish as they have to always say 'Gene Roddenberry's Earth...' otherwise people \ + would not continue watching. Roddenberry's ashes must be turning in their orbit as this dull, cheap, poorly edited (watching it without advert breaks \ + really brings this home) trudging Trabant of a show lumbers into space. Spoiler. So, kill off a main character. And then bring him back as another actor. \ + Jeeez! Dallas all over again."} + +# batch 3 +short_obj = {"sequences": [ + "I love how easy it is to configure DeepSparse Server!", + "It was very challenging to configure my old deep learning inference platform", + "YAML is the best format for configuring my infrastructure" +]} + +document_classification_url = "http://localhost:5543/text-classification/predict" +print(requests.post(document_classification_url, json=document_obj).text) + +sentiment_analysis_url = "http://localhost:5543/sentiment-analysis/predict" +print(requests.post(sentiment_analysis_url, json=short_obj).text) + +# >>> {"labels":[["0","1"]],"scores":[[0.9994900226593018,0.0005100301350466907]]} +# >>> {"labels":["positive","negative","positive"],"scores":[0.9665533900260925,0.9952980279922485,0.9939143061637878]} +``` + +Check out the [Use Case](../use-cases) page for detailed documentation on task-specific arguments that can be applied to the Server via `kwargs`. + +## Custom Use Cases + +Stay tuned for documentation on using a custom DeepSparse Pipeline within the Server! + +## Multi-Stream + +Stay tuned for documentation on multi-stream scheduling with DeepSparse! + +## Logging + +Stay tuned for documentation on DeepSparse Logging! + +## Hot Reloading + +Stay tuned for documentation on Hot Reloading! diff --git a/docs/user-guide/hardware-support.md b/docs/user-guide/hardware-support.md new file mode 100644 index 0000000000..6602699e00 --- /dev/null +++ b/docs/user-guide/hardware-support.md @@ -0,0 +1,30 @@ + + +# Supported Hardware for DeepSparse + +With support for AVX2, AVX-512, and VNNI instruction sets, DeepSparse is validated to work on x86 Intel (Haswell generation and later) and AMD (Zen 2 and later) CPUs running Linux. +Mac and Windows require running Linux in a Docker or virtual machine. 
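To confirm what DeepSparse detects on a given machine, the helpers in the engine's `cpu` module can be queried directly. This is a quick sketch; `cpu_architecture` and `cpu_vnni_compatible` are the helpers referenced in the engine source, and the exact fields printed may vary by version:

```python
from deepsparse.cpu import cpu_architecture, cpu_vnni_compatible

# print the detected CPU details (vendor, instruction set, core counts, ...)
print(cpu_architecture())

# True when VNNI instructions are available for optimized quantized kernels
print(cpu_vnni_compatible())
```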
+ +Here is a table detailing specific support for some algorithms over different microarchitectures: + +| x86 Extension | Microarchitectures | Kernel Sparsity | Sparse Quantization | +|:----------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------------:|:-------------------:| +| [AMD AVX2](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#CPUs_with_AVX2) | [Zen 2,](https://en.wikipedia.org/wiki/Zen_2) [Zen 3](https://en.wikipedia.org/wiki/Zen_3) | optimized | emulated | +| [AMD AVX-512](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#CPUs_with_AVX-512) VNNI | [Zen 4](https://en.wikipedia.org/wiki/Zen_4) | optimized | optimized | +| [Intel AVX2](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#CPUs_with_AVX2) | [Haswell,](https://en.wikipedia.org/wiki/Haswell_%28microarchitecture%29) [Broadwell,](https://en.wikipedia.org/wiki/Broadwell_%28microarchitecture%29) and newer | optimized | emulated | +| [Intel AVX-512](https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512) | [Skylake](https://en.wikipedia.org/wiki/Skylake_%28microarchitecture%29), [Cannon Lake](https://en.wikipedia.org/wiki/Cannon_Lake_%28microarchitecture%29), and newer | optimized | emulated | +| [Intel AVX-512](https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512) VNNI (DL Boost) | [Cascade Lake](https://en.wikipedia.org/wiki/Cascade_Lake_%28microarchitecture%29), [Ice Lake](https://en.wikipedia.org/wiki/Ice_Lake_%28microprocessor%29), [Cooper Lake](https://en.wikipedia.org/wiki/Cooper_Lake_%28microarchitecture%29), [Tiger Lake](https://en.wikipedia.org/wiki/Tiger_Lake_%28microprocessor%29) | optimized | optimized | diff --git a/docs/user-guide/installation.md b/docs/user-guide/installation.md new file mode 100644 index 0000000000..75b7346bb6 --- /dev/null +++ b/docs/user-guide/installation.md @@ -0,0 +1,47 @@ + + +# DeepSparse Installation + +DeepSparse is tested on Python 3.7-3.10, ONNX 1.5.0-1.10.1, ONNX opset version 11+ and is [manylinux compliant](https://peps.python.org/pep-0513/). + +It currently supports Intel and AMD AVX2, AVX-512, and VNNI x86 instruction sets. + +## General Install + +Use the following command to install DeepSparse with pip: + +```bash +pip install deepsparse +``` + +## Installing the Server + +DeepSparse Server allows you to serve models and pipelines through an HTTP interface using the `deepsparse.server` CLI. +To install, use the following extra option: + +```bash +pip install deepsparse[server] +``` + +## Installing YOLO + +The Ultralytics YOLOv5 models require extra dependencies for deployment. To use YOLO models, install with the following extra option: + +```bash +pip install deepsparse[yolo] # just yolo requirements +pip install deepsparse[yolo,server] # both yolo + server requirements +``` diff --git a/docs/user-guide/scheduler.md b/docs/user-guide/scheduler.md new file mode 100644 index 0000000000..a26c8aa282 --- /dev/null +++ b/docs/user-guide/scheduler.md @@ -0,0 +1,75 @@ + + +# Inference Types With DeepSparse Scheduler + +This page explains the various settings for DeepSparse, which enable you to tune the performance to your workload. 
+ +Schedulers are special system software, which handle the distribution of work across cores in parallel computation. +The goal of a good scheduler is to ensure that, while work is available, cores are not sitting idle. +On the contrary, as long as parallel tasks are available, all cores should be kept busy. + +## Single Stream (Default) +In most use cases, the default scheduler is the preferred choice when running inferences with DeepSparse. +The default scheduler is highly optimized for minimum per-request latency, using all of the system's resources provided to it on every request it gets. +Often, particularly when working with large batch sizes, the scheduler is able to distribute the workload of a single request across as many cores as it's provided. + +*Single-stream scheduling; requests execute serially by default:* + + +single stream diagram + +## Multi-Stream + +There are circumstances in which more cores does not imply better performance. If the computation can't be divided up to produce enough parallelism (while maximizing use of the CPU cache), then adding more cores simply adds more compute power with little work to apply it to. + +An alternative, multi-stream scheduler is provided with the software. In cases where parallelism is low, sending multiple requests simultaneously can more adequately saturate the available cores. In other words, if speedup can't be achieved by adding more cores, then perhaps speedup can be achieved by adding more work. + +If increasing core count does not decrease latency, that's a strong indicator that parallelism is low in your particular model/batch-size combination. It may be that total throughput can be increased by making more requests simultaneously. Using the [deepsparse.engine.Scheduler API,](https://docs.neuralmagic.com/archive/deepsparse/api/deepsparse.html#module-deepsparse.engine) the multi-stream scheduler can be selected, and requests made by multiple Python threads will be handled concurrently. + +*Multi-stream scheduling; requests execute in parallel and may better utilize hardware resources:* + +multi stream diagram + + + +Whereas the default scheduler will queue up requests made simultaneously and handle them serially, the multi-stream scheduler allows multiple requests to be run in parallel. The `num_streams` argument to the Engine/Context classes controls how the multi-streams scheduler partitions up the machine. Each stream maps to a contiguous set of hardware threads. By default, only one hyperthread per core is used. There is no sharing amongst the partitions and it is generally good practice to make sure the `num_streams` value evenly divides into your number of cores. By default `num_streams` is set to multiplex requests across L3 caches. + +Here's an example. Consider a machine with 2 sockets, each with 8 cores. In this case, the multi-stream scheduler will create two streams, one per socket by default. The first stream will contain cores 0-7 and the second stream will contain cores 8-15. + +Manually increasing `num_streams` to 3 will result in the following stream breakdown: threads 0-5 in the first stream, 6-10 in the second, and 11-15 in the last. This is problematic for our 2-socket system. The second stream (threads 6-10) is straddling both sockets, meaning that each request being serviced by that stream is going to incur a performance penalty each time one of its threads makes a remote memory access. The impact of this penalty will depend on the workload, but it will likely be significant. 
+ +Manually increasing `num_streams` to 4 is interesting. Here's the stream breakdown: threads 0-3 in the first stream, 4-7 in the second, 8-11 in the third, and 12-15 in the fourth. Each stream is only making memory accesses that are local to its socket, which is good. However, the first two and last two streams are sharing the same L3 cache, which can result in worse performance due to cache thrashing. Depending on the workload, though, the performance gain from the increased parallelism may negate this penalty. + +The most common use cases for the multi-stream scheduler are where parallelism is low with respect to core count, and where requests need to be made asynchronously without time to batch them. Implementing a model server may fit such a scenario and be ideal for using multi-stream scheduling. + +## Enabling a Scheduler + +Depending on your engine execution strategy, enable one of these options by running: + +```python +engine = compile_model(model_path, scheduler="single_stream") +``` + +or: + +```python +engine = compile_model(model_path, scheduler="multi_stream", num_streams=None) # None is the default +``` + +or pass in the enum value directly, since` "multi_stream" == Scheduler.multi_stream`. + +By default, the scheduler will map to a single stream. diff --git a/src/deepsparse/__init__.py b/src/deepsparse/__init__.py index 0c0e7e0d1b..7ac2b698ef 100644 --- a/src/deepsparse/__init__.py +++ b/src/deepsparse/__init__.py @@ -31,11 +31,19 @@ cpu_vnni_compatible, ) from .engine import * -from .tasks import * from .timing import * from .pipeline import * from .loggers import * from .version import __version__, is_release -from .analytics import deepsparse_analytics as _analytics -_analytics.send_event("python__init") +try: + from sparsezoo.package import check_package_version as _check_package_version + + _check_package_version( + package_name=__name__ if is_release else f"{__name__}-nightly", + package_version=__version__, + ) +except Exception as err: + print( + f"Need sparsezoo version above 0.9.0 to run Neural Magic's latest-version check\n{err}" + ) diff --git a/src/deepsparse/benchmark/__init__.py b/src/deepsparse/benchmark/__init__.py index 432d48cf44..91d264f339 100644 --- a/src/deepsparse/benchmark/__init__.py +++ b/src/deepsparse/benchmark/__init__.py @@ -14,10 +14,5 @@ # flake8: noqa -from deepsparse.analytics import deepsparse_analytics as _analytics - from .ort_engine import * from .results import * - - -_analytics.send_event("python__benchmark__init") diff --git a/src/deepsparse/benchmark/benchmark_model.py b/src/deepsparse/benchmark/benchmark_model.py index ae37d7807f..f66c36039c 100644 --- a/src/deepsparse/benchmark/benchmark_model.py +++ b/src/deepsparse/benchmark/benchmark_model.py @@ -422,7 +422,6 @@ def benchmark_model( "seconds_to_run": time, "num_streams": num_streams, "benchmark_result": benchmark_result, - "fraction_of_supported_ops": getattr(model, "fraction_of_supported_ops", None), } # Export results diff --git a/src/deepsparse/benchmark/ort_engine.py b/src/deepsparse/benchmark/ort_engine.py index d2d61e83a1..d16b14578e 100644 --- a/src/deepsparse/benchmark/ort_engine.py +++ b/src/deepsparse/benchmark/ort_engine.py @@ -19,6 +19,8 @@ import numpy from deepsparse.utils import ( + get_input_names, + get_output_names, model_to_path, override_onnx_batch_size, override_onnx_input_shapes, @@ -100,6 +102,9 @@ def __init__( self._num_cores = num_cores self._input_shapes = input_shapes + self._input_names = get_input_names(self._model_path) + self._output_names = 
get_output_names(self._model_path) + if providers is None: providers = onnxruntime.get_available_providers() self._providers = providers @@ -209,34 +214,6 @@ def scheduler(self) -> None: """ return None - @property - def input_names(self) -> List[str]: - """ - :return: The ordered names of the inputs. - """ - return [node_arg.name for node_arg in self._eng_net.get_inputs()] - - @property - def input_shapes(self) -> List[Tuple]: - """ - :return: The ordered shapes of the inputs. - """ - return [tuple(node_arg.shape) for node_arg in self._eng_net.get_inputs()] - - @property - def output_names(self) -> List[str]: - """ - :return: The ordered names of the outputs. - """ - return [node_arg.name for node_arg in self._eng_net.get_outputs()] - - @property - def output_shapes(self) -> List[Tuple]: - """ - :return: The ordered shapes of the outputs. - """ - return [tuple(node_arg.shape) for node_arg in self._eng_net.get_outputs()] - @property def providers(self) -> List[str]: """ @@ -282,8 +259,8 @@ def run( """ if val_inp: self._validate_inputs(inp) - inputs_dict = {name: value for name, value in zip(self.input_names, inp)} - return self._eng_net.run(self.output_names, inputs_dict) + inputs_dict = {name: value for name, value in zip(self._input_names, inp)} + return self._eng_net.run(self._output_names, inputs_dict) def timed_run( self, inp: List[numpy.ndarray], val_inp: bool = False diff --git a/src/deepsparse/cpu.py b/src/deepsparse/cpu.py index 6f4144c876..cb3ed6e7f5 100644 --- a/src/deepsparse/cpu.py +++ b/src/deepsparse/cpu.py @@ -18,10 +18,8 @@ import json import os -import platform import subprocess import sys -from distutils.version import StrictVersion from typing import Any, Tuple @@ -41,7 +39,6 @@ VALID_VECTOR_EXTENSIONS = {"avx2", "avx512", "neon", "sve"} -MINIMUM_DARWIN_VERSION = "13.0.0" class _Memoize: @@ -143,55 +140,6 @@ def _parse_arch_bin() -> architecture: raise OSError(error_msg.format(ex)) -def allow_experimental_darwin() -> bool: - """ - Check if experimental Darwin support is allowed. - """ - try: - allow = int(os.getenv("NM_ALLOW_DARWIN", "0")) - except ValueError: - allow = False - return allow - - -def get_darwin_version() -> str: - """ - If we are running Darwin, get the current version. Otherwise return None. - """ - if sys.platform.startswith("darwin"): - return platform.mac_ver()[0] - return None - - -def check_darwin_support() -> bool: - """ - Check if the system is running Darwin and it meets the minimum version - requirements. - """ - if sys.platform.startswith("darwin") and allow_experimental_darwin(): - ver = get_darwin_version() - return StrictVersion(ver) >= StrictVersion(MINIMUM_DARWIN_VERSION) - return False - - -def platform_error_msg() -> str: - """ - Generate unsupported platform error message. - """ - if allow_experimental_darwin(): - darwin_str = f" or MacOS >= {MINIMUM_DARWIN_VERSION}" - else: - darwin_str = "" - - darwin_ver = get_darwin_version() - if darwin_ver: - current_os = f"MacOS {darwin_ver}" - else: - current_os = sys.platform - - return f"Neural Magic: Only Linux{darwin_str} is supported, not '{current_os}'." 
- - def cpu_architecture() -> architecture: """ Detect the CPU details on linux systems @@ -207,8 +155,10 @@ def cpu_architecture() -> architecture: :return: an instance of the architecture class """ - if not (sys.platform.startswith("linux") or check_darwin_support()): - raise OSError(platform_error_msg()) + if not sys.platform.startswith("linux"): + raise OSError( + "Neural Magic: Only Linux is supported, not '{}'.".format(sys.platform) + ) arch = _parse_arch_bin() isa_type_override = os.getenv("NM_ARCH", None) diff --git a/src/deepsparse/engine.py b/src/deepsparse/engine.py index 169b36c023..4cbcf0e86c 100644 --- a/src/deepsparse/engine.py +++ b/src/deepsparse/engine.py @@ -24,13 +24,8 @@ import numpy from tqdm.auto import tqdm -from deepsparse.analytics import deepsparse_analytics as _analytics from deepsparse.benchmark import BenchmarkResults -from deepsparse.utils import ( - generate_random_inputs, - model_to_path, - override_onnx_input_shapes, -) +from deepsparse.utils import model_to_path, override_onnx_input_shapes try: @@ -186,7 +181,6 @@ def __init__( scheduler: Scheduler = None, input_shapes: List[List[int]] = None, ): - _analytics.send_event("python__engine__init") self._model_path = model_to_path(model) self._batch_size = _validate_batch_size(batch_size) self._num_cores = _validate_num_cores(num_cores) @@ -303,34 +297,6 @@ def fraction_of_supported_ops(self) -> float: """ return round(self._eng_net.fraction_of_supported_ops(), 4) - @property - def input_names(self) -> List[str]: - """ - :return: The ordered names of the inputs. - """ - return self._eng_net.input_names() - - @property - def input_shapes(self) -> List[Tuple]: - """ - :return: The ordered shapes of the inputs. - """ - return self._eng_net.input_dims() - - @property - def output_names(self) -> List[str]: - """ - :return: The ordered names of the outputs. - """ - return self._eng_net.output_names() - - @property - def output_shapes(self) -> List[Tuple]: - """ - :return: The ordered shapes of the outputs. - """ - return self._eng_net.output_dims() - @property def cpu_avx_type(self) -> str: """ @@ -348,13 +314,6 @@ def cpu_vnni(self) -> bool: """ return self._cpu_vnni - def generate_random_inputs(self) -> List[numpy.ndarray]: - """ - Generate random data that matches the type and shape of the ONNX model - :return: List of random tensors - """ - return generate_random_inputs(self.model_path, self.batch_size) - def run( self, inp: List[numpy.ndarray], @@ -571,6 +530,47 @@ def benchmark_loader( return results + def analyze( + self, + inp: List[numpy.ndarray], + num_iterations: int = 20, + num_warmup_iterations: int = 5, + optimization_level: int = 1, + imposed_as: Optional[float] = None, + imposed_ks: Optional[float] = None, + ): + """ + Function to analyze a model's performance in the DeepSparse Engine. + + Note 1: Analysis is currently only supported on a single socket. + + :param inp: The list of inputs to pass to the engine for analyzing inference. + The expected order is the inputs order as defined in the ONNX graph. + :param num_iterations: The number of times to repeat execution of the model + while analyzing, default is 20 + :param num_warmup_iterations: The number of times to repeat execution of the model + before analyzing, default is 5 + :param optimization_level: The amount of graph optimizations to perform. + The current choices are either 0 (minimal) or 1 (all), default is 1 + :param imposed_as: Imposed activation sparsity, defaults to None. 
+ Will force the activation sparsity from all ReLu layers in the graph + to match this desired sparsity level (percentage of 0's in the tensor). + Beneficial for seeing how AS affects the performance of the model. + :param imposed_ks: Imposed kernel sparsity, defaults to None. + Will force all prunable layers in the graph to have weights with + this desired sparsity level (percentage of 0's in the tensor). + Beneficial for seeing how pruning affects the performance of the model. + :return: the analysis structure containing the performance details of each layer + """ + return self._eng_net.benchmark( + inp, + num_iterations, + num_warmup_iterations, + optimization_level, + imposed_as, + imposed_ks, + ) + def _validate_inputs(self, inp: List[numpy.ndarray]): if isinstance(inp, str) or not isinstance(inp, List): raise ValueError("inp must be a list, given {}".format(type(inp))) @@ -603,115 +603,6 @@ def _properties_dict(self) -> Dict: } -class DebugAnalysisEngine(Engine): - """ - A subclass of Engine that supports debug analysis. - - :param model: Either a path to the model's onnx file, a SparseZoo model stub - prefixed by 'zoo:', a SparseZoo Model object, or a SparseZoo ONNX File - object that defines the neural network - :param batch_size: The batch size of the inputs to be used with the engine - :param num_cores: The number of physical cores to run the model on. If more - cores are requested than are available on a single socket, the engine - will try to distribute them evenly across as few sockets as possible. - :param num_streams: The max number of requests the model can handle - concurrently. - :param scheduler: The kind of scheduler to execute with. Pass None for the default. - :param input_shapes: The list of shapes to set the inputs to. Pass None to use model as-is. - :param num_iterations: The number of iterations to run benchmarking for. - Default is 20 - :param num_warmup_iterations: T number of iterations to warm up engine before - benchmarking. These executions will not be counted in the benchmark - results that are returned. Useful and recommended to bring - the system to a steady state. Default is 5 - :param include_inputs: If True, inputs from forward passes during benchmarking - will be added to the results. Default is False - :param include_outputs: If True, outputs from forward passes during benchmarking - will be added to the results. Default is False - :param show_progress: If True, will display a progress bar. Default is False - :param scheduler: The kind of scheduler to execute with. Pass None for the default. 
- """ - - def __init__( - self, - model: Union[str, "Model", "File"], - batch_size: int = 1, - num_cores: int = None, - scheduler: Scheduler = None, - input_shapes: List[List[int]] = None, - num_iterations: int = 20, - num_warmup_iterations: int = 5, - optimization_level: int = 1, - imposed_as: Optional[float] = None, - imposed_ks: Optional[float] = None, - ): - self._model_path = model_to_path(model) - self._batch_size = _validate_batch_size(batch_size) - self._num_cores = _validate_num_cores(num_cores) - self._scheduler = _validate_scheduler(scheduler) - self._input_shapes = input_shapes - self._cpu_avx_type = AVX_TYPE - self._cpu_vnni = VNNI - - num_streams = _validate_num_streams(None, self._num_cores) - if self._input_shapes: - with override_onnx_input_shapes( - self._model_path, self._input_shapes - ) as model_path: - self._eng_net = LIB.deepsparse_engine( - model_path, - self._batch_size, - self._num_cores, - num_streams, - self._scheduler.value, - None, - "external", - num_iterations, - num_warmup_iterations, - optimization_level, - imposed_as, - imposed_ks, - ) - else: - self._eng_net = LIB.deepsparse_engine( - self._model_path, - self._batch_size, - self._num_cores, - num_streams, - self._scheduler.value, - None, - "external", - num_iterations, - num_warmup_iterations, - optimization_level, - imposed_as, - imposed_ks, - ) - - def analyze( - self, - inp: List[numpy.ndarray], - val_inp: bool = True, - ) -> List[numpy.ndarray]: - """ - Function to analyze a model's performance in the DeepSparse Engine. - - Note 1: Analysis is currently only supported on a single socket. - - :param inp: The list of inputs to pass to the engine for analyzing inference. - The expected order is the inputs order as defined in the ONNX graph. - :param val_inp: Validate the input to the model to ensure numpy array inputs - are setup correctly for the DeepSparse Engine - :return: the analysis structure containing the performance details of each layer - """ - if val_inp: - self._validate_inputs(inp) - - [out, bench_info] = self._eng_net.benchmark_execute(inp) - - return bench_info - - class Context(object): """ Contexts can be used to run multiple instances of the MultiModelEngine with the same @@ -969,17 +860,19 @@ def model_debug_analysis( :param scheduler: The kind of scheduler to execute with. Pass None for the default. 
:return: the analysis structure containing the performance details of each layer """ - model = DebugAnalysisEngine( + model = compile_model( model=model, batch_size=batch_size, num_cores=num_cores, scheduler=scheduler, input_shapes=input_shapes, + ) + + return model.analyze( + inp, num_iterations=num_iterations, num_warmup_iterations=num_warmup_iterations, optimization_level=optimization_level, imposed_as=imposed_as, imposed_ks=imposed_ks, ) - - return model.analyze(inp) diff --git a/src/deepsparse/image_classification/__init__.py b/src/deepsparse/image_classification/__init__.py index cf62c40992..00ceb5828e 100644 --- a/src/deepsparse/image_classification/__init__.py +++ b/src/deepsparse/image_classification/__init__.py @@ -18,10 +18,6 @@ import warnings from collections import namedtuple -from deepsparse.analytics import deepsparse_analytics as _analytics - - -_analytics.send_event("python__image_classification__init") _LOGGER = _logging.getLogger(__name__) _Dependency = namedtuple("_Dependency", ["name", "import_name", "version", "necessary"]) diff --git a/src/deepsparse/open_pif_paf/__init__.py b/src/deepsparse/open_pif_paf/__init__.py index 78fe2add68..8d3ec2e88e 100644 --- a/src/deepsparse/open_pif_paf/__init__.py +++ b/src/deepsparse/open_pif_paf/__init__.py @@ -12,9 +12,4 @@ # See the License for the specific language governing permissions and # limitations under the License. # flake8: noqa -from deepsparse.analytics import deepsparse_analytics as _analytics - from .utils import * - - -_analytics.send_event("python__open_pif_paf__init") diff --git a/src/deepsparse/server/README.md b/src/deepsparse/server/README.md index c564ef1363..1f8a84ebb4 100644 --- a/src/deepsparse/server/README.md +++ b/src/deepsparse/server/README.md @@ -54,7 +54,7 @@ Usage: deepsparse.server [OPTIONS] COMMAND [ARGS]... prometheus: port: 6100 text_log_save_dir: /home/deepsparse-server/prometheus - text_log_save_frequency: 30 + text_log_save_freq: 30 endpoints: - task: question_answering ... diff --git a/src/deepsparse/server/__init__.py b/src/deepsparse/server/__init__.py index 3ff5d0fb10..4e63b031c7 100644 --- a/src/deepsparse/server/__init__.py +++ b/src/deepsparse/server/__init__.py @@ -19,9 +19,4 @@ the DeepSparse Engine. """ -from deepsparse.analytics import deepsparse_analytics as _analytics - from .cli import main - - -_analytics.send_event("python__server__init") diff --git a/src/deepsparse/server/cli.py b/src/deepsparse/server/cli.py index 1b323e28e3..82aed430b0 100644 --- a/src/deepsparse/server/cli.py +++ b/src/deepsparse/server/cli.py @@ -199,7 +199,7 @@ def main( prometheus: port: 6100 text_log_save_dir: /home/deepsparse-server/prometheus - text_log_save_frequency: 30 + text_log_save_freq: 30 endpoints: - task: question_answering ... 
diff --git a/src/deepsparse/transformers/__init__.py b/src/deepsparse/transformers/__init__.py index 3e084cece9..b0f7ca22f6 100644 --- a/src/deepsparse/transformers/__init__.py +++ b/src/deepsparse/transformers/__init__.py @@ -22,10 +22,6 @@ import logging as _logging import pkg_resources -from deepsparse.analytics import deepsparse_analytics as _analytics - - -_analytics.send_event("python__transformers__init") _EXPECTED_VERSION = "4.23.1" diff --git a/src/deepsparse/utils/onnx.py b/src/deepsparse/utils/onnx.py index c571e62850..aec78e885a 100644 --- a/src/deepsparse/utils/onnx.py +++ b/src/deepsparse/utils/onnx.py @@ -21,7 +21,6 @@ import numpy import onnx -from onnx.mapping import TENSOR_TYPE_TO_NP_TYPE from deepsparse.utils.extractor import Extractor @@ -36,6 +35,7 @@ sparsezoo_import_error = sparsezoo_err __all__ = [ + "ONNX_TENSOR_TYPE_MAP", "model_to_path", "get_external_inputs", "get_external_outputs", @@ -50,6 +50,23 @@ _LOGGER = logging.getLogger(__name__) +ONNX_TENSOR_TYPE_MAP = { + 1: numpy.float32, + 2: numpy.uint8, + 3: numpy.int8, + 4: numpy.uint16, + 5: numpy.int16, + 6: numpy.int32, + 7: numpy.int64, + 9: numpy.bool_, + 10: numpy.float16, + 11: numpy.float64, + 12: numpy.uint32, + 13: numpy.uint64, + 14: numpy.complex64, + 15: numpy.complex128, +} + def save_onnx(model: Model, model_path: str, external_data_file: str) -> bool: """ @@ -98,9 +115,9 @@ def translate_onnx_type_to_numpy(tensor_type: int): :param tensor_type: Integer representing a type in ONNX spec :return: Corresponding numpy type """ - if tensor_type not in TENSOR_TYPE_TO_NP_TYPE: + if tensor_type not in ONNX_TENSOR_TYPE_MAP: raise Exception("Unknown ONNX tensor type = {}".format(tensor_type)) - return TENSOR_TYPE_TO_NP_TYPE[tensor_type] + return ONNX_TENSOR_TYPE_MAP[tensor_type] def model_to_path(model: Union[str, Model, File]) -> str: diff --git a/src/deepsparse/yolact/__init__.py b/src/deepsparse/yolact/__init__.py index bee4474d74..86aaaa5de4 100644 --- a/src/deepsparse/yolact/__init__.py +++ b/src/deepsparse/yolact/__init__.py @@ -18,10 +18,6 @@ import warnings from collections import namedtuple -from deepsparse.analytics import deepsparse_analytics as _analytics - - -_analytics.send_event("python__yolact__init") _LOGGER = _logging.getLogger(__name__) _Dependency = namedtuple("_Dependency", ["name", "version", "necessary", "import_name"]) diff --git a/src/deepsparse/yolo/__init__.py b/src/deepsparse/yolo/__init__.py index 28a2af36d0..135b18a839 100644 --- a/src/deepsparse/yolo/__init__.py +++ b/src/deepsparse/yolo/__init__.py @@ -14,11 +14,6 @@ # flake8: noqa -from deepsparse.analytics import deepsparse_analytics as _analytics - from .annotate import * from .pipelines import * from .schemas import * - - -_analytics.send_event("python__yolov5__init") diff --git a/src/deepsparse/yolo/pipelines.py b/src/deepsparse/yolo/pipelines.py index 935fc9a1d4..c3866433f3 100644 --- a/src/deepsparse/yolo/pipelines.py +++ b/src/deepsparse/yolo/pipelines.py @@ -163,12 +163,6 @@ class properties into an inference ready onnx file to be compiled by the model_path = model_to_path(self.model_path) if self._image_size is None: self._image_size = get_onnx_expected_image_shape(onnx.load(model_path)) - if self._image_size == (0, 0): - raise ValueError( - "The model does not have a static image size shape. " - "Specify the expected image size by passing the" - "`image_size` argument to the pipeline." 
- ) else: # override model input shape to given image size if isinstance(self._image_size, int): diff --git a/src/deepsparse/yolo/utils/utils.py b/src/deepsparse/yolo/utils/utils.py index baa4c18721..07e7b87ec2 100644 --- a/src/deepsparse/yolo/utils/utils.py +++ b/src/deepsparse/yolo/utils/utils.py @@ -359,8 +359,6 @@ def modify_yolo_onnx_input_shape( model_input = model.graph.input[0] initial_x, initial_y = get_onnx_expected_image_shape(model) - if initial_x == initial_y == 0: - initial_x, initial_y = image_shape if not (isinstance(initial_x, int) and isinstance(initial_y, int)): return model_path, None # model graph does not have static integer input shape diff --git a/src/deepsparse/yolov8/__init__.py b/src/deepsparse/yolov8/__init__.py index a55c36903d..9efc49cd88 100644 --- a/src/deepsparse/yolov8/__init__.py +++ b/src/deepsparse/yolov8/__init__.py @@ -14,13 +14,8 @@ # flake8: noqa -from deepsparse.analytics import deepsparse_analytics as _analytics - from .annotate import * from .pipelines import * from .schemas import * from .utils import * from .validation import * - - -_analytics.send_event("python__yolov8__init") diff --git a/src/deepsparse/yolov8/utils/validation/helpers.py b/src/deepsparse/yolov8/utils/validation/helpers.py index a951ae8f5d..0db35c462d 100644 --- a/src/deepsparse/yolov8/utils/validation/helpers.py +++ b/src/deepsparse/yolov8/utils/validation/helpers.py @@ -12,53 +12,16 @@ # See the License for the specific language governing permissions and # limitations under the License. import argparse -import glob import os import warnings from typing import List, Optional, Union -import yaml - import torch from deepsparse.yolo import YOLOOutput as YOLODetOutput from deepsparse.yolov8.schemas import YOLOSegOutput -from ultralytics.yolo.data.utils import ROOT - - -__all__ = ["data_from_dataset_path", "schema_to_tensor", "check_coco128_segmentation"] - - -def data_from_dataset_path(data: str, dataset_path: str) -> str: - """ - Given a dataset name, fetch the yaml config for the dataset - from the Ultralytics dataset repo, overwrite its 'path' - attribute (dataset root dir) to point to the `dataset_path` - and finally save it to the current working directory. - This allows to create load data yaml config files that point - to the arbitrary directories on the disk. - - :param data: name of the dataset (e.g. 
"coco.yaml") - :param dataset_path: path to the dataset directory - :return: a path to the new yaml config file - (saved in the current working directory) - """ - ultralytics_dataset_path = glob.glob(os.path.join(ROOT, "**", data), recursive=True) - if len(ultralytics_dataset_path) != 1: - raise ValueError( - "Expected to find a single path to the " - f"dataset yaml file: {data}, but found {ultralytics_dataset_path}" - ) - ultralytics_dataset_path = ultralytics_dataset_path[0] - with open(ultralytics_dataset_path, "r") as f: - yaml_config = yaml.safe_load(f) - yaml_config["path"] = dataset_path - yaml_save_path = os.path.join(os.getcwd(), data) - # save the new dataset yaml file - with open(yaml_save_path, "w") as outfile: - yaml.dump(yaml_config, outfile, default_flow_style=False) - return yaml_save_path +__all__ = ["schema_to_tensor", "check_coco128_segmentation"] def schema_to_tensor( diff --git a/src/deepsparse/yolov8/validation.py b/src/deepsparse/yolov8/validation.py index 7412b4975c..cc8fd1fbaa 100644 --- a/src/deepsparse/yolov8/validation.py +++ b/src/deepsparse/yolov8/validation.py @@ -12,8 +12,6 @@ # See the License for the specific language governing permissions and # limitations under the License. -from typing import Optional - import click from deepsparse import Pipeline @@ -22,7 +20,6 @@ DeepSparseDetectionValidator, DeepSparseSegmentationValidator, check_coco128_segmentation, - data_from_dataset_path, ) from ultralytics.yolo.cfg import get_cfg from ultralytics.yolo.utils import DEFAULT_CFG @@ -66,6 +63,16 @@ show_default=True, help="Validation batch size", ) +@click.option( + "--stride", + type=int, + default=32, + show_default=True, + help="YOLOv8 can handle arbitrary sized images as long as " + "both sides are a multiple of 32. This is because the " + "maximum stride of the backbone is 32 and it is a fully " + "convolutional network.", +) @click.option( "--engine-type", default=DEEPSPARSE_ENGINE, @@ -88,21 +95,15 @@ show_default=True, help="A subtask of YOLOv8 to run. Default is `detection`.", ) -@click.option( - "--dataset-path", - type=str, - default=None, - help="Path to override default dataset path.", -) def main( dataset_yaml: str, model_path: str, batch_size: int, num_cores: int, engine_type: str, + stride: int, device: str, subtask: str, - dataset_path: Optional[str], ): pipeline = Pipeline.create( @@ -123,8 +124,6 @@ def main( f"Dataset yaml {dataset_yaml} is not supported. " f"Supported dataset configs are {SUPPORTED_DATASET_CONFIGS})" ) - if dataset_path is not None: - args.data = data_from_dataset_path(args.data, dataset_path) classes = {label: class_ for (label, class_) in enumerate(COCO_CLASSES)} if subtask == "detection":