
Commit

Fix merge with main
mgoin committed Apr 25, 2023
1 parent 1da576b commit c86e65f
Showing 6 changed files with 1,048 additions and 17 deletions.
174 changes: 172 additions & 2 deletions src/deepsparse/benchmark/README.md
See the License for the specific language governing permissions and
limitations under the License.
-->

# DeepSparse Benchmarking
## 📜 Benchmarking ONNX Models

[Check out the DeepSparse Benchmarking User Guide for usage details](../../../docs/user-guide/deepsparse-benchmarking.md)
`deepsparse.benchmark` is a command-line (CLI) tool for benchmarking the DeepSparse Engine with ONNX models. The tool will parse the arguments, download/compile the network into the engine, generate input tensors, and execute the model depending on the chosen scenario. By default, it will choose a multi-stream or asynchronous mode to optimize for throughput.

### Quickstart

After `pip install deepsparse`, the benchmark tool is available on your CLI. The model path is the only required argument, so, for example, to benchmark a dense BERT ONNX model fine-tuned on the SST2 dataset, run:

```
deepsparse.benchmark zoo:nlp/text_classification/bert-base/pytorch/huggingface/sst2/base-none
```
### Usage

In most cases, the default options deliver good performance, so it can be as simple as running the command with a SparseZoo model stub or your local ONNX model path. However, if you prefer to customize benchmarking for your use case, run `deepsparse.benchmark -h` or with `--help` to view the usage options:

CLI Arguments:
```
positional arguments:
model_path Path to an ONNX model file or SparseZoo model stub.
optional arguments:
-h, --help show this help message and exit.
-b BATCH_SIZE, --batch_size BATCH_SIZE
The batch size to run the analysis for. Must be
greater than 0.
-shapes INPUT_SHAPES, --input_shapes INPUT_SHAPES
Override the shapes of the inputs, i.e. -shapes
"[1,2,3],[4,5,6],[7,8,9]" results in input0=[1,2,3]
input1=[4,5,6] input2=[7,8,9].
-ncores NUM_CORES, --num_cores NUM_CORES
The number of physical cores to run the analysis on,
defaults to all physical cores available on the system.
-s {async,sync,elastic}, --scenario {async,sync,elastic}
Choose between using the async, sync and elastic
scenarios. Sync and async are similar to the single-
stream/multi-stream scenarios. Elastic is a newer
scenario that behaves similarly to the async scenario
but uses a different scheduling backend. Default value
is async.
-t TIME, --time TIME
The number of seconds the benchmark will run. Default
is 10 seconds.
-w WARMUP_TIME, --warmup_time WARMUP_TIME
The number of seconds the benchmark will warmup before
                        running. Default is 2 seconds.
-nstreams NUM_STREAMS, --num_streams NUM_STREAMS
The number of streams that will submit inferences in
parallel using async scenario. Default is
automatically determined for given hardware and may be
sub-optimal.
-pin {none,core,numa}, --thread_pinning {none,core,numa}
Enable binding threads to cores ('core' the default),
threads to cores on sockets ('numa'), or disable
('none').
-e {deepsparse,onnxruntime}, --engine {deepsparse,onnxruntime}
Inference engine backend to run eval on. Choices are
'deepsparse', 'onnxruntime'. Default is 'deepsparse'.
-q, --quiet Lower logging verbosity.
-x EXPORT_PATH, --export_path EXPORT_PATH
Store results into a JSON file.
```
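For example, the input shapes can be overridden to benchmark at a different sequence length (hypothetical shapes shown for illustration; the values must match the number and rank of the model's actual inputs):

```
deepsparse.benchmark zoo:nlp/text_classification/bert-base/pytorch/huggingface/sst2/base-none -shapes "[1,128],[1,128],[1,128]"
```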
💡**PRO TIP**💡: save your benchmark results in a convenient JSON file!

Example CLI command for benchmarking an ONNX model from the SparseZoo and saving the results to a `benchmark.json` file:

```
deepsparse.benchmark zoo:nlp/text_classification/bert-base/pytorch/huggingface/sst2/base-none -x benchmark.json
```
Output of the JSON file:

![alt text](./img/json_output.png)

#### Sample CLI Argument Configurations

To run a sparse FP32 MobileNetV1 at batch size 16 for 10 seconds for throughput using 8 streams of requests:

```
deepsparse.benchmark zoo:cv/classification/mobilenet_v1-1.0/pytorch/sparseml/imagenet/pruned-moderate --batch_size 16 --time 10 --scenario async --num_streams 8
```

To run a sparse quantized INT8 6-layer BERT at batch size 1 for latency:

```
deepsparse.benchmark zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned_quant_6layers-aggressive_96 --batch_size 1 --scenario sync
```
### ⚡ Inference Scenarios

#### Synchronous (Single-stream) Scenario

Set by the `--scenario sync` argument, the goal metric is latency per batch (ms/batch). This scenario submits a single inference request at a time to the engine, recording the time taken for a request to return an output. This mimics an edge deployment scenario.

The latency value reported is the mean of all latencies recorded during the execution period for the given batch size.

#### Asynchronous (Multi-stream) Scenario

Set by the `--scenario async` argument, the goal metric is throughput in items per second (i/s). This scenario submits `--num_streams` concurrent inference requests to the engine, recording the time taken for each request to return an output. This mimics a model server or bulk batch deployment scenario.

The reported throughput is computed from the number of inferences completed within the execution time, multiplied by the batch size.
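As a rough sketch of how these metrics relate (illustrative only, not the tool's internals; the function and variable names are assumptions), the reported numbers can be derived from the recorded per-request latencies and the batch size:

```python
import statistics

def summarize(request_latencies_ms, batch_size, elapsed_seconds):
    """Derive benchmark-style metrics from per-request latencies in milliseconds."""
    mean_ms = statistics.mean(request_latencies_ms)
    median_ms = statistics.median(request_latencies_ms)
    std_ms = statistics.stdev(request_latencies_ms)
    # throughput counts every inference finished in the window, scaled by batch size
    items_per_sec = len(request_latencies_ms) * batch_size / elapsed_seconds
    return mean_ms, median_ms, std_ms, items_per_sec
```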

#### Example Benchmarking Output of Synchronous vs. Asynchronous

**BERT 3-layer FP32 Sparse Throughput**

There is no need to add a *scenario* argument since `async` is the default option:
```
deepsparse.benchmark zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned_3layers-aggressive_83
[INFO benchmark_model.py:202 ] Thread pinning to cores enabled
DeepSparse Engine, Copyright 2021-present / Neuralmagic, Inc. version: 0.10.0 (9bba6971) (optimized) (system=avx512, binary=avx512)
[INFO benchmark_model.py:247 ] deepsparse.engine.Engine:
onnx_file_path: /home/mgoin/.cache/sparsezoo/c89f3128-4b87-41ae-91a3-eae8aa8c5a7c/model.onnx
batch_size: 1
num_cores: 18
scheduler: Scheduler.multi_stream
cpu_avx_type: avx512
cpu_vnni: False
[INFO onnx.py:176 ] Generating input 'input_ids', type = int64, shape = [1, 384]
[INFO onnx.py:176 ] Generating input 'attention_mask', type = int64, shape = [1, 384]
[INFO onnx.py:176 ] Generating input 'token_type_ids', type = int64, shape = [1, 384]
[INFO benchmark_model.py:264 ] num_streams default value chosen of 9. This requires tuning and may be sub-optimal
[INFO benchmark_model.py:270 ] Starting 'async' performance measurements for 10 seconds
Original Model Path: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned_3layers-aggressive_83
Batch Size: 1
Scenario: multistream
Throughput (items/sec): 83.5037
Latency Mean (ms/batch): 107.3422
Latency Median (ms/batch): 107.0099
Latency Std (ms/batch): 12.4016
Iterations: 840
```

**BERT 3-layer FP32 Sparse Latency**

To select a *synchronous inference scenario*, add `-s sync`:

```
deepsparse.benchmark zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned_3layers-aggressive_83 -s sync
[INFO benchmark_model.py:202 ] Thread pinning to cores enabled
DeepSparse Engine, Copyright 2021-present / Neuralmagic, Inc. version: 0.10.0 (9bba6971) (optimized) (system=avx512, binary=avx512)
[INFO benchmark_model.py:247 ] deepsparse.engine.Engine:
onnx_file_path: /home/mgoin/.cache/sparsezoo/c89f3128-4b87-41ae-91a3-eae8aa8c5a7c/model.onnx
batch_size: 1
num_cores: 18
scheduler: Scheduler.single_stream
cpu_avx_type: avx512
cpu_vnni: False
[INFO onnx.py:176 ] Generating input 'input_ids', type = int64, shape = [1, 384]
[INFO onnx.py:176 ] Generating input 'attention_mask', type = int64, shape = [1, 384]
[INFO onnx.py:176 ] Generating input 'token_type_ids', type = int64, shape = [1, 384]
[INFO benchmark_model.py:270 ] Starting 'sync' performance measurements for 10 seconds
Original Model Path: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned_3layers-aggressive_83
Batch Size: 1
Scenario: singlestream
Throughput (items/sec): 62.1568
Latency Mean (ms/batch): 16.0732
Latency Median (ms/batch): 15.7850
Latency Std (ms/batch): 1.0427
Iterations: 622
```
200 changes: 198 additions & 2 deletions src/deepsparse/image_classification/README.md
# Image Classification Inference Pipelines

[Check out DeepSparse Use Cases for usage details](../../../docs/use-cases/cv/image-classification.md)

The [DeepSparse] Image Classification integration allows accelerated inference,
serving, and benchmarking of sparsified image classification models.
It leverages the DeepSparse Engine to run sparsified image classification
inference with GPU-class performance directly on the CPU.

The DeepSparse Engine takes advantage of sparsity within neural networks to
reduce compute as well as accelerate memory-bound workloads.
The Engine is particularly effective when leveraging sparsification methods
such as [pruning](https://neuralmagic.com/blog/pruning-overview/) and
[quantization](https://arxiv.org/abs/1609.07061). These techniques result in
significantly faster and smaller models with little to no effect on
the baseline metrics.

## Getting Started

Before you start your adventure with the DeepSparse Engine, make sure that
your machine is compatible with our [hardware requirements].

### Installation

```bash
pip install deepsparse
```

### Model Format

By default, to deploy image classification models using the DeepSparse Engine,
the model should be supplied in the [ONNX] format.
This grants the Engine the flexibility to serve any model in a framework-agnostic
manner.

Below we describe two ways to obtain the required ONNX model.

#### Exporting the ONNX file from the contents of a local checkpoint

This pathway is relevant if you intend to deploy a model created using the [SparseML] library.
For more information, refer to the appropriate integration documentation in [SparseML].

1. The output of the [SparseML] training run is saved to an output directory `/{save_dir}` (e.g. `/trained_model`).
2. Depending on the chosen framework, the model files are saved to `model_path` = `/{save_dir}/{framework_name}/{model_tag}` (e.g. `/trained_model/pytorch/resnet50/`).
3. To generate an ONNX model, refer to the [script for image classification ONNX export](https://github.com/neuralmagic/sparseml/blob/main/src/sparseml/pytorch/image_classification/export.py).

Example:
```bash
sparseml.image_classification.export_onnx \
--arch-key resnet50 \
--dataset imagenet \
--dataset-path ~/datasets/ILSVRC2012 \
--checkpoint-path ~/checkpoints/resnet50_checkpoint.pth
```
This creates a `model.onnx` file in the parent directory of your `model_path`.

#### Directly using the SparseZoo stub

Alternatively, you can skip the ONNX model export step by downloading all the required model data directly from Neural Magic's [SparseZoo](https://sparsezoo.neuralmagic.com/).
Example:
```python
import os

from sparsezoo import Model

# you can look up an appropriate model stub here: https://sparsezoo.neuralmagic.com/
model_stub = "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95-none"
model = Model(model_stub)

# accessing .path downloads the model data to your local directory
model_path = model.path

# the ONNX model file is there, ready for deployment
print(os.path.isfile(model.onnx_model.path))  # True
```


## Deployment APIs

DeepSparse provides both a Python Pipeline API and an out-of-the-box model
server that can be used for end-to-end inference in either existing Python
workflows or as an HTTP endpoint. Both options provide similar specifications
for configuration and support a variety of image classification models.

### Python API

Pipelines are the default interface for running inference with the
DeepSparse Engine.

Once a model is obtained, either through [SparseML] training or directly from [SparseZoo],
`deepsparse.Pipeline` can be used to easily facilitate end-to-end inference and deployment
of the sparsified image classification model.

If no model is specified to the `Pipeline` for a given task, the `Pipeline` will automatically
select a pruned and quantized model for the task from the `SparseZoo` that can be used for accelerated
inference. Note that other models in the [SparseZoo] will have different tradeoffs between speed, size,
and accuracy.
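
As a minimal sketch, a `Pipeline` can therefore be created from the task name alone (the exact default model pulled from the SparseZoo may vary between DeepSparse versions):

```python
from deepsparse import Pipeline

# no model_path given: a default sparsified SparseZoo model for the task
# is downloaded and compiled automatically
pipeline = Pipeline.create(task="image_classification")
print(pipeline(images=["my_image.png"]))
```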

To learn about sparsification in more detail, refer to the [SparseML docs](https://docs.neuralmagic.com/sparseml/).

### HTTP Server

As an alternative to the Python API, the DeepSparse inference server allows you to
serve ONNX models and pipelines over HTTP. Configuring and making requests
to the server follow the same parameters and schemas as the Pipelines, enabling
simple deployment. Once launched, a `/docs` endpoint is created with full
endpoint descriptions and support for making sample requests.

An example deployment using a 95% pruned ResNet-50 is given below.
For full documentation on deploying sparse image classification models with the
DeepSparse Server, see the [documentation](https://github.com/neuralmagic/deepsparse/tree/main/src/deepsparse/server).

#### Installation

The deepsparse server requirements can be installed by specifying the `server`
extra dependency when installing DeepSparse.

```bash
pip install deepsparse[server]
```

## Deployment Use Cases

The following section includes example usage of the Pipeline and server APIs for
various image classification models.

[List of Image Classification SparseZoo Models](https://sparsezoo.neuralmagic.com/?domain=cv&sub_domain=classification&page=1)


#### Python Pipeline

```python
from deepsparse import Pipeline

cv_pipeline = Pipeline.create(
    task='image_classification',
    model_path='zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95-none',  # Path to checkpoint or SparseZoo stub
)

input_image = "my_image.png"  # path to a local input image
inference = cv_pipeline(images=input_image)  # returns the predicted labels and scores
```

#### HTTP Server

Spinning up:
```bash
deepsparse.server \
    task image_classification \
    --model_path "zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95-none" \
    --port 5543
```

Making a request:
```python
import requests

url = 'http://0.0.0.0:5543/predict/from_files'
path = ['goldfish.jpeg']  # list of paths to local image files
files = [('request', open(img, 'rb')) for img in path]
resp = requests.post(url=url, files=files)
```
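
The response body contains the pipeline's output schema serialized as JSON, so the result can be inspected directly (exact field names depend on the task and server version):

```python
print(resp.status_code)  # 200 on success
print(resp.json())       # predicted labels and scores for each image
```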

### Benchmarking

The mission of Neural Magic is to enable GPU-class inference performance on commodity CPUs.
Want to find out how fast our sparse ONNX models perform inference?
You can quickly do benchmarking tests on your own with a single CLI command!

You only need to provide the model path of a SparseZoo ONNX model or your own local ONNX model to get started:
```bash
deepsparse.benchmark zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95-none
```
Output:
```bash
Original Model Path: zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95-none
Batch Size: 1
Scenario: async
Throughput (items/sec): 299.2372
Latency Mean (ms/batch): 16.6677
Latency Median (ms/batch): 16.6748
Latency Std (ms/batch): 0.1728
Iterations: 2995
```

To learn more about benchmarking, check out our [Benchmarking tutorial](https://github.com/neuralmagic/deepsparse/tree/main/src/deepsparse/benchmark)!

## Tutorials
For a deeper dive into using image classification models within the Neural Magic
ecosystem, refer to the detailed tutorials on our [website](https://neuralmagic.com/):
- [CV Use Cases](https://neuralmagic.com/use-cases/#computervision)

## Support
For Neural Magic Support, sign up or log in to our [Deep Sparse Community Slack](https://join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ). Bugs, feature requests, or additional questions can also be posted to our [GitHub Issue Queue](https://github.com/neuralmagic/deepsparse/issues).


[DeepSparse]: https://github.com/neuralmagic/deepsparse
[hardware requirements]: https://docs.neuralmagic.com/deepsparse/source/hardware.html
[ONNX]: https://onnx.ai/
[SparseML]: https://github.com/neuralmagic/sparseml
[SparseML Image Classification Documentation]: https://github.com/neuralmagic/sparseml/tree/main/src/sparseml/pytorch/image_classification/README_image_classification.md
[SparseZoo]: https://sparsezoo.neuralmagic.com/
1 change: 1 addition & 0 deletions src/deepsparse/server/README.md
All you need is to add `/docs` at the end of your host URL:
localhost:5543/docs

![alt text](./img/swagger_ui.png)
