DeepSparse v0.12.0

@jeanniefinks released this 22 Apr 13:44 · 5 commits to release/0.12 since this release · 01a427a

New Features:

Documentation:

Changes:

Performance:

  • Speedup for large batch sizes when using sync mode on AMD EPYC processors.
  • AVX2 improvements for quantized models:
    • Up to 40% speedup out of the box for dense quantized models.
    • Up to 20% speedup for pruned quantized BERT, ResNet-50, and MobileNet.
  • Speedup from sparsity realized for ConvInteger operators.
  • Model compilation time decreased on systems with many cores.
  • Multi-stream scheduler: certain computations that were previously performed at runtime are now precomputed.
  • Hugging Face Transformers integration updated to latest state from upstream main branch.

Resolved Issues:

  • When running quantized BERT with a sequence length not divisible by 4, the DeepSparse Engine no longer disables optimizations, which previously resulted in very poor performance.
  • Users executing arch.bin now receive a correct architecture profile of their system.

Known Issues:

  • When running the DeepSparse engine on a system with a nonuniform system topology, for example, an AMD EPYC processor where some cores per core-complex (CCX) have been disabled, model compilation will never terminate. A workaround is to set the environment variable NM_SERIAL_UNIT_GENERATION=1.
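
The workaround above can be applied by setting the environment variable in the shell before launching the process that compiles the model. A minimal sketch (the inference command shown is a placeholder for your own entry point):

```shell
# Workaround for non-terminating model compilation on nonuniform
# topologies (e.g., AMD EPYC with some cores per CCX disabled).
export NM_SERIAL_UNIT_GENERATION=1

# Then run your DeepSparse workload as usual, e.g.:
# python my_inference_script.py   # placeholder
echo "NM_SERIAL_UNIT_GENERATION=${NM_SERIAL_UNIT_GENERATION}"
```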