
Frequently Asked Questions (FAQ)


On this page we answer frequently asked questions.


Pruning and Sparsity

Q1: I pruned my model using an element-wise pruner (also known as fine-grained pruning) and the size of the weight tensors is not reduced. What's going on?

A1: There are different types of sparsity patterns, and element-wise sparsity is the simplest case. When you perform fine-grained pruning, you produce tensors that are sparse at the element granularity. The weight tensors are not reduced in size because the zero coefficients are still present in the tensor. Some NN accelerators (ASICs) take advantage of fine-grained sparsity by using a compressed representation of sparse tensors. An Investigation of Sparse Tensor Formats for Tensor Libraries reviews some of these representations, such as the Compressed Sparse Row (CSR) format. When sparse weight tensors are stored in memory in such a compact format, the bandwidth and power required to fetch them into the neural processing unit are reduced. Once the compact tensor is read (in full, or in part), it can be converted back to a dense tensor format to perform the neural compute operation. A further acceleration is achieved if the hardware can instead perform the compute operation directly on the tensor in its compact representation.
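
To make the compact representation concrete, here is a minimal sketch (not Distiller code) that stores a fine-grained-sparse weight matrix in the CSR format using SciPy; the matrix and sparsity level are arbitrary, and only the non-zero values and their coordinates are kept in memory:

import numpy as np
from scipy.sparse import csr_matrix

# A toy 4x6 weight matrix after fine-grained (element-wise) pruning.
w = np.array([[0.0, 0.0, 1.5, 0.0, 0.0, 0.0],
              [0.0, 2.0, 0.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
              [0.5, 0.0, 0.0, 0.0, 3.0, 0.0]])

w_csr = csr_matrix(w)
print(w_csr.data)     # only the non-zero values: [1.5 2.  0.5 3. ]
print(w_csr.indices)  # the column index of each non-zero value
print(w_csr.indptr)   # where each row's values start inside `data`

# The dense tensor can be reconstructed when the compute needs it.
w_dense = w_csr.toarray()
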
The diagram below shows fine-grained sparsity in comparison to other sparsity patterns (source: Exploring the Regularity of Sparse Structure in Convolutional Neural Networks, which explores the different types of sparsity patterns).

[Figure: sparsity patterns]


Q2: I pruned my model using an element-wise pruner and I don't see an improvement in the run-time. What's going on?

A2: The answer to the question above explains why specialized hardware is needed to see a performance gain from fine-grained weight sparsity. Currently, the PyTorch software stack does not support sparse tensor representations in the main NN operations (e.g. convolution and GEMM), so even with the best hardware you will only see a performance boost when exporting the PyTorch model to ONNX and executing the ONNX model on hardware (and a software stack) with support for sparse representations.
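
As a rough illustration (a sketch with arbitrary shapes and sparsity): the stock dense GEMM multiplies the zeros like any other value, and while PyTorch can hold a tensor in a sparse layout, the standard nn layers do not consume it; only dedicated sparse ops such as torch.sparse.mm do.

import torch

# A weight matrix with ~90% element-wise sparsity, still stored densely.
w = torch.randn(1024, 1024)
w[torch.rand_like(w) > 0.1] = 0.0
x = torch.randn(1024, 256)

# The regular dense matrix multiply performs the same work regardless of the zeros.
y_dense = w @ x

# The weights can be converted to a sparse (COO) layout, but nn.Linear / nn.Conv2d
# will not accept such a tensor; only explicit sparse ops like torch.sparse.mm use it.
w_sparse = w.to_sparse()
y_sparse = torch.sparse.mm(w_sparse, x)

assert torch.allclose(y_dense, y_sparse, atol=1e-4)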


Q3: I pruned my model using a block-structured pruner and I don't see an improvement in the run-time. What's going on?

A3: Block pruning refers to pruning 4-D structures of a specific shape. This is similar to filter/channel pruning, but allows for non-regular shapes that accelerate inference on a specific hardware platform. If we want to introduce sparsity in order to reduce the compute load of a certain layer, we need to understand how the hardware and software perform the layer's operation and what vector shape is used. Then we can induce sparsity to match that vector shape. For example, Intel AVX-512 provides SIMD instructions that apply the same instruction (Single Instruction) to a vector of inputs (Multiple Data). The following single instruction performs an element-wise multiplication of two vectors of sixteen 32-bit elements:

__m512i result = _mm512_mullo_epi32(vec_a, vec_b);

vec_a and vec_b may represent activations and weights, respectively. If either vec_a or vec_b is only partially sparse, we still need to perform the multiplication operation and the sparsity does not help reduce the compute latency. However, if either vec_a or vec_b contains only zeros, then we can eliminate the instruction entirely. In this case, we would like to have sparsity in blocks of 16 elements.

Things are a bit more complicated, because we also need to understand how the software maps layer operations to hardware. For example, a 3x3 convolution can be computed as a direct convolution, as a matrix multiplication (GEMM), or as a Winograd matrix operation (to name a few ways of computation). These low-level operations are then mapped to SIMD instructions. Finally, the low-level software needs to support a block-sparse storage format for the weight tensors, as explained in one of the answers above (see for example: http://www.netlib.org/linalg/html_templates/node90.html). The model is exported to ONNX for execution on a deployment HW/SW platform that can recognize the sparsity patterns embedded in the weight tensors and convert them to their compact storage format.
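
The following schematic NumPy sketch (not real SIMD code) illustrates the reasoning above: zeros scattered inside a block do not remove the multiply, but a block that is entirely zero lets us skip the whole 16-wide operation.

import numpy as np

BLOCK = 16  # matches the 16-lane SIMD multiply shown above

def block_sparse_dot(weights, activations, block=BLOCK):
    # Dot product that skips any weight block that is entirely zero,
    # mimicking the elimination of a whole SIMD instruction.
    acc = 0.0
    for start in range(0, len(weights), block):
        w_blk = weights[start:start + block]
        if not w_blk.any():  # the whole block is zero -> skip the multiply
            continue
        acc += np.dot(w_blk, activations[start:start + block])
    return acc

# A weight vector pruned in 16-element blocks: one block is entirely zero.
w = np.random.randn(128)
w[16:32] = 0.0
a = np.random.randn(128)
assert np.isclose(block_sparse_dot(w, a), np.dot(w, a))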

In summary, different hardware platforms benefit from different sparsity patterns.


Q4: I pruned my model using a channel/filter pruner and the weight tensors are sparse, but their shapes and sizes are the same as before pruning. What's going on?

A4: To change the shape of the weight tensors after pruning channels/filters, you need to use 'thinning'. The example below defines and uses a FilterRemover to remove zeroed filters from a model.

extensions:
  net_thinner:
      class: 'FilterRemover'
      thinning_func_str: remove_filters
      arch: 'resnet20_cifar'
      dataset: 'cifar10'
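
For intuition only, the following sketch shows what thinning amounts to in plain PyTorch; it is not Distiller's FilterRemover implementation. Two filters of the first convolution are zeroed, and the "thinned" pair of layers drops those filters and the matching input channels of the next layer while producing the same output:

import torch
import torch.nn as nn

# Two consecutive conv layers; suppose filters 1 and 3 of conv1 were pruned to zero.
conv1 = nn.Conv2d(3, 8, kernel_size=3, padding=1)
conv2 = nn.Conv2d(8, 16, kernel_size=3, padding=1)
with torch.no_grad():
    conv1.weight[[1, 3]] = 0.0
    conv1.bias[[1, 3]] = 0.0

# Thinning keeps only the surviving filters and shrinks the tensors accordingly.
keep = (conv1.weight.abs().sum(dim=(1, 2, 3)) != 0).nonzero(as_tuple=True)[0]
thin1 = nn.Conv2d(3, len(keep), kernel_size=3, padding=1)
thin2 = nn.Conv2d(len(keep), 16, kernel_size=3, padding=1)
with torch.no_grad():
    thin1.weight.copy_(conv1.weight[keep])
    thin1.bias.copy_(conv1.bias[keep])
    thin2.weight.copy_(conv2.weight[:, keep])  # drop the matching input channels
    thin2.bias.copy_(conv2.bias)

x = torch.randn(1, 3, 32, 32)
assert torch.allclose(conv2(conv1(x)), thin2(thin1(x)), atol=1e-5)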

Quantization

Q1: I quantized my model, but it is not running faster than the FP32 model. What's going on?

A1: As currently implemented, Distiller only simulates post-training quantization. This allows us to study the effect of quantization on accuracy. It does not, unfortunately, provide insights on the actual runtime of the quantized model. We did not implement low-level specialized operations that utilize the 8-bit capabilities of the CPU/GPU.

Specifically, the way post-training quantization is implemented, supported layers are wrapped with quantize/de-quantize operations, but the layers themselves are unchanged. They still operate on FP32 tensors, but these tensors are restricted to contain only integer values. So, as you can see, we are only adding operations in order to simulate quantization. Therefore it isn't surprising at all that you're getting slower runtime when quantizing.

We do not have plans to implement "native" low-precision operations within Distiller. We do, however, plan to support exporting quantized models using ONNX, once quantization is published as part of the standard. Then you'll be able to export a model quantized in Distiller and run it on a framework that supports actual 8-bit execution on the CPU/GPU. In addition, 8-bit support in PyTorch itself is also in the works (it's already implemented in Caffe2). Once that's released, we'll see how we can integrate Distiller with it.
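
To make that concrete, here is a minimal sketch of such a quantize/de-quantize pair (an asymmetric 8-bit scheme chosen for illustration; not Distiller's exact quantizer). Note that the "quantized" tensor is still an FP32 tensor; it just holds integer values, so nothing runs faster:

import torch

def fake_quantize(t, num_bits=8):
    # "Quantize": map the tensor onto an integer grid, but keep it in FP32.
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (t.max() - t.min()) / (qmax - qmin)
    zero_point = qmin - torch.round(t.min() / scale)
    q = torch.clamp(torch.round(t / scale + zero_point), qmin, qmax)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    # "De-quantize": map back to the original range, still FP32.
    return (q - zero_point) * scale

w = torch.randn(64, 64)
q, scale, zp = fake_quantize(w)
print(q.dtype)                                       # torch.float32 -- integer values held in an FP32 tensor
print(dequantize(q, scale, zp).sub(w).abs().max())   # small quantization error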