
Frequently Asked Questions (FAQ)


On this page we answer frequently asked questions.

  • Q: I pruned my model using an element-wise pruner (also known as fine-grained pruning) and the size of the weight tensors is not reduced. What's going on?

    A: There are several types of sparsity patterns, and element-wise sparsity is the simplest. Fine-grained pruning produces tensors that are sparse at the granularity of individual elements, but the weight tensors do not shrink because the zero coefficients are still stored explicitly in the dense tensor.

    Some NN accelerators (ASICs) exploit fine-grained sparsity by using a compressed representation of sparse tensors. An Investigation of Sparse Tensor Formats for Tensor Libraries reviews several such representations, for example the Compressed Sparse Row (CSR) format. When sparse weight tensors are stored in memory in a compact format, less bandwidth and power are required to fetch them into the neural processing unit. Once the compact tensor is read (in full, or in part), it can be converted back to a dense format to perform the neural compute operation; a further acceleration is achieved if the hardware can instead perform the compute operation directly on the compact representation. The sketch after the figure below illustrates the storage saving.

    The diagram below compares fine-grained sparsity to other sparsity patterns (source: Exploring the Regularity of Sparse Structure in Convolutional Neural Networks, which explores the different types of sparsity patterns).

*(Figure: sparsity patterns, from fine-grained element-wise sparsity to coarser structured patterns.)*
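The following is a minimal sketch (not Distiller code) of the storage argument above: it applies an element-wise mask to a weight tensor and compares the dense footprint to a SciPy CSR representation. The tensor shape and the ~90% sparsity level are arbitrary choices for illustration.

```python
# Minimal illustration (not Distiller code): element-wise pruning keeps the
# tensor dense, while a CSR representation stores only the non-zeros.
import torch
from scipy.sparse import csr_matrix

weight = torch.randn(256, 512)            # a dense weight tensor
mask = torch.rand_like(weight) > 0.9      # keep ~10% of the elements
pruned = weight * mask                    # fine-grained (element-wise) pruning

# Still dense: every zero coefficient is stored explicitly.
dense_bytes = pruned.numel() * pruned.element_size()

# CSR stores only the non-zero values plus two index arrays.
csr = csr_matrix(pruned.numpy())
csr_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes

print(f"dense: {dense_bytes:,} bytes  CSR: {csr_bytes:,} bytes")
# With ~90% zeros the CSR form is roughly 5x smaller here
# (values + column indices + row pointers vs. a full dense array).
```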

  • Q: I pruned my model using an element-wise pruner and I don't see an improvement in the run-time. What's going on?

    A: The answer to the previous question explains why specialized hardware is needed to see a performance gain from fine-grained weight sparsity. Currently, the PyTorch software stack does not support sparse tensor representations in the main NN operations (e.g. convolution and GEMM). Therefore, even with the best hardware, you will only see a performance boost after exporting the PyTorch model to ONNX and executing the ONNX model on hardware that supports sparse representations. The sketch below demonstrates the PyTorch side of this.
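A quick way to convince yourself: the sketch below (not Distiller code) zeroes ~90% of a convolution's weights and times the forward pass before and after. Because PyTorch's dense convolution kernels multiply every coefficient, zeros included, the two timings come out essentially equal. Layer sizes and the sparsity level are arbitrary choices for illustration.

```python
# Minimal timing sketch (not Distiller code): zeroed weights in a dense
# tensor do not reduce the run-time of PyTorch's dense convolution.
import time
import torch

conv = torch.nn.Conv2d(64, 64, kernel_size=3, padding=1)
x = torch.randn(1, 64, 56, 56)

def bench(module, reps=100):
    with torch.no_grad():
        module(x)                                  # warm-up
        start = time.perf_counter()
        for _ in range(reps):
            module(x)
    return (time.perf_counter() - start) / reps

t_before = bench(conv)
with torch.no_grad():
    conv.weight.mul_((torch.rand_like(conv.weight) > 0.9).float())  # ~90% zeros
t_after = bench(conv)

print(f"dense: {t_before * 1e3:.2f} ms   90% zeros: {t_after * 1e3:.2f} ms")
# Expect the two numbers to be nearly identical.
```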

  • Q: I pruned my model using a structured pruner and I don't see an improvement in the run-time. What's going on?

    A: Different hardware platforms benefit from different sparsity patterns. On general-purpose hardware running dense kernels, structured sparsity translates into a run-time improvement only after the pruned structures (e.g. entire filters or channels) are physically removed from the network, an operation Distiller calls "thinning", which yields genuinely smaller dense tensors. Zeroing a structure without removing it leaves the amount of compute unchanged, for the same reason given in the previous answer. A sketch of the idea follows below.
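The following is a simplified, hypothetical sketch of the thinning idea (Distiller's real implementation also patches downstream layers, batch-norm parameters, and so on): after structured pruning zeroes whole output-channel filters, the surviving filters are copied into a physically smaller Conv2d that actually performs fewer operations.

```python
# Simplified thinning sketch (hypothetical, not Distiller's implementation):
# physically remove pruned output channels so the layer really computes less.
import torch

conv = torch.nn.Conv2d(64, 64, kernel_size=3, padding=1)

# Pretend a structured (filter) pruner zeroed output channels 32..63.
with torch.no_grad():
    conv.weight[32:] = 0.0

# Keep only the filters with a non-zero L1 norm.
keep = conv.weight.abs().sum(dim=(1, 2, 3)) > 0

thin = torch.nn.Conv2d(64, int(keep.sum()), kernel_size=3, padding=1)
with torch.no_grad():
    thin.weight.copy_(conv.weight[keep])
    thin.bias.copy_(conv.bias[keep])

x = torch.randn(1, 64, 56, 56)
print(conv(x).shape, thin(x).shape)
# torch.Size([1, 64, 56, 56]) torch.Size([1, 32, 56, 56])
# The thinned layer performs half the multiply-accumulates; any layer that
# consumes its output must also be adjusted to accept 32 input channels.
```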