
How should I verify the speedup effect of the algorithm? #15

Open
moonlightian opened this issue Jul 14, 2023 · 4 comments

Comments

@moonlightian

As shown in the paper, the CUTLASS library is used for speedup. But I could not find any code relying on this setup. How should I verify that SparseGPT is faster than dense models at inference? Even if end-to-end speedups are slightly lower, that would be fine. Thanks a lot for your great work~

@efrantar
Member

Hi, SparseGPT itself is only concerned with accurately sparsifying a model; acceleration comes from other software / hardware that is able to exploit sparse models for speedup (such as 2:4 sparsity on Ampere GPUs). Our layer-wise 2:4 speedup measurements were produced directly with the prebuilt kernels available in NVIDIA's CUTLASS profiler. We compiled all the available kernels and then ran a benchmark sweep using this profiler (on an A100 GPU) for FP16/FP16 SpGEMMs of the appropriate matrix shapes. The results of this sweep are the numbers we report. Observing those speedups during full inference will require integrating the corresponding CUTLASS kernels into PyTorch. (Though, I think PyTorch is actually working on an official NVIDIA 2:4 integration, so hopefully actually running 2:4 models will be quite easy very soon.)
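For context, the 2:4 pattern being benchmarked keeps at most two nonzeros in every contiguous group of four weights along a row. Here is a minimal NumPy sketch of magnitude-based 2:4 pruning, for illustration only; this is not the SparseGPT algorithm itself, which additionally updates the remaining weights to compensate for the pruned ones:

```python
import numpy as np

def prune_2_4(weight: np.ndarray) -> np.ndarray:
    """Zero out the 2 smallest-magnitude entries in every group of 4
    along the last dimension (plain magnitude pruning, illustrative)."""
    rows, cols = weight.shape
    assert cols % 4 == 0, "column count must be a multiple of 4"
    groups = weight.reshape(rows, cols // 4, 4)
    # indices of the 2 largest-magnitude entries per group of 4
    keep = np.argsort(np.abs(groups), axis=-1)[..., 2:]
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=-1)
    return (groups * mask).reshape(rows, cols)

w = np.random.randn(8, 16).astype(np.float32)
w24 = prune_2_4(w)
# every group of 4 now has at most 2 nonzeros
nnz_per_group = (w24.reshape(8, -1, 4) != 0).sum(axis=-1)
assert (nnz_per_group <= 2).all()
```

A weight matrix in this form is exactly what the CUTLASS sparse tensor-core kernels (and cuSPARSELt) can compress and accelerate.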

@moonlightian
Author


Thank you for your kind reply~

@moonlightian
Author


@efrantar Hi, following your instructions, I prepared an environment for NVIDIA's CUTLASS profiler and compiled the kernels per the official guide. As for "Observing those speedups during full inference will require integrating the corresponding CUTLASS kernels into PyTorch" mentioned above, I'm confused about how to make this work. Would it be convenient for you to offer some code for speedup testing? Some links to related NVIDIA demos would be fine too. Thanks again
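One possible route, rather than integrating CUTLASS kernels by hand: recent PyTorch builds ship a prototype semi-structured (2:4) sparse tensor wrapper. A sketch of how it might be used for a dense-vs-sparse matmul timing comparison follows; it assumes an Ampere-or-newer GPU and a PyTorch version exposing `torch.sparse.to_sparse_semi_structured` (a prototype API that may change), so treat it as a starting point rather than a verified benchmark:

```python
import torch

def make_24_mask(rows: int, cols: int) -> torch.Tensor:
    """Boolean mask keeping the first 2 entries of every group of 4 columns."""
    pattern = torch.tensor([True, True, False, False])
    return pattern.repeat(rows, cols // 4)

if torch.cuda.is_available():
    # prototype API; requires a recent PyTorch build and an Ampere+ GPU
    from torch.sparse import to_sparse_semi_structured

    n = 4096
    w = torch.randn(n, n, dtype=torch.float16, device="cuda")
    w = w * make_24_mask(n, n).cuda()      # enforce the 2:4 pattern
    w_sp = to_sparse_semi_structured(w)    # compress to 2:4 storage
    x = torch.randn(n, n, dtype=torch.float16, device="cuda")

    # time dense vs. 2:4 matmul with CUDA events
    for name, mat in [("dense", w), ("2:4 sparse", w_sp)]:
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()
        start.record()
        for _ in range(100):
            _ = torch.mm(mat, x)
        end.record()
        torch.cuda.synchronize()
        print(f"{name}: {start.elapsed_time(end) / 100:.3f} ms")
```

This only times the matmuls themselves, so it is closer to the layer-wise numbers in the paper than to an end-to-end measurement.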

@moonlightian moonlightian reopened this Jul 21, 2023
@kiucho

kiucho commented Aug 2, 2023

Hi, I also want to validate the speedup of 2:4-sparsified models over dense models.
As I understand it, to properly use SpMM (sparse matrix × dense matrix multiplication) on NVIDIA Ampere-architecture GPUs (like the A6000 or A100), the cuSPARSELt library needs to be integrated into PyTorch, which I believe is in progress (cuSPARSELt integration).
I have a few questions about this.

  1. Does SparseGPT use the CUTLASS library only for speedup measurement, or does it also use it in place of cuSPARSELt to perform SpMM?

  2. Implementing a profiler within PyTorch seems to be a complex task requiring a deep understanding of both the PyTorch framework and the profiler. I would also be grateful for the profiler and code used for the speedup measurements.

I look forward to hearing from you. Thank you.
