
How should I verify the speedup effect of the algorithm? #15

Open
moonlightian opened this issue Jul 14, 2023 · 4 comments

Comments

@moonlightian

As shown in the paper, the CUTLASS library is used for speedup. But I could not find any code relying on this setup. How should I verify that SparseGPT is faster than dense models at inference? Even if end-to-end speedups are slightly lower, that would be fine. Thanks a lot for your great work~

@efrantar
Member

Hi, SparseGPT itself is only concerned with accurately sparsifying a model; acceleration comes from other software / hardware that is able to exploit sparse models for speedup (such as 2:4 sparsity on Ampere GPUs). Our layer-wise 2:4 speedup measurements were produced directly with the prebuilt kernels available in NVIDIA's CUTLASS profiler. We compiled all the available kernels and then ran a benchmark sweep using this profiler (on an A100 GPU) for FP16/FP16 SpGEMMs of the appropriate matrix shapes. The results of this sweep are the numbers we report. Observing those speedups during full inference will require integrating the corresponding CUTLASS kernels into PyTorch. (Though, I think PyTorch is actually working on an official NVIDIA 2:4 integration, so hopefully actually running 2:4 models will be quite easy very soon.)
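For context, the 2:4 pattern being benchmarked keeps at most two nonzeros in every contiguous group of four weights along a row. Here is a minimal NumPy sketch of magnitude-based 2:4 pruning, for illustration only; this is not the SparseGPT algorithm itself, which additionally updates the remaining weights to compensate for the pruned ones:

```python
import numpy as np

def prune_2_4(weight: np.ndarray) -> np.ndarray:
    """Zero out the 2 smallest-magnitude entries in every group of 4
    along the last dimension (plain magnitude pruning, illustrative)."""
    rows, cols = weight.shape
    assert cols % 4 == 0, "column count must be a multiple of 4"
    groups = weight.reshape(rows, cols // 4, 4)
    # indices of the 2 largest-magnitude entries per group of 4
    keep = np.argsort(np.abs(groups), axis=-1)[..., 2:]
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=-1)
    return (groups * mask).reshape(rows, cols)

w = np.random.randn(8, 16).astype(np.float32)
w24 = prune_2_4(w)
# every group of 4 now has at most 2 nonzeros
nnz_per_group = (w24.reshape(8, -1, 4) != 0).sum(axis=-1)
assert (nnz_per_group <= 2).all()
```

A weight matrix in this form is exactly what the CUTLASS sparse tensor-core kernels (and cuSPARSELt) can compress and accelerate.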

@moonlightian
Author


Thank you for your kind reply~

@moonlightian
Author


@efrantar Hi, following your instructions, I prepared an environment for NVIDIA's CUTLASS profiler and compiled the kernels per the official guide. As for "Observing those speedups during full inference will require integrating the corresponding CUTLASS kernels into PyTorch" mentioned above, I'm confused about how to make this work. Would it be convenient for you to offer some code for speedup testing? Some links to related NVIDIA demos would be fine too. Thanks again
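One possible route, rather than integrating CUTLASS kernels by hand: recent PyTorch builds ship a prototype semi-structured (2:4) sparse tensor wrapper. A sketch of how it might be used for a dense-vs-sparse matmul timing comparison follows; it assumes an Ampere-or-newer GPU and a PyTorch version exposing `torch.sparse.to_sparse_semi_structured` (a prototype API that may change), so treat it as a starting point rather than a verified benchmark:

```python
import torch

def make_24_mask(rows: int, cols: int) -> torch.Tensor:
    """Boolean mask keeping the first 2 entries of every group of 4 columns."""
    pattern = torch.tensor([True, True, False, False])
    return pattern.repeat(rows, cols // 4)

if torch.cuda.is_available():
    # prototype API; requires a recent PyTorch build and an Ampere+ GPU
    from torch.sparse import to_sparse_semi_structured

    n = 4096
    w = torch.randn(n, n, dtype=torch.float16, device="cuda")
    w = w * make_24_mask(n, n).cuda()      # enforce the 2:4 pattern
    w_sp = to_sparse_semi_structured(w)    # compress to 2:4 storage
    x = torch.randn(n, n, dtype=torch.float16, device="cuda")

    # time dense vs. 2:4 matmul with CUDA events
    for name, mat in [("dense", w), ("2:4 sparse", w_sp)]:
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()
        start.record()
        for _ in range(100):
            _ = torch.mm(mat, x)
        end.record()
        torch.cuda.synchronize()
        print(f"{name}: {start.elapsed_time(end) / 100:.3f} ms")
```

This only times the matmuls themselves, so it is closer to the layer-wise numbers in the paper than to an end-to-end measurement.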

@moonlightian moonlightian reopened this Jul 21, 2023
@kiucho

kiucho commented Aug 2, 2023

Hi, I also want to validate the speedup of 2:4-sparsified models over dense models.
As I understand it, to properly use SpMM (sparse matrix × dense matrix multiplication) on NVIDIA Ampere-architecture GPUs (like the A6000 or A100), the cuSPARSELt library needs to be integrated into PyTorch, which I believe is in progress (cuSPARSELt integration).
I have a few questions about this.

  1. Does SparseGPT use the CUTLASS library only for speedup measurement, or does it also use it in place of cuSPARSELt to perform SpMM?

  2. Implementing a profiler within PyTorch seems to be a complex task requiring a deep understanding of both the PyTorch framework and the profiler. I would also be grateful for the profiler and code used for the speedup measurements.

I look forward to hearing from you. Thank you.
