Grouped GEMM

A lighweight library exposing grouped GEMM kernels in PyTorch.

Installation

Run pip install grouped_gemm to install the package.

Compiling from source

By default, the installed package runs in conservative (cuBLAS) mode: it launches one GEMM kernel per batch element instead of using a single grouped GEMM kernel for the whole batch.

To enable using grouped GEMM kernels, you need to switch to the CUTLASS mode by setting the GROUPED_GEMM_CUTLASS environment variable to 1 when building the library. For example, to build the library in CUTLASS mode for Ampere (SM 8.0), clone the repository and run the following:

$ TORCH_CUDA_ARCH_LIST=8.0 GROUPED_GEMM_CUTLASS=1 pip install .

See this comment for some performance measurements on A100 and H100.

Upcoming features

Running grouped GEMM kernels without GPU<->CPU synchronization points.
Hopper-optimized grouped GEMM kernels.

Name		Name	Last commit message	Last commit date
Latest commit History 45 Commits
csrc		csrc
grouped_gemm		grouped_gemm
third_party		third_party
.gitignore		.gitignore
.gitmodules		.gitmodules
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
Makefile		Makefile
README.md		README.md
pyproject.toml		pyproject.toml
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Grouped GEMM

Installation

Compiling from source

Upcoming features

About

Releases 4

Packages

Contributors 4

Languages

License

tgale96/grouped_gemm

Folders and files

Latest commit

History

Repository files navigation

Grouped GEMM

Installation

Compiling from source

Upcoming features

About

Resources

License

Stars

Watchers

Forks

Releases 4

Packages 0

Contributors 4

Languages

Packages