Make MegaBlocks go vroom on Hopper. #24

tgale96 · 2023-09-23T18:19:39Z

Add a grouped GEMM-based dMoE to work around Triton limitations on SM90. Guard turbo use so that we do not need it installed if quantization is not enabled. Add layer-wise dMoE benchmarks.

After this PR, we recommend enabling grouped_mlp for SM90. grouped_mlp should be used only with expert model parallelism to keep per-device expert counts low, which is important for efficiency with the current cuBLAS-based grouped GEMM kernels.
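To make the recommendation concrete, here is a minimal pure-Python sketch of why the per-device expert count matters. It assumes a grouped GEMM is conceptually one independent matmul per local expert over that expert's contiguous slice of tokens. The names `experts_per_device`, `matmul`, and `grouped_matmul_reference` are hypothetical helpers for illustration only; they are not part of MegaBlocks or of any grouped GEMM library, and the real kernels run on the GPU.

```python
from typing import List

Matrix = List[List[float]]


def experts_per_device(num_experts: int, expert_parallel_size: int) -> int:
    """Experts held by each device under expert model parallelism (hypothetical helper)."""
    if num_experts % expert_parallel_size != 0:
        raise ValueError("num_experts must be divisible by expert_parallel_size")
    return num_experts // expert_parallel_size


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matmul: (m x k) @ (k x n) -> (m x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def grouped_matmul_reference(
    x: Matrix, weights: List[Matrix], tokens_per_expert: List[int]
) -> Matrix:
    """Reference semantics only: rows of x are assumed sorted by expert, and each
    contiguous group of rows is multiplied by that expert's weight matrix."""
    assert len(weights) == len(tokens_per_expert)
    assert sum(tokens_per_expert) == len(x)
    out: Matrix = []
    start = 0
    for w, n in zip(weights, tokens_per_expert):
        # One matmul per local expert: fewer local experts means fewer,
        # larger matmuls for the same number of tokens.
        out.extend(matmul(x[start:start + n], w))
        start += n
    return out


if __name__ == "__main__":
    # Illustrative values: 32 experts split 8 ways leaves 4 experts per device.
    print(experts_per_device(32, 8))
```

Under this assumption, the loop runs once per local expert, so expert model parallelism shortens it by spreading the experts across devices.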


tgale96 · 2023-09-23T19:22:27Z

dMoE benchmarks on 8x H100 with 8-way EMP:

============================================================
dMoE (Fwd) Benchmark
Benchmark Parameters:
batch_size = 2
sequence_length = 2048
hidden_size = 2048
ffn_hidden_size = 2048
num_experts = 32
top_k = 4
grouped_mlp = False
Results:
mean time = 4.263ms, std time = 3.202ms
============================================================
============================================================
dMoE (Fwd) Benchmark
Benchmark Parameters:
batch_size = 2
sequence_length = 2048
hidden_size = 2048
ffn_hidden_size = 2048
num_experts = 32
top_k = 4
grouped_mlp = True
Results:
mean time = 3.605ms, std time = 3.911ms
============================================================
============================================================
dMoE (Fwd) Benchmark
Benchmark Parameters:
batch_size = 4
sequence_length = 2048
hidden_size = 2048
ffn_hidden_size = 2048
num_experts = 32
top_k = 4
grouped_mlp = False
Results:
mean time = 7.239ms, std time = 5.606ms
============================================================
============================================================
dMoE (Fwd) Benchmark
Benchmark Parameters:
batch_size = 4
sequence_length = 2048
hidden_size = 2048
ffn_hidden_size = 2048
num_experts = 32
top_k = 4
grouped_mlp = True
Results:
mean time = 6.690ms, std time = 4.307ms
============================================================
============================================================
dMoE (Fwd) Benchmark
Benchmark Parameters:
batch_size = 2
sequence_length = 2048
hidden_size = 2560
ffn_hidden_size = 2560
num_experts = 32
top_k = 4
grouped_mlp = False
Results:
mean time = 5.165ms, std time = 4.151ms
============================================================
============================================================
dMoE (Fwd) Benchmark
Benchmark Parameters:
batch_size = 2
sequence_length = 2048
hidden_size = 2560
ffn_hidden_size = 2560
num_experts = 32
top_k = 4
grouped_mlp = True
Results:
mean time = 4.092ms, std time = 3.154ms
============================================================
============================================================
dMoE (Fwd) Benchmark
Benchmark Parameters:
batch_size = 4
sequence_length = 2048
hidden_size = 2560
ffn_hidden_size = 2560
num_experts = 32
top_k = 4
grouped_mlp = False
Results:
mean time = 8.410ms, std time = 5.480ms
============================================================
============================================================
dMoE (Fwd) Benchmark
Benchmark Parameters:
batch_size = 4
sequence_length = 2048
hidden_size = 2560
ffn_hidden_size = 2560
num_experts = 32
top_k = 4
grouped_mlp = True
Results:
mean time = 7.575ms, std time = 4.554ms
============================================================
============================================================
dMoE (Fwd) Benchmark
Benchmark Parameters:
batch_size = 2
sequence_length = 2048
hidden_size = 4096
ffn_hidden_size = 4096
num_experts = 32
top_k = 4
grouped_mlp = False
Results:
mean time = 7.288ms, std time = 3.739ms
============================================================
============================================================
dMoE (Fwd) Benchmark
Benchmark Parameters:
batch_size = 2
sequence_length = 2048
hidden_size = 4096
ffn_hidden_size = 4096
num_experts = 32
top_k = 4
grouped_mlp = True
Results:
mean time = 5.638ms, std time = 3.959ms
============================================================
============================================================
dMoE (Fwd) Benchmark
Benchmark Parameters:
batch_size = 4
sequence_length = 2048
hidden_size = 4096
ffn_hidden_size = 4096
num_experts = 32
top_k = 4
grouped_mlp = False
Results:
mean time = 13.633ms, std time = 4.487ms
============================================================
============================================================
dMoE (Fwd) Benchmark
Benchmark Parameters:
batch_size = 4
sequence_length = 2048
hidden_size = 4096
ffn_hidden_size = 4096
num_experts = 32
top_k = 4
grouped_mlp = True
Results:
mean time = 10.527ms, std time = 3.780ms
============================================================
============================================================
dMoE (Fwd) Benchmark
Benchmark Parameters:
batch_size = 2
sequence_length = 2048
hidden_size = 5120
ffn_hidden_size = 5120
num_experts = 32
top_k = 4
grouped_mlp = False
Results:
mean time = 9.172ms, std time = 4.656ms
============================================================
============================================================
dMoE (Fwd) Benchmark
Benchmark Parameters:
batch_size = 2
sequence_length = 2048
hidden_size = 5120
ffn_hidden_size = 5120
num_experts = 32
top_k = 4
grouped_mlp = True
Results:
mean time = 7.209ms, std time = 4.374ms
============================================================
============================================================
dMoE (Fwd) Benchmark
Benchmark Parameters:
batch_size = 4
sequence_length = 2048
hidden_size = 5120
ffn_hidden_size = 5120
num_experts = 32
top_k = 4
grouped_mlp = False
Results:
mean time = 17.286ms, std time = 5.826ms
============================================================
============================================================
dMoE (Fwd) Benchmark
Benchmark Parameters:
batch_size = 4
sequence_length = 2048
hidden_size = 5120
ffn_hidden_size = 5120
num_experts = 32
top_k = 4
grouped_mlp = True
Results:
mean time = 12.779ms, std time = 5.501ms
============================================================
============================================================
dMoE (Fwd) Benchmark
Benchmark Parameters:
batch_size = 2
sequence_length = 2048
hidden_size = 7168
ffn_hidden_size = 7168
num_experts = 32
top_k = 4
grouped_mlp = False
Results:
mean time = 14.249ms, std time = 4.010ms
============================================================
============================================================
dMoE (Fwd) Benchmark
Benchmark Parameters:
batch_size = 2
sequence_length = 2048
hidden_size = 7168
ffn_hidden_size = 7168
num_experts = 32
top_k = 4
grouped_mlp = True
Results:
mean time = 10.088ms, std time = 3.611ms
============================================================
============================================================
dMoE (Fwd) Benchmark
Benchmark Parameters:
batch_size = 4
sequence_length = 2048
hidden_size = 7168
ffn_hidden_size = 7168
num_experts = 32
top_k = 4
grouped_mlp = False
Results:
mean time = 28.500ms, std time = 6.034ms
============================================================
============================================================
dMoE (Fwd) Benchmark
Benchmark Parameters:
batch_size = 4
sequence_length = 2048
hidden_size = 7168
ffn_hidden_size = 7168
num_experts = 32
top_k = 4
grouped_mlp = True
Results:
mean time = 19.245ms, std time = 5.647ms
============================================================
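The log above is easier to compare as ratios. The following sketch copies the mean forward times from the log and computes the speedup of `grouped_mlp = True` over `grouped_mlp = False` for each configuration. The dictionary layout and the `speedups` helper are illustrative choices, not part of the benchmark script.

```python
# Mean forward times in milliseconds, copied from the benchmark log.
# Key: (hidden_size, batch_size); value: (grouped_mlp=False, grouped_mlp=True).
MEAN_MS = {
    (2048, 2): (4.263, 3.605),
    (2048, 4): (7.239, 6.690),
    (2560, 2): (5.165, 4.092),
    (2560, 4): (8.410, 7.575),
    (4096, 2): (7.288, 5.638),
    (4096, 4): (13.633, 10.527),
    (5120, 2): (9.172, 7.209),
    (5120, 4): (17.286, 12.779),
    (7168, 2): (14.249, 10.088),
    (7168, 4): (28.500, 19.245),
}


def speedups(mean_ms):
    """Return baseline_time / grouped_time for every configuration."""
    return {cfg: base / grouped for cfg, (base, grouped) in mean_ms.items()}


if __name__ == "__main__":
    for (hidden, batch), ratio in sorted(speedups(MEAN_MS).items()):
        print(f"hidden_size={hidden:5d} batch_size={batch} speedup={ratio:.2f}x")
```

On these numbers the ratio ranges from roughly 1.08x (hidden_size 2048, batch_size 4) to roughly 1.48x (hidden_size 7168, batch_size 4) and tends to grow with hidden size. The reported standard deviations are large relative to the means, so the ratios should be read as rough indications rather than precise measurements.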

tgale96 added 9 commits September 19, 2023 15:00

Adding yaml for matmul benchmark on mcloud. Fixed benchmark.

61b23e3

Adding benchmark script for Triton H100 matmul.

dba2db9

Integrate grouped gemm into matmul benchmarks. Guard turbo import on availability.

5b02527

Update matmul benchmarks

7572c60

Add back other matmul problems.

c23fd65

Adding support for GroupedMLP.

399b09d

Clean up yamls and matmul benchmarks.

005c683

Add layer-wise benchmarks for dMoE

feb1f24

Add more micro batch sizes to benchmark. Add script for mcloud.

e8a9937

tgale96 merged commit fbc8851 into main Sep 25, 2023

mvpatel2000 deleted the mcloud branch November 6, 2023 20:07

casper-hansen mentioned this pull request Jan 3, 2024

[BOUNTY] Optimized Triton Kernels for full fine tunes axolotl-ai-cloud/axolotl#1038

Open
