
Question about throughput #90

Closed
metalwhale opened this issue Jan 1, 2024 · 5 comments

metalwhale commented Jan 1, 2024

Hi. Thank you for publishing this repository. Congratulations on the excellent work and well-written paper.

The paper says Mamba has higher throughput than a Transformer model. To check this, I wrote a simple test that measures the number of tokens generated per second for the Mamba model and for the original model from the TinyStories paper, which is GPT-Neo based. The results vary slightly from run to run, but I observed that GPTNeo consistently has much faster inference than Mamba.

Both models have the same 33M parameters, yet GPTNeo can generate ~6x more tokens per second than Mamba. Could you provide some insight into why Mamba is slower in this case? Perhaps there's something I have missed?

The results are reproducible, and more details are in this gist (I tested on Google Colab).
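For reference, the measurement boils down to something like this (a minimal sketch; the GPT-Neo side is shown, assuming the roneneldan/TinyStories-33M checkpoint from Hugging Face, and the Mamba side is analogous):

```python
# Minimal throughput sketch: time greedy generation, report tokens/sec.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"
tokenizer = AutoTokenizer.from_pretrained("roneneldan/TinyStories-33M")
model = AutoModelForCausalLM.from_pretrained("roneneldan/TinyStories-33M").to(device)

input_ids = tokenizer("Once upon a time", return_tensors="pt").input_ids.to(device)

torch.cuda.synchronize()
start = time.time()
out = model.generate(input_ids, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()
elapsed = time.time() - start

new_tokens = out.shape[1] - input_ids.shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/sec")
```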

Thank you in advance!

tridao (Collaborator) commented Jan 1, 2024

Did you try our benchmark script in this repo? In particular, the 33M model is tiny, so CPU overhead matters a lot; you should pass cg=True to Mamba's generate function to reduce this CPU overhead.
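For example (the checkpoint name below is just an illustration; any MambaLMHeadModel works the same way):

```python
# Sketch: enabling CUDA-graph decoding in mamba_ssm.
import torch
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

model = MambaLMHeadModel.from_pretrained("state-spaces/mamba-130m",
                                         device="cuda", dtype=torch.float16)
input_ids = torch.randint(0, 50257, (1, 16), device="cuda")

# cg=True captures the per-token decode step in a CUDA graph, so the CPU-side
# launch overhead is paid once at capture time instead of on every token.
out = model.generate(input_ids, max_length=256, cg=True)
```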

metalwhale commented Jan 2, 2024

@tridao
Thank you! After enabling cg=True, I can confirm that Mamba's throughput improved significantly. The latest results are below.

| Model | Average tokens per second |
| --- | ---: |
| GPTNeo | 202.31 |
| Mamba (cg off) | 26.92 |
| Mamba (cg on) | 384.86 |

May I ask another noob question? In which section of the Mamba paper is the "cg" option mentioned? I would like to read more about it to gain a deeper understanding. I also wonder whether the comparison is still fair, given that the Transformer model does not have a "cg" option.

Once again, thank you for your excellent work and for sharing! This is truly awesome.

tridao (Collaborator) commented Jan 2, 2024

It's an implementation detail to reduce CPU overhead. You can read the implementation here if you'd like.
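In short, cg captures the per-token decoding step in a CUDA graph and replays it, so the kernels are launched from the captured graph rather than from Python on every token. A generic PyTorch sketch of the pattern (not the repo's exact code, just an illustration of the technique):

```python
# Generic CUDA-graph capture/replay sketch; the Linear layer stands in for
# the per-token decode step of a real model.
import torch

model = torch.nn.Linear(512, 512).cuda()
static_input = torch.zeros(1, 512, device="cuda")

# Warm up on a side stream so capture sees a steady state (per PyTorch docs).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture one forward pass into a graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = model(static_input)

# Per step: copy fresh data into the static input buffer, then replay.
static_input.copy_(torch.randn(1, 512, device="cuda"))
graph.replay()  # the whole captured kernel sequence runs with one CPU call
print(static_output.sum().item())
```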

metalwhale commented
Thank you. Let me take a look.

abdulfatir commented Feb 17, 2024

@metalwhale did you get a chance to look into cg more deeply? If so, would you be able to provide a high-level summary?

@tridao I trained a Mamba model on my own data. However, I am facing an issue where passing cg=True results in the model generating garbage output. The first token looks reasonable, but after that the model jumps to an arbitrary token and then just keeps repeating it. When I set cg=False, the model generates normal output. Do you have any thoughts on this?
