
Question about throughput #90

Closed
metalwhale opened this issue Jan 1, 2024 · 5 comments

metalwhale commented Jan 1, 2024

Hi. Thank you for publishing this repository. Congratulations on the excellent work and well-written paper.

The paper says Mamba has higher throughput than a Transformer model. To check this, I wrote a simple test that measures the number of tokens generated per second for the Mamba model and for the original model from the TinyStories paper, which is GPT-Neo based. The results vary slightly from run to run, but I observed that GPTNeo consistently has much faster inference than Mamba.

Both models have the same 33M parameters, yet GPTNeo can generate ~6x more tokens per second than Mamba. Could you provide some insight into why Mamba is slower in this case? Perhaps there's something I have missed?

The results are reproducible, and more details are in this gist (I tested on Google Colab).
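For reference, the measurement boils down to something like this (a minimal sketch; the GPT-Neo side is shown, assuming the roneneldan/TinyStories-33M checkpoint from Hugging Face, and the Mamba side is analogous):

```python
# Minimal throughput sketch: time greedy generation, report tokens/sec.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"
tokenizer = AutoTokenizer.from_pretrained("roneneldan/TinyStories-33M")
model = AutoModelForCausalLM.from_pretrained("roneneldan/TinyStories-33M").to(device)

input_ids = tokenizer("Once upon a time", return_tensors="pt").input_ids.to(device)

torch.cuda.synchronize()
start = time.time()
out = model.generate(input_ids, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()
elapsed = time.time() - start

new_tokens = out.shape[1] - input_ids.shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/sec")
```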

Thank you in advance!

tridao (Collaborator) commented Jan 1, 2024

Did you try our benchmark script in this repo? In particular, the 33M model is tiny, so CPU overhead matters a lot; you should pass cg=True to Mamba's generate function to reduce this CPU overhead.
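For example (the checkpoint name below is just an illustration; any MambaLMHeadModel works the same way):

```python
# Sketch: enabling CUDA-graph decoding in mamba_ssm.
import torch
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

model = MambaLMHeadModel.from_pretrained("state-spaces/mamba-130m",
                                         device="cuda", dtype=torch.float16)
input_ids = torch.randint(0, 50257, (1, 16), device="cuda")

# cg=True captures the per-token decode step in a CUDA graph, so the CPU-side
# launch overhead is paid once at capture time instead of on every token.
out = model.generate(input_ids, max_length=256, cg=True)
```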

metalwhale commented Jan 2, 2024

@tridao
Thank you! After enabling cg=True, I can confirm that Mamba's throughput improved significantly. The latest results are below.

| Model | Average tokens per second |
| --- | ---: |
| GPTNeo | 202.31 |
| Mamba (cg off) | 26.92 |
| Mamba (cg on) | 384.86 |

May I ask another noob question? In which section of the Mamba paper is the "cg" option mentioned? I would like to read more about it to gain a deeper understanding. I also wonder whether the comparison is still fair, given that the Transformer model does not have a "cg" option.

Once again, thank you for your excellent work and for sharing! This is truly awesome.

tridao (Collaborator) commented Jan 2, 2024

It's an implementation detail to reduce CPU overhead. You can read the implementation here if you'd like.
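In short, cg captures the per-token decoding step in a CUDA graph and replays it, so the kernels are launched from the captured graph rather than from Python on every token. A generic PyTorch sketch of the pattern (not the repo's exact code, just an illustration of the technique):

```python
# Generic CUDA-graph capture/replay sketch; the Linear layer stands in for
# the per-token decode step of a real model.
import torch

model = torch.nn.Linear(512, 512).cuda()
static_input = torch.zeros(1, 512, device="cuda")

# Warm up on a side stream so capture sees a steady state (per PyTorch docs).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture one forward pass into a graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = model(static_input)

# Per step: copy fresh data into the static input buffer, then replay.
static_input.copy_(torch.randn(1, 512, device="cuda"))
graph.replay()  # the whole captured kernel sequence runs with one CPU call
print(static_output.sum().item())
```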

metalwhale commented
Thank you. Let me take a look.

abdulfatir commented Feb 17, 2024

@metalwhale did you get a chance to look into cg more deeply? If so, would you be able to provide a high-level summary?

@tridao I trained a Mamba model on my own data. However, I am facing an issue where passing cg=True results in the model generating garbage output. The first token looks reasonable, but after that the model jumps to an arbitrary token and then just keeps repeating it. When I set cg=False, the model generates normal output. Do you have any thoughts on this?
