
Slowdown between GPT4All-Chat 3.0 and GPT4All-Chat 3.1 #2889

Open
ThiloteE opened this issue Aug 18, 2024 · 3 comments
Labels
bug-unconfirmed chat gpt4all-chat issues

Comments

ThiloteE (Collaborator) commented Aug 18, 2024

Bug Report

There was a noticeable slowdown in LLM inference, roughly 30-40% fewer tokens per second.
The slowdown affects the CPU, CUDA, and Vulkan backends.
This regression has still not been fixed as of GPT4All-Chat 3.2.1.

Steps to Reproduce

Upgrade from GPT4All-Chat 3.0 to GPT4All-Chat 3.1

Expected Behavior

No slowdown.

Your Environment

  • GPT4All version: 3.2.1
  • Operating System: Windows 10
  • Chat model used (if applicable): for example llama-3-8b-instruct, but it happens with all models, e.g. mistral-7b or phi-3-mini-instruct

Hypothesis for root cause

Here is the changelog for version 3.1: https://github.com/nomic-ai/gpt4all/releases/tag/v3.1.0

I strongly suspect the llama.cpp update in #2694 introduced this regression, but who knows. It would need a git bisect to confirm.
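
To quantify the regression with something repeatable (and possibly to drive a bisect later), a quick throughput check could look like the sketch below. It uses the gpt4all Python bindings rather than the chat app, so it is only a rough proxy (the bindings bundle their own llama.cpp build); the model filename, prompt, speed threshold, and the approximation that each streamed chunk is roughly one token are placeholders and assumptions, not part of this report.

```python
#!/usr/bin/env python3
# Rough tokens/second check using the gpt4all Python bindings.
# Placeholders/assumptions: model filename, prompt, speed threshold, and the
# approximation that each streamed text chunk is about one token.
import sys
import time

from gpt4all import GPT4All

MODEL_FILE = "Meta-Llama-3-8B-Instruct.Q4_0.gguf"  # placeholder: any local GGUF model
GOOD_TOKENS_PER_SEC = 35.0  # placeholder threshold below which we call it "slow"
RUNS = 3                    # regenerate a few times and keep the fastest run


def one_run(model: GPT4All) -> float:
    """Generate a fixed response and return the approximate tokens/second."""
    start = time.monotonic()
    chunks = 0
    # streaming=True yields the response as text chunks, roughly one per token.
    for _ in model.generate("Write a short story about a robot.",
                            max_tokens=200, temp=0.001, streaming=True):
        chunks += 1
    elapsed = time.monotonic() - start
    return chunks / elapsed if elapsed > 0 else 0.0


def main() -> int:
    model = GPT4All(MODEL_FILE, device="gpu")  # use device="cpu" for the CPU backend
    best = max(one_run(model) for _ in range(RUNS))
    print(f"best speed: {best:.1f} tok/s")
    # Exit status: 0 = fast enough ("good"), 1 = too slow ("bad"),
    # which is the convention `git bisect run` expects.
    return 0 if best >= GOOD_TOKENS_PER_SEC else 1


if __name__ == "__main__":
    sys.exit(main())
```

Because the script exits 0 above the threshold and 1 below it, it could in principle serve as the test command for `git bisect run`, provided each bisect step also rebuilds the relevant llama.cpp/gpt4all code, which is not shown here.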

ThiloteE added the chat gpt4all-chat issues and bug-unconfirmed labels on Aug 18, 2024
3Simplex (Collaborator)

[image: screenshot of the versions tested]
I ran a small test with the same instruction set across the versions seen in this image, using temp 0.001 to avoid the potential temp-0 problem.

I found that the response itself changes between these 3 versions, 2.7.5, 2.8.0, and 3.1.0, when using CPU.

  • This point is about the generated responses only, not speed.

The following is a report on the speed variance for each version.
For each version, I took the fastest reported speed on a single prompt, regenerated a few times.

  • Version 2.7.5: Vulkan GPU 45 t/s, CPU 10 t/s
  • Version 2.8.0: Vulkan GPU 42 t/s, CPU 9 t/s
  • Version 3.0.0: Vulkan GPU 41 t/s, CPU 9.8 t/s
  • Version 3.1.0: Vulkan GPU 42 t/s, CPU 9.2 t/s
  • Version 3.1.1: Vulkan GPU 41 t/s, CPU 9.3 t/s
  • Version 3.2.1: Vulkan GPU 38 t/s, CPU 9.7 t/s

ThiloteE (Collaborator, Author) commented Aug 19, 2024

Which model? Maybe it is model-related after all. Partial offloading or full offloading?
When running on GPU, I usually offload only 19 layers (partial offload).

I will do more precise testing later.

3Simplex (Collaborator)

I was using the known-good model, Llama 3 8B Instruct, with all layers on the GPU and an 8k context, since that is the model's limit. I don't like partial offloading; for me, running fully on CPU usually works better when the model is too big for the GPU.
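
As a hedged sketch of how full versus partial offload could be compared side by side with the Python bindings: the `ngl` (number of GPU layers) and `n_ctx` (context length) keyword arguments are assumed to exist in the installed gpt4all version, and the model filename and prompt are placeholders.

```python
import time

from gpt4all import GPT4All

MODEL_FILE = "Meta-Llama-3-8B-Instruct.Q4_0.gguf"  # placeholder filename

for label, layers in [("full offload (all layers)", 100), ("partial offload (19 layers)", 19)]:
    # ngl = number of layers offloaded to the GPU, n_ctx = context length
    # (8k, the model's limit). Both kwargs are assumed to exist in the
    # installed gpt4all Python bindings; check your version if they are rejected.
    model = GPT4All(MODEL_FILE, device="gpu", ngl=layers, n_ctx=8192)
    start = time.monotonic()
    out = model.generate("Write a short story about a robot.", max_tokens=200, temp=0.001)
    elapsed = time.monotonic() - start
    print(f"{label}: ~{len(out.split()) / elapsed:.1f} words/s")
    del model  # release the model (and VRAM) before loading the next configuration
```

Running both configurations in one process like this keeps the model, prompt, and sampling settings identical, so any remaining speed difference should come from the offload setting alone.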
