
Slowdown between GPT4All-Chat 3.0 and GPT4All-Chat 3.1 #2889

Open
ThiloteE opened this issue Aug 18, 2024 · 3 comments
Labels
bug-unconfirmed chat gpt4all-chat issues

Comments

ThiloteE (Collaborator) commented Aug 18, 2024

Bug Report

There was a noticeable slowdown in LLM inference, roughly 30-40% fewer tokens per second.
The slowdown affects the CPU, CUDA, and Vulkan backends.
This regression has still not been fixed as of GPT4All-Chat 3.2.1.

Steps to Reproduce

Upgrade from GPT4All-Chat 3.0 to GPT4All-Chat 3.1

Expected Behavior

No slowdown.

Your Environment

  • GPT4All version: 3.2.1
  • Operating System: Windows 10
  • Chat model used (if applicable): for example llama-3-8b-instruct, but it happens with all models, e.g. mistral-7b or phi-3-mini-instruct

Hypothesis for root cause

Here is the changelog for version 3.1: https://github.com/nomic-ai/gpt4all/releases/tag/v3.1.0

I strongly suspect the llama.cpp update in #2694 introduced this regression, but who knows. It would need a git bisect to confirm.
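
To quantify the regression with something repeatable (and possibly to drive a bisect later), a quick throughput check could look like the sketch below. It uses the gpt4all Python bindings rather than the chat app, so it is only a rough proxy (the bindings bundle their own llama.cpp build); the model filename, prompt, speed threshold, and the approximation that each streamed chunk is roughly one token are placeholders and assumptions, not part of this report.

```python
#!/usr/bin/env python3
# Rough tokens/second check using the gpt4all Python bindings.
# Placeholders/assumptions: model filename, prompt, speed threshold, and the
# approximation that each streamed text chunk is about one token.
import sys
import time

from gpt4all import GPT4All

MODEL_FILE = "Meta-Llama-3-8B-Instruct.Q4_0.gguf"  # placeholder: any local GGUF model
GOOD_TOKENS_PER_SEC = 35.0  # placeholder threshold below which we call it "slow"
RUNS = 3                    # regenerate a few times and keep the fastest run


def one_run(model: GPT4All) -> float:
    """Generate a fixed response and return the approximate tokens/second."""
    start = time.monotonic()
    chunks = 0
    # streaming=True yields the response as text chunks, roughly one per token.
    for _ in model.generate("Write a short story about a robot.",
                            max_tokens=200, temp=0.001, streaming=True):
        chunks += 1
    elapsed = time.monotonic() - start
    return chunks / elapsed if elapsed > 0 else 0.0


def main() -> int:
    model = GPT4All(MODEL_FILE, device="gpu")  # use device="cpu" for the CPU backend
    best = max(one_run(model) for _ in range(RUNS))
    print(f"best speed: {best:.1f} tok/s")
    # Exit status: 0 = fast enough ("good"), 1 = too slow ("bad"),
    # which is the convention `git bisect run` expects.
    return 0 if best >= GOOD_TOKENS_PER_SEC else 1


if __name__ == "__main__":
    sys.exit(main())
```

Because the script exits 0 above the threshold and 1 below it, it could in principle serve as the test command for `git bisect run`, provided each bisect step also rebuilds the relevant llama.cpp/gpt4all code, which is not shown here.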

ThiloteE added the chat gpt4all-chat issues and bug-unconfirmed labels on Aug 18, 2024
3Simplex (Collaborator)

[image: screenshot of the versions tested]
I ran a small test with the same instruction set across the versions seen in this image, using temp 0.001 to avoid the potential temp-0 problem.

I found that the response itself changes between these 3 versions, 2.7.5, 2.8.0, and 3.1.0, when using CPU.

  • This point is about the generated responses only, not speed.

The following is a report on the speed variance for each version.
For each version, I took the fastest reported speed on a single prompt, regenerated a few times.

  • Version 2.7.5: Vulkan GPU 45 t/s, CPU 10 t/s
  • Version 2.8.0: Vulkan GPU 42 t/s, CPU 9 t/s
  • Version 3.0.0: Vulkan GPU 41 t/s, CPU 9.8 t/s
  • Version 3.1.0: Vulkan GPU 42 t/s, CPU 9.2 t/s
  • Version 3.1.1: Vulkan GPU 41 t/s, CPU 9.3 t/s
  • Version 3.2.1: Vulkan GPU 38 t/s, CPU 9.7 t/s

ThiloteE (Collaborator, Author) commented Aug 19, 2024

Which model? Maybe it is model-related after all. Partial offloading or full offloading?
When running on GPU, I usually offload only 19 layers (partial offload).

I will do more precise testing later.

3Simplex (Collaborator)

I was using the known-good model, Llama 3 8B Instruct, with all layers on the GPU and an 8k context, since that is the model's limit. I don't like partial offloading; for me, running fully on CPU usually works better when the model is too big for the GPU.
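
As a hedged sketch of how full versus partial offload could be compared side by side with the Python bindings: the `ngl` (number of GPU layers) and `n_ctx` (context length) keyword arguments are assumed to exist in the installed gpt4all version, and the model filename and prompt are placeholders.

```python
import time

from gpt4all import GPT4All

MODEL_FILE = "Meta-Llama-3-8B-Instruct.Q4_0.gguf"  # placeholder filename

for label, layers in [("full offload (all layers)", 100), ("partial offload (19 layers)", 19)]:
    # ngl = number of layers offloaded to the GPU, n_ctx = context length
    # (8k, the model's limit). Both kwargs are assumed to exist in the
    # installed gpt4all Python bindings; check your version if they are rejected.
    model = GPT4All(MODEL_FILE, device="gpu", ngl=layers, n_ctx=8192)
    start = time.monotonic()
    out = model.generate("Write a short story about a robot.", max_tokens=200, temp=0.001)
    elapsed = time.monotonic() - start
    print(f"{label}: ~{len(out.split()) / elapsed:.1f} words/s")
    del model  # release the model (and VRAM) before loading the next configuration
```

Running both configurations in one process like this keeps the model, prompt, and sampling settings identical, so any remaining speed difference should come from the offload setting alone.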
