Server: add test for num slots, fails on master #6950

Merged

Conversation

JohannesGaessler
Collaborator

While working on #6828 I've been writing more tests to ensure that the results remain the same. However, while doing so I've noticed that, for a given seed and a varying number of slots, the results produced by the server are not deterministic. What I think is happening is that llama.cpp does not produce bit-for-bit identical results as the batch size is changed. Therefore, after some number of tokens, two otherwise identical sequences randomly sample different tokens, at which point they completely diverge.

I don't know if this can be fixed at all, since the only way to get bit-for-bit identical results with floating-point numbers is to do the exact same operations in the exact same order, which would likely not yield good performance. Unless the CPU backend (which I used for testing) is supposed to produce bit-for-bit identical results, in which case this would be indicative of a bug. In any case, feedback would be appreciated.
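As a generic illustration (not llama.cpp code) of why a different batch size can change results: floating-point addition is not associative, so evaluating the same sum in a different order can flip the last bits of the result, and a single flipped bit in the logits can be enough for sampling to pick a different token.

```python
# Generic floating-point non-associativity demo (not llama.cpp code):
# summing the same three values in a different order gives slightly
# different results, which is the kind of tiny discrepancy that can
# eventually make two otherwise identical sequences diverge.
a, b, c = 0.1, 0.2, 0.3
left_to_right = (a + b) + c
right_to_left = a + (b + c)
print(left_to_right == right_to_left)  # False
print(left_to_right, right_to_left)    # 0.6000000000000001 0.6
```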

Sample outputs
content 0:  She was very proud of her story.
One day, she found an old cardboard box in her room. She was curious and decided to cut it in her project. She worked hard all day, cutting and tying until the box was ready.
Just then, her mom came into the room and said, "What are you doing, Sarah?"
Sarah smiled, "I'm cutting the box. I'm making a story about the truth!"
Her mom smiled and said, "That's very good. I'm so proud of you! Can I help you?"
Sarah nod
content 1:  She was very proud of her story.
One day, she found an old cardboard box in her room. She was curious and decided to cut it in her project. She worked hard all day, cutting and tying until the box was ready.
Just then, her mom came into the room and said, "What are you doing, Sarah?"
Sarah smiled, "I'm cutting the box. I'm making a story about the first one. Can you help me?"
Her mom smiled and said, "Of course! Let me see the surprise."
She took out a tube and showed

@JohannesGaessler
Collaborator Author

I should mention that you run into this exact same problem if you have a fixed number of server slots but a varying number of parallel requests, which I would argue is even more problematic.

@JohannesGaessler
Collaborator Author

The sequences diverge for different batch sizes only if the temperature is high enough. I've added tests with temperatures 0 and 1 and commented out those that currently fail on master. To confirm that this is not a sampler issue, I've expanded the tests around seeds: they now assert that the results are consistent for the same seed but different for different seeds.

I've changed the data type of context.seed. It is now a list of potentially different seeds (I think something like this was just not implemented). The interface for concurrent_requests now expects the user prompt as the first argument and the seed as the second argument (the seed is not used for e.g. embeddings).
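A minimal sketch of the seed check described above, with illustrative names only (check_seed_consistency is not the actual test code): for any pair of completions, matching seeds must produce matching results and differing seeds must produce differing results.

```python
# Hypothetical sketch (not the actual test code): given the generated
# completions and the seeds that produced them, results must match when
# the seeds match and differ when the seeds differ.
def check_seed_consistency(completions: list[str], seeds: list[int]) -> None:
    assert len(completions) == len(seeds)
    for i in range(len(completions)):
        for j in range(i + 1, len(completions)):
            if seeds[i] == seeds[j]:
                assert completions[i] == completions[j], \
                    f"seed {seeds[i]}: identical seeds produced different results"
            else:
                assert completions[i] != completions[j], \
                    f"seeds {seeds[i]} vs {seeds[j]}: different seeds produced identical results"

# Example usage with dummy data:
check_seed_consistency(["story A", "story A", "story B"], [42, 42, 1337])
```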

@compilade
Collaborator

compilade commented Apr 30, 2024

> What I think is happening is that llama.cpp does not produce bit-for-bit identical results as the batch size is changed.

This is related to using a unified KV cache. See ggerganov/whisper.cpp#1941 (comment)

(I ran into this before in #6122 (comment))

Inline review comment on these scenario steps:

> And 128 max tokens to predict
> And continuous batching
Minor: continuous batching is enabled by default (and cannot be disabled BTW :) )

@phymbert (Collaborator) left a comment:


Thanks

JohannesGaessler merged commit 3ea0d36 into ggerganov:master on May 1, 2024
24 checks passed
nopperl pushed a commit to nopperl/llama.cpp that referenced this pull request May 5, 2024
teleprint-me pushed a commit to teleprint-me/llama.cpp that referenced this pull request May 7, 2024