Server: add test for num slots, fails on master #6950

Merged

Conversation

JohannesGaessler
Collaborator

While working on #6828 I've been writing more tests to ensure that the results remain the same. However, while doing so I've noticed that, for a given seed and a varying number of slots, the results produced by the server are not deterministic. What I think is happening is that llama.cpp does not produce bit-for-bit identical results as the batch size is changed. Therefore, after some number of tokens, two otherwise identical sequences randomly sample different tokens, at which point they completely diverge.

I don't know if this can be fixed at all, since the only way to get bit-for-bit identical results with floating-point numbers is to do the exact same operations in the exact same order, which would likely not yield good performance. Unless the CPU backend (which I used for testing) is supposed to produce bit-for-bit identical results, in which case this would be indicative of a bug. In any case, feedback would be appreciated.
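As a generic illustration (not llama.cpp code) of why a different batch size can change results: floating-point addition is not associative, so evaluating the same sum in a different order can flip the last bits of the result, and a single flipped bit in the logits can be enough for sampling to pick a different token.

```python
# Generic floating-point non-associativity demo (not llama.cpp code):
# summing the same three values in a different order gives slightly
# different results, which is the kind of tiny discrepancy that can
# eventually make two otherwise identical sequences diverge.
a, b, c = 0.1, 0.2, 0.3
left_to_right = (a + b) + c
right_to_left = a + (b + c)
print(left_to_right == right_to_left)  # False
print(left_to_right, right_to_left)    # 0.6000000000000001 0.6
```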

Sample outputs
content 0:  She was very proud of her story.
One day, she found an old cardboard box in her room. She was curious and decided to cut it in her project. She worked hard all day, cutting and tying until the box was ready.
Just then, her mom came into the room and said, "What are you doing, Sarah?"
Sarah smiled, "I'm cutting the box. I'm making a story about the truth!"
Her mom smiled and said, "That's very good. I'm so proud of you! Can I help you?"
Sarah nod
content 1:  She was very proud of her story.
One day, she found an old cardboard box in her room. She was curious and decided to cut it in her project. She worked hard all day, cutting and tying until the box was ready.
Just then, her mom came into the room and said, "What are you doing, Sarah?"
Sarah smiled, "I'm cutting the box. I'm making a story about the first one. Can you help me?"
Her mom smiled and said, "Of course! Let me see the surprise."
She took out a tube and showed

@JohannesGaessler
Collaborator Author

I should mention that you run into this exact same problem if you have a fixed number of server slots but a varying number of parallel requests, which I would argue is even more problematic.

@JohannesGaessler
Collaborator Author

The sequences diverge for different batch sizes only if the temperature is high enough. I've added tests with temperatures 0 and 1 and commented out those that currently fail on master. To confirm that this is not a sampler issue, I've expanded the tests around seeds: they now assert that the results are consistent for the same seed but different for different seeds.

I've changed the data type of context.seed. It is now a list of potentially different seeds (I think something like this was just not implemented). The interface for concurrent_requests now expects the user prompt as the first argument and the seed as the second argument (the seed is not used for e.g. embeddings).
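A minimal sketch of the seed check described above, with illustrative names only (check_seed_consistency is not the actual test code): for any pair of completions, matching seeds must produce matching results and differing seeds must produce differing results.

```python
# Hypothetical sketch (not the actual test code): given the generated
# completions and the seeds that produced them, results must match when
# the seeds match and differ when the seeds differ.
def check_seed_consistency(completions: list[str], seeds: list[int]) -> None:
    assert len(completions) == len(seeds)
    for i in range(len(completions)):
        for j in range(i + 1, len(completions)):
            if seeds[i] == seeds[j]:
                assert completions[i] == completions[j], \
                    f"seed {seeds[i]}: identical seeds produced different results"
            else:
                assert completions[i] != completions[j], \
                    f"seeds {seeds[i]} vs {seeds[j]}: different seeds produced identical results"

# Example usage with dummy data:
check_seed_consistency(["story A", "story A", "story B"], [42, 42, 1337])
```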

@compilade
Collaborator

compilade commented Apr 30, 2024

> What I think is happening is that llama.cpp does not produce bit-for-bit identical results as the batch size is changed.

This is related to using a unified KV cache. See ggerganov/whisper.cpp#1941 (comment)

(I ran into this before in #6122 (comment))

Inline review comment on these scenario steps:

> And 128 max tokens to predict
> And continuous batching
Minor: continuous batching is enabled by default (and cannot be disabled BTW :) )

@phymbert (Collaborator) left a comment:


Thanks

JohannesGaessler merged commit 3ea0d36 into ggerganov:master on May 1, 2024
24 checks passed
nopperl pushed a commit to nopperl/llama.cpp that referenced this pull request May 5, 2024
teleprint-me pushed a commit to teleprint-me/llama.cpp that referenced this pull request May 7, 2024