
server : fix segfault on long system prompt #8987

Merged
merged 4 commits on Aug 14, 2024

Conversation

compilade
Collaborator

Should fix #8975 (reported by @pritam-dey3).

The problem was introduced in #6017, which limited the size of the llama_batch used by the server to n_batch instead of n_ctx, without updating the system prompt processing logic, which still put the entire system prompt into the llama_batch at once.

This fix limits how many tokens of the system prompt are put into the llama_batch of the server at a time so that it never exceeds n_batch, while still allowing larger system prompts by splitting them across batches.

cc @slaren


@compilade compilade added the labels bugfix, Review Complexity : Low, and server on Aug 11, 2024
@pritam-dey3

pritam-dey3 commented Aug 11, 2024

Hello @compilade, I was trying to fix it as well, looks like you beat me to it 😅. Won't my fix be a little faster since it avoids the nested loop?

Aside from that, I have a small question: if clearing the batch doesn't clear the KV cache, then this PR looks good to me 👍

@compilade
Collaborator Author

Hello @compilade, I was trying to fix it as well, looks like you beat me to it 😅.

@pritam-dey3 Nice to see you were also trying to fix the problem you found :)

Won't my fix be a little faster since it avoids the nested loop?

The copies of the system prompt tokens into the batch are done anyway, so I don't see why the nested loop would be slower. It's simply chunking the copy n_batch tokens at a time instead of doing it all up-front.

(Also, calling llama_batch_init each time the system prompt is updated without calling llama_batch_free on the previous llama_batch contained in batch will cause a memory leak. The approach I took here re-uses the same batch buffers every time.)


for (int32_t i = 0; i < batch.n_tokens; i += n_batch) {
Collaborator

I guess there is some outdated logic in update_slots as well, because there is a similar loop there.

Collaborator Author
@compilade compilade Aug 14, 2024

The only problem I see in update_slots is when the batch size is smaller than the number of slots; in that case tokens are still added to the batch, which could write out of bounds.

llama_batch_add(batch, slot.sampled, system_tokens.size() + slot_npast, { slot.id + 1 }, true);

When the batch size is bigger than the number of slots, it does not seem to have the same problem as the system prompt, because no more than n_batch tokens are ever added to a batch:

// add prompt tokens for processing in the current batch
// TODO: the self-extend stuff here is a mess - simplify and/or abstract it somehow
for (; slot.n_past < slot.n_prompt_tokens && batch.n_tokens < n_batch; ++slot.n_past) {
    if (slot.ga_n != 1) {
        while (slot_npast >= ga_i + ga_w) {
            const int bd = (ga_w/ga_n)*(ga_n - 1);
            slot_npast -= bd;
            ga_i += ga_w/ga_n;
        }
    }

    llama_batch_add(batch, prompt_tokens[slot.n_past], system_tokens.size() + slot_npast, { slot.id + 1 }, false);

So yes, this loop:

for (int32_t i = 0; i < batch.n_tokens; i += n_batch) {

might be a bit outdated, but I think it's still relevant for the error handling logic:

// retry with half the batch size to try to find a free slot in the KV cache
n_batch /= 2;

From the above I think what could be fixed is parallel generation with small batch sizes:

diff --git a/examples/server/server.cpp b/examples/server/server.cpp
index a1bbafbc..b2bf5bb5 100644
--- a/examples/server/server.cpp
+++ b/examples/server/server.cpp
@@ -753,13 +753,13 @@ struct server_context {
         default_generation_settings_for_props = get_formated_generation(slots.front());
         default_generation_settings_for_props["seed"] = -1;
 
-        // the update_slots() logic will always submit a maximum of n_batch tokens
+        // the update_slots() logic will always submit a maximum of n_batch or n_parallel tokens
         // note that n_batch can be > n_ctx (e.g. for non-causal attention models such as BERT where the KV cache is not used)
         {
             const int32_t n_batch = llama_n_batch(ctx);
 
             // only a single seq_id per token is needed
-            batch = llama_batch_init(n_batch, 0, 1);
+            batch = llama_batch_init(std::max(n_batch, params.n_parallel), 0, 1);
         }
 
         metrics.init();

Collaborator
@slaren slaren left a comment

I am not very familiar with the server code, but the fix looks good to me.

@ggerganov ggerganov merged commit 98a532d into master Aug 14, 2024
53 checks passed
Successfully merging this pull request may close these issues.

Bug: llama-server with --system-prompt-file stops abruptly without any error