
PostMessage: Data cannot be cloned, out of memory #12

Open
flatsiedatsie opened this issue May 4, 2024 · 23 comments
Labels
bug Something isn't working


@flatsiedatsie
Contributor

I'm trying to load Mistral 7B 32K. I've chunked the 4.3GB model and uploaded it to huggingface.

When the download is seemingly complete, there is a warning about being out of memory:

(screenshot of the out-of-memory warning)

It's a little odd, as I normally load bigger chunked models (Llama 8B) with WebLLM. The task manager also indicates that memory pressure is medium.

@ngxson
Owner

ngxson commented May 4, 2024

Yes, it seems the issue is due to the way multiple files are copied to the web worker: we currently copy all shards at once, which may cause it to run out of memory. Possible fixes (a rough sketch follows the list):

  • Copy one file at a time ==> This is the easy fix
  • Even better, move the download function into the web worker so no copy is needed (at the cost of being harder to control) ==> The hard way to fix
  • Maybe use SharedArrayBuffer when possible to avoid the copy ==> Should be easy to implement
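
A minimal, hypothetical sketch of the first option (not wllama's actual internals; the worker script name and message shape are made up for illustration):

// Hypothetical sketch: fetch and hand over shards one at a time, transferring each
// ArrayBuffer to the worker (zero-copy) instead of cloning all shards in one postMessage.
const worker = new Worker('model-worker.js'); // placeholder worker script

async function sendShardsSequentially(shardUrls) {
  for (const url of shardUrls) {
    const response = await fetch(url);
    const buffer = await response.arrayBuffer();
    // The transfer list moves ownership of `buffer` to the worker, so the main
    // thread never holds more than one shard in memory at a time.
    worker.postMessage({ type: 'shard', url, buffer }, [buffer]);
  }
  worker.postMessage({ type: 'done' });
}

A SharedArrayBuffer variant would let the worker read the data directly with no copy at all, but it requires the page to be cross-origin isolated (COOP/COEP headers).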

@flatsiedatsie
Contributor Author

flatsiedatsie commented May 8, 2024

This is becoming a bit of a showstopper, unfortunately. It seems to even affect small models that would load under llama_cpp_wasm, such as NeuralReyna :-(

If you could help fix this issue, or give some pointers on how I could attempt to do so myself, that would be greatly appreciated. At this point I don't mind if a fix is slow or sub-optimal. I just want wllama to be reliable.

@ngxson
Owner

ngxson commented May 8, 2024

I'm planning to work on this issue in the next few days. It may be more complicated than it looks, so I'll need time to figure it out. Please be patient.

@flatsiedatsie
Contributor Author

That's great news! Thank you so much!

@ngxson
Owner

ngxson commented May 10, 2024

FYI, v1.7.0 has been released. It also comes with support for progressCallback; please see the "advanced" example:

await wllama.loadModelFromUrl(MODEL, {
  embeddings: true,
  n_ctx: 1024,
  progressCallback: ({ loaded, total }) => console.log(`Downloading... ${Math.round(loaded/total*100)}%`),
});

This issue (out-of-memory) is hopefully fixed by #14 , but I'm not 100% sure. Please try again & let me know if it works.

Also, it's now recommended to split the model into chunks of 256MB or 512MB. Again, see "advanced" example:

// Or, try loading a bigger model (1.3GB in total)
/*const MODEL_SPLITS = [
'https://huggingface.co/ngxson/test_gguf_models/resolve/main/neuralreyna-mini-1.8b-v0.3.q4_k_m-00001-of-00005.gguf',
'https://huggingface.co/ngxson/test_gguf_models/resolve/main/neuralreyna-mini-1.8b-v0.3.q4_k_m-00002-of-00005.gguf',
'https://huggingface.co/ngxson/test_gguf_models/resolve/main/neuralreyna-mini-1.8b-v0.3.q4_k_m-00003-of-00005.gguf',
'https://huggingface.co/ngxson/test_gguf_models/resolve/main/neuralreyna-mini-1.8b-v0.3.q4_k_m-00004-of-00005.gguf',
'https://huggingface.co/ngxson/test_gguf_models/resolve/main/neuralreyna-mini-1.8b-v0.3.q4_k_m-00005-of-00005.gguf',
];*/

Also have a look at the updated README: https://github.com/ngxson/wllama/tree/master?tab=readme-ov-file#prepare-your-model

Thank you!

@flatsiedatsie
Contributor Author

flatsiedatsie commented May 11, 2024

The readme mentions the progress feature (very nice bonus, thank you!), but just to be sure: does this also address the memory issue? Or is the intended fix for that to make the chunks smaller?

Ah, reading again..

Also, it's now recommended to split the model into chunks of 256MB or 512MB.

OK, I'll do that. Thank you.

@flatsiedatsie
Contributor Author

I'm seeing this error after creating a chunked model of Open Buddy Mistral 7B 32k Q4_K_M with 50 x 100 MB chunks:

(screenshots of the error)
		"download_url":[
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00001-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00002-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00003-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00004-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00005-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00006-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00007-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00008-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00009-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00010-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00011-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00012-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00013-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00014-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00015-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00016-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00017-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00018-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00019-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00020-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00021-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00022-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00023-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00024-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00025-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00026-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00027-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00028-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00029-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00030-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00031-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00032-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00033-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00034-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00035-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00036-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00037-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00038-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00039-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00040-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00041-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00042-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00043-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00044-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00045-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00046-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00047-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00048-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00049-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00050-of-00050.gguf",
		],

@ngxson
Owner

ngxson commented May 12, 2024

I'm seeing this error after creating a chunked model of Open Buddy Mistral 7B 32k Q4_K_M with 50 x 100 MB chunks:

@flatsiedatsie FYI, I uploaded v1.8.0, which should display a better error message (I don't know whether it fixes the mentioned issue or not). Could you try again and see what the error is? Thanks.

@flatsiedatsie
Contributor Author

I will. I've been trying lots of things, actually. But unfortunately I'm still having trouble loading models that WebLLM does load.

The following screenshots are not so much bugs; I've managed to solve some of them (basically by reducing the context size).

(screenshots)

@flatsiedatsie
Contributor Author

flatsiedatsie commented May 13, 2024

I'm also still looking into your suggestion that it may be that the model is trying to load twice.

@ngxson
Owner

ngxson commented May 13, 2024

Your screenshot still shows "_wllama_decode_exception", which has already been removed in 1.8.0. Maybe your code is not using the latest version.

@flatsiedatsie
Contributor Author

Correct, those are screenshots from yesterday. I'm updating it now.

@flatsiedatsie
Contributor Author

flatsiedatsie commented May 13, 2024

OK, I've done some more testing. TL;DR: Things are running a lot smoother now! It's just the big models or big contexts that run out of memory.

But before I get into that, let me give a little context about what I'm trying to achieve. I'm trying to create a 100% browser-based online tool where people can not only chat with AI, but use it to work on documents. For that I need two types of models:

  1. A small model with a huge context for summarization tasks.
  • Small: Danube with 8K context is great for memory-poor mobile phones.
  • Medium: NeuralReyna is a step up, as it has a 32K context.
  • Large: Phi 3 with 128K context.
  2. A large model with a relatively small context for more demanding tasks, like rewriting part of a document in a different tone.
  • Small: I'm not sure yet.
  • Medium: Mistral 7B with 4K
  • Large: Llama 3 8B is the top of the line.

Mistral 7B with 32K context could be a good "middle of the road do-it-all" option, so I've been trying to run that with Wllama today.

I started by using your example code to eliminate the possibility that bugs in my project were the cause of the issues. I also rebooted my laptop first (MacBook Pro with 16 GB of RAM) to have as much available memory as possible. Once I found that I got the same results with the example as with my code, I mostly reverted back to my project.

  1. Qwen 0.5 (GGUF)

The only model I've been able to get to work with a 16K context. It crashes at its theoretical maximum, 32K.
(screenshot of the crash)

  2. NeuralReyna

In my main code I can now load NeuralReyna. However, if I try to use the full 32K, or even 16K, there are once again memory issues. With 8K it doesn't crash.

  3. Phi 3 - 4K

I chunked it into 250 MB parts, and it loads! Nice!

  4. Phi 3 - 128K

Here I tried to directly load a 1.96 GB .gguf file (Q3_K) and even that worked! This is pretty great, as llama.cpp support for this model is right around the corner.

To be clear, I used it with a 4K context, since llama.cpp doesn't support a bigger context yet.

  5. Mistral 7B - 32K

This model has memory issues. To make sure it wasn't my code I tried loading the model in the advanced example too. Same result. Even setting the context to 1K doesn't help. The chunks I'm using are available here: https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked

With version 1.8, Wllama doesn't seem to raise an error though? It just states the issue in the console. But my code thinks the model has loaded OK, even though it hasn't. Is there a way to get the failed state?

(screenshot of the console output)

In summary, only the bigger models/contexts now seem to run into issues.

  • You could argue the model is just "too big". But from using WebLLM I know that it should be possible to run it in the browser, with memory to spare. Similarly, the even bigger Llama 3 8B 4K can run under WebLLM. And since Macs have unified memory, I can't blame it on WebLLM offloading to graphics card memory, right?

  • You could argue "Well, run Mistral 7B through WebLLM then". But WebLLM only runs when a WebGPU is available. It would be awesome to seemlessly switch between Wllama and WebLLM in the background, depending on WebGPU support.

I still have to test what happens on devices with less memory (e.g. an 8 GB MacBook Air).

Finally, I just want to say: thank you for all your work on this! It's such an exciting development. Everybody talks about bringing AI to the masses, but too few people realize the browser is the best platform to do that with. Wllama is awesome!

@flatsiedatsie
Contributor Author

flatsiedatsie commented May 13, 2024

Just a quick check:

  • Is it reasonable to set the n_ctx and n_seq_max to the same value? In the advanced example you only seem to set n_ctx. Do you recommend doing the same?
  • Is it reasonable to have n_batch hardcoded at 1024?

@ngxson
Owner

ngxson commented May 13, 2024

Thank you for the very detailed info!

It's true that we will definitely struggle with the memory issue, because AFAIK browsers do have some limits on memory usage. Optimizing memory usage will surely be an area that I'll need to invest my time into.

However, if I try to use the full 32K, or even 16K, there are once again memory issues. With 8K it doesn't crash.

FYI, n_ctx doesn't have to be a power of 2. It can be a multiple of 1024, for example 10 * 1024 (= 10K).

Another trick to reduce memory usage is to use q4_0 quantization for cache_type_k, for example:

wllama.loadModelFromUrl(MODEL, {
  n_ctx: 10 * 1024,
  cache_type_k: 'q4_0',
});

WebLLM offloading it to graphics card memory, right?

Yes, WebLLM offloads the model weights and KV cache to the GPU (not just Apple Silicon, but also NVIDIA/AMD/Intel Arc GPUs). I couldn't find on Google what the hard limit for WebGPU memory is, so I suppose it can use all available GPU VRAM.

It would be ideal to have WebGPU support built directly into llama.cpp itself, but that's far too complicated, so for now there's not much choice left for us.

  • Is it reasonable to set the n_ctx and n_seq_max to the same value? In the advanced example you only seem to set n_ctx. Do you recommend doing the same?

n_seq_max is always 1 and should not be modified (I should remove it in the next release). The reason is that n_seq_max controls the number of sequences that can be processed in one batch. This is only useful when you have a big server that processes multiple requests at the same time (provided the server has a beefy NVIDIA GPU). In our case, we only have a single user at a time, so multi-sequence processing would decrease performance.

  • Is it reasonable to have n_batch hardcoded at 1024?

If you're not using the model for embeddings, 1024 is probably fine. However, embedding models like BERT are non-causal, meaning they need n_batch to be bigger than the sequence length.
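
For example, a minimal sketch for an embedding model (the model URL and sizes here are placeholders, not values from this thread):

// For a non-causal (BERT-style) embedding model, keep n_batch at least as
// large as the longest input you plan to embed.
await wllama.loadModelFromUrl(EMBEDDING_MODEL_URL, {
  embeddings: true,
  n_ctx: 512,
  n_batch: 512, // covers the whole input in a single batch
});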

@felladrin
Contributor

felladrin commented May 14, 2024

I've got a 7B Q2_K model working! (Total file size: 2.72 GB)

I was able to use a context up to n_ctx: 9 * 1024 using cache_type_k: "q4_0".

The inference speed was around 2 tokens per second when using 6 threads.

(screenshots from console)

I've uploaded the split-gguf here. To try it, you can use this model URL array:

Array.from({ length: 45 }, (_, i) => `https://huggingface.co/Felladrin/gguf-sharded-smashed-WizardLM-2-7B/resolve/main/WizardLM-2-7B.Q2_K.shard-${(i + 1).toString().padStart(5, "0")}-of-00045.gguf`)
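
For instance, a minimal sketch of loading these shards with the settings mentioned above (assuming loadModelFromUrl accepts the URL array, as in the "advanced" example):

const MODEL_URLS = Array.from(
  { length: 45 },
  (_, i) => `https://huggingface.co/Felladrin/gguf-sharded-smashed-WizardLM-2-7B/resolve/main/WizardLM-2-7B.Q2_K.shard-${(i + 1).toString().padStart(5, "0")}-of-00045.gguf`
);

await wllama.loadModelFromUrl(MODEL_URLS, {
  n_ctx: 9 * 1024,      // largest context that worked in this test
  cache_type_k: 'q4_0', // quantized KV cache to reduce memory usage
});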

@felladrin
Contributor

felladrin commented May 14, 2024

Now I've got a 7B Q3_K_M working! (Total file size: 3.52 GB)
I think the previous attempt didn't work because I was setting a split size that was too small. I've increased it to 96 MB per chunk and it's now working.

Array.from({ length: 43 }, (_, i) => `https://huggingface.co/Felladrin/gguf-sharded-Mistral-7B-OpenOrca/resolve/main/Mistral-7B-OpenOrca-Q3_K_M.shard-${(i + 1).toString().padStart(5, "0")}-of-00043.gguf`)

@flatsiedatsie
Contributor Author

*stops watching this space ;-)

@flatsiedatsie
Contributor Author

flatsiedatsie commented May 15, 2024

I'm not as lucky it seems. The 7B Q3_K_M with 4K context:

(screenshot of the error)

Could it be that Wllama doesn't allow swap to be used?

@felladrin
Contributor

felladrin commented May 15, 2024

@flatsiedatsie, please confirm if you have set cache_type_k: "q4_0" when loading the model. It seems to be failing due to cache_type_k being f16, as per the screenshot.

@flatsiedatsie
Contributor Author

@felladrin You're right! I accidentally had that commented out for some testing.

And.. it's working!!

Thank you both so much! Mistral! On CPU! In the browser! This is a game changer!

@flatsiedatsie
Contributor Author

Does n_batch have an effect on memory consumption? Should I set it lower than 1024 for lower contexts? Or is 1024 generally safe?

@felladrin
Contributor

I'm happy to see it too!

I usually leave n_batch unset. By default it will be filled with the same value as n_ctx, and I haven't had memory problems because of it.
But I use a low n_ctx for my case, 2048. I don't know how it affects memory when the context is larger.
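
A minimal sketch of that setup (the model URL array is a placeholder):

// n_batch is left unset here; it defaults to the same value as n_ctx.
await wllama.loadModelFromUrl(MODEL_URLS, {
  n_ctx: 2048,
});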

@ngxson ngxson pinned this issue May 21, 2024
@ngxson ngxson added the bug Something isn't working label Jun 25, 2024