
PostMessage: Data cannot be cloned, out of memory #12

Open
flatsiedatsie opened this issue May 4, 2024 · 23 comments
Labels
bug Something isn't working


@flatsiedatsie
Contributor

I'm trying to load Mistral 7B 32K. I've chunked the 4.3GB model and uploaded it to huggingface.

When the download is seemingly complete, there is a warning about being out of memory:

(screenshot of the out-of-memory warning)

It's a little odd, as I normally load bigger chunked models (Llama 8B) with WebLLM. The task manager also indicates that memory pressure is medium.

@ngxson
Owner

ngxson commented May 4, 2024

Yes, it seems the issue is due to the way multiple files are copied to the web worker: we currently copy all shards at once, which may cause it to run out of memory. Possible fixes (a rough sketch follows the list):

  • Copy one file at a time ==> This is the easy fix
  • Even better, move the download function into the web worker so no copy is needed (at the cost of being harder to control) ==> The hard way to fix
  • Maybe use SharedArrayBuffer when possible to avoid the copy ==> Should be easy to implement
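
A minimal, hypothetical sketch of the first option (not wllama's actual internals; the worker script name and message shape are made up for illustration):

// Hypothetical sketch: fetch and hand over shards one at a time, transferring each
// ArrayBuffer to the worker (zero-copy) instead of cloning all shards in one postMessage.
const worker = new Worker('model-worker.js'); // placeholder worker script

async function sendShardsSequentially(shardUrls) {
  for (const url of shardUrls) {
    const response = await fetch(url);
    const buffer = await response.arrayBuffer();
    // The transfer list moves ownership of `buffer` to the worker, so the main
    // thread never holds more than one shard in memory at a time.
    worker.postMessage({ type: 'shard', url, buffer }, [buffer]);
  }
  worker.postMessage({ type: 'done' });
}

A SharedArrayBuffer variant would let the worker read the data directly with no copy at all, but it requires the page to be cross-origin isolated (COOP/COEP headers).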

@flatsiedatsie
Contributor Author

flatsiedatsie commented May 8, 2024

This is becoming a bit of a showstopper, unfortunately. It seems to even affect small models that would load under llama_cpp_wasm, such as NeuralReyna :-(

If you could help fix this issue, or give some pointers on how I could attempt to do so myself, that would be greatly appreciated. At this point I don't mind if a fix is slow or sub-optimal. I just want wllama to be reliable.

@ngxson
Owner

ngxson commented May 8, 2024

I'm planning to work on this issue in the next few days. It may be more complicated than it looks, so I'll need time to figure it out. Please be patient.

@flatsiedatsie
Contributor Author

That's great news! Thank you so much!

@ngxson
Owner

ngxson commented May 10, 2024

FYI, v1.7.0 has been released. It also comes with support for progressCallback; please see the "advanced" example:

await wllama.loadModelFromUrl(MODEL, {
  embeddings: true,
  n_ctx: 1024,
  progressCallback: ({ loaded, total }) => console.log(`Downloading... ${Math.round(loaded/total*100)}%`),
});

This issue (out-of-memory) is hopefully fixed by #14 , but I'm not 100% sure. Please try again & let me know if it works.

Also, it's now recommended to split the model into chunks of 256MB or 512MB. Again, see "advanced" example:

// Or, try loading a bigger model (1.3GB in total)
/*const MODEL_SPLITS = [
'https://huggingface.co/ngxson/test_gguf_models/resolve/main/neuralreyna-mini-1.8b-v0.3.q4_k_m-00001-of-00005.gguf',
'https://huggingface.co/ngxson/test_gguf_models/resolve/main/neuralreyna-mini-1.8b-v0.3.q4_k_m-00002-of-00005.gguf',
'https://huggingface.co/ngxson/test_gguf_models/resolve/main/neuralreyna-mini-1.8b-v0.3.q4_k_m-00003-of-00005.gguf',
'https://huggingface.co/ngxson/test_gguf_models/resolve/main/neuralreyna-mini-1.8b-v0.3.q4_k_m-00004-of-00005.gguf',
'https://huggingface.co/ngxson/test_gguf_models/resolve/main/neuralreyna-mini-1.8b-v0.3.q4_k_m-00005-of-00005.gguf',
];*/

Also have a look at the updated README: https://github.com/ngxson/wllama/tree/master?tab=readme-ov-file#prepare-your-model

Thank you!

@flatsiedatsie
Contributor Author

flatsiedatsie commented May 11, 2024

The readme mentions the progress feature (very nice bonus, thank you!), but just to be sure: does this also address the memory issue? Or is the intended fix for that to make the chunks smaller?

Ah, reading again..

Also, it's now recommended to split the model into chunks of 256MB or 512MB.

OK, I'll do that. Thank you.

@flatsiedatsie
Contributor Author

I'm seeing this error after creating a chunked model of Open Buddy Mistral 7B 32k Q4_K_M with 50 x 100 MB chunks:

(screenshots of the error)
		"download_url":[
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00001-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00002-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00003-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00004-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00005-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00006-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00007-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00008-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00009-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00010-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00011-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00012-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00013-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00014-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00015-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00016-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00017-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00018-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00019-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00020-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00021-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00022-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00023-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00024-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00025-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00026-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00027-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00028-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00029-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00030-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00031-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00032-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00033-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00034-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00035-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00036-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00037-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00038-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00039-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00040-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00041-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00042-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00043-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00044-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00045-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00046-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00047-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00048-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00049-of-00050.gguf",
			"https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked/resolve/main/open_buddy_mistral-00050-of-00050.gguf",
		],

@ngxson
Owner

ngxson commented May 12, 2024

I'm seeing this error after creating a chunked model of Open Buddy Mistral 7B 32k Q4_K_M with 50 x 100 MB chunks:

@flatsiedatsie FYI, I uploaded v1.8.0, which should display a better error message (I don't know whether it fixes the mentioned issue or not). Could you try again and see what the error is? Thanks.

@flatsiedatsie
Contributor Author

I will. I've been trying lots of things, actually. But unfortunately I'm still having trouble loading models that WebLLM does load.

The following screenshots are not so much bugs; I've managed to solve some of them (basically by reducing the context size).

(screenshots)

@flatsiedatsie
Contributor Author

flatsiedatsie commented May 13, 2024

I'm also still looking into your suggestion that it may be that the model is trying to load twice.

@ngxson
Owner

ngxson commented May 13, 2024

Your screenshot still shows "_wllama_decode_exception", which has already been removed in 1.8.0. Maybe your code is not using the latest version.

@flatsiedatsie
Contributor Author

Correct, those are screenshots from yesterday. I'm updating it now.

@flatsiedatsie
Contributor Author

flatsiedatsie commented May 13, 2024

OK, I've done some more testing. TL;DR: Things are running a lot smoother now! It's just the big models or big contexts that run out of memory.

But before I get into that, let me give a little context about what I'm trying to achieve. I'm trying to create a 100% browser-based online tool where people can not only chat with AI, but use it to work on documents. For that I need two types of models:

  1. A small model with a huge context for summarization tasks.
  • Small: Danube with 8K context is great for memory-poor mobile phones.
  • Medium: NeuralReyna is a step up, as it has a 32K context.
  • Large: Phi 3 with 128K context.
  2. A large model with a relatively small context for more demanding tasks, like rewriting part of a document in a different tone.
  • Small: I'm not sure yet.
  • Medium: Mistral 7B with 4K
  • Large: Llama 3 8B is the top of the line.

Mistral 7B with 32K context could be a good "middle of the road do-it-all" option, so I've been trying to run that with Wllama today.

I started by using your example code to eliminate the possibility that bugs in my project were the cause of the issues. I also rebooted my laptop first (MacBook Pro with 16 GB of RAM) to have as much available memory as possible. Once I found that I got the same results with the example as with my code, I mostly reverted back to my project.

  1. Qwen 0.5 (GGUF)

The only model I've been able to get to work with a 16K context. It crashes at its theoretical maximum, 32K.
(screenshot of the crash)

  2. NeuralReyna

In my main code I can now load NeuralReyna. However, if I try to use the full 32K, or even 16K, there are once again memory issues. With 8K it doesn't crash.

  3. Phi 3 - 4K

I chunked it into 250 MB parts, and it loads! Nice!

  4. Phi 3 - 128K

Here I tried to directly load a 1.96 GB .gguf file (Q3_K) and even that worked! This is pretty great, as llama.cpp support for this model is right around the corner.

To be clear, I used it with a 4K context, since llama.cpp doesn't support a bigger context yet.

  5. Mistral 7B - 32K

This model has memory issues. To make sure it wasn't my code I tried loading the model in the advanced example too. Same result. Even setting the context to 1K doesn't help. The chunks I'm using are available here: https://huggingface.co/BoscoTheDog/open_buddy_mistral_7B_32k_chunked

With version 1.8, Wllama doesn't seem to raise an error though? It just states the issue in the console. But my code thinks the model has loaded OK, even though it hasn't. Is there a way to get the failed state?

(screenshot of the console output)

In summary, only the bigger models/contexts now seem to run into issues.

  • You could argue the model is just "too big". But from using WebLLM I know that it should be possible to run it in the browser, with memory to spare. Similarly, the even bigger Llama 3 8B 4K can run under WebLLM. And since Macs have unified memory, I can't blame it on WebLLM offloading to graphics card memory, right?

  • You could argue "Well, run Mistral 7B through WebLLM then". But WebLLM only runs when a WebGPU is available. It would be awesome to seemlessly switch between Wllama and WebLLM in the background, depending on WebGPU support.

I still have to test what happens on devices with less memory (e.g. an 8 GB MacBook Air).

Finally, I just want to say: thank you for all your work on this! It's such an exciting development. Everybody talks about bringing AI to the masses, but too few people realize the browser is the best platform to do that with. Wllama is awesome!

@flatsiedatsie
Contributor Author

flatsiedatsie commented May 13, 2024

Just a quick check:

  • Is it reasonable to set the n_ctx and n_seq_max to the same value? In the advanced example you only seem to set n_ctx. Do you recommend doing the same?
  • Is it reasonable to have n_batch hardcoded at 1024?

@ngxson
Owner

ngxson commented May 13, 2024

Thank you for the very detailed info!

It's true that we will definitely struggle with the memory issue, because AFAIK browsers do have some limits on memory usage. Optimizing memory usage will surely be an area that I'll need to invest my time into.

However, if I try to use the full 32K, or even 16K, there are once again memory issues. With 8K it doesn't crash.

FYI, n_ctx doesn't have to be a power of 2. It can be a multiple of 1024, for example 10 * 1024 (= 10K).

Another trick to reduce memory usage is to use q4_0 quantization for cache_type_k, for example:

wllama.loadModelFromUrl(MODEL, {
  n_ctx: 10 * 1024,
  cache_type_k: 'q4_0',
});

WebLLM offloading it to graphics card memory, right?

Yes, WebLLM offloads the model weights and KV cache to the GPU (not just Apple Silicon, but also NVIDIA/AMD/Intel Arc GPUs). I couldn't find on Google what the hard limit for WebGPU memory is, so I suppose it can use all available GPU VRAM.

It would be ideal to have WebGPU support built directly into llama.cpp itself, but that's far too complicated, so for now there's not much choice left for us.

  • Is it reasonable to set the n_ctx and n_seq_max to the same value? In the advanced example you only seem to set n_ctx. Do you recommend doing the same?

n_seq_max is always 1 and should not be modified (I should remove it in the next release). The reason is that n_seq_max controls the number of sequences that can be processed in one batch. This is only useful when you have a big server that processes multiple requests at the same time (provided the server has a beefy NVIDIA GPU). In our case, we only have a single user at a time, so multi-sequence processing would decrease performance.

  • Is it reasonable to have n_batch hardcoded at 1024?

If you're not using the model for embeddings, 1024 is probably fine. However, embedding models like BERT are non-causal, meaning they need n_batch to be bigger than the sequence length.
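
For example, a minimal sketch for an embedding model (the model URL and sizes here are placeholders, not values from this thread):

// For a non-causal (BERT-style) embedding model, keep n_batch at least as
// large as the longest input you plan to embed.
await wllama.loadModelFromUrl(EMBEDDING_MODEL_URL, {
  embeddings: true,
  n_ctx: 512,
  n_batch: 512, // covers the whole input in a single batch
});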

@felladrin
Contributor

felladrin commented May 14, 2024

I've got a 7B Q2_K model working! (Total file size: 2.72 GB)

I was able to use a context up to n_ctx: 9 * 1024 using cache_type_k: "q4_0".

The inference speed was around 2 tokens per second when using 6 threads.

(screenshots from console)

I've uploaded the split-gguf here. To try it, you can use this model URL array:

Array.from({ length: 45 }, (_, i) => `https://huggingface.co/Felladrin/gguf-sharded-smashed-WizardLM-2-7B/resolve/main/WizardLM-2-7B.Q2_K.shard-${(i + 1).toString().padStart(5, "0")}-of-00045.gguf`)
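
For instance, a minimal sketch of loading these shards with the settings mentioned above (assuming loadModelFromUrl accepts the URL array, as in the "advanced" example):

const MODEL_URLS = Array.from(
  { length: 45 },
  (_, i) => `https://huggingface.co/Felladrin/gguf-sharded-smashed-WizardLM-2-7B/resolve/main/WizardLM-2-7B.Q2_K.shard-${(i + 1).toString().padStart(5, "0")}-of-00045.gguf`
);

await wllama.loadModelFromUrl(MODEL_URLS, {
  n_ctx: 9 * 1024,      // largest context that worked in this test
  cache_type_k: 'q4_0', // quantized KV cache to reduce memory usage
});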

@felladrin
Contributor

felladrin commented May 14, 2024

Now I've got a 7B Q3_K_M working! (Total file size: 3.52 GB)
I think the previous attempt didn't work because I was setting a split size that was too small. I've increased it to 96 MB per chunk and it's now working.

Array.from({ length: 43 }, (_, i) => `https://huggingface.co/Felladrin/gguf-sharded-Mistral-7B-OpenOrca/resolve/main/Mistral-7B-OpenOrca-Q3_K_M.shard-${(i + 1).toString().padStart(5, "0")}-of-00043.gguf`)

@flatsiedatsie
Contributor Author

*stops watching this space ;-)

@flatsiedatsie
Contributor Author

flatsiedatsie commented May 15, 2024

I'm not as lucky it seems. The 7B Q3_K_M with 4K context:

(screenshot of the error)

Could it be that Wllama doesn't allow swap to be used?

@felladrin
Contributor

felladrin commented May 15, 2024

@flatsiedatsie, please confirm if you have set cache_type_k: "q4_0" when loading the model. It seems to be failing due to cache_type_k being f16, as per the screenshot.

@flatsiedatsie
Contributor Author

@felladrin You're right! I accidentally had that commented out for some testing.

And.. it's working!!

Thank you both so much! Mistral! On CPU! In the browser! This is a game changer!

@flatsiedatsie
Contributor Author

Does n_batch have an effect on memory consumption? Should I set it lower than 1024 for lower contexts? Or is 1024 generally safe?

@felladrin
Contributor

I'm happy to see it too!

I usually leave n_batch unset. By default it will be filled with the same value as n_ctx, and I haven't had memory problems because of it.
But I use a low n_ctx for my case, 2048. I don't know how it affects memory when the context is larger.
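
A minimal sketch of that setup (the model URL array is a placeholder):

// n_batch is left unset here; it defaults to the same value as n_ctx.
await wllama.loadModelFromUrl(MODEL_URLS, {
  n_ctx: 2048,
});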

@ngxson ngxson pinned this issue May 21, 2024
@ngxson ngxson added the bug Something isn't working label Jun 25, 2024