BPE memory leak #39

Open · venual opened this issue Sep 29, 2023 · 7 comments
Labels: help wanted (Extra attention is needed)

Comments


venual commented Sep 29, 2023

I don't know if I'm using it wrong, but creating a new BPE allocates around 20 MB of memory and never releases it. On top of that, the async_openai::get_max_tokens_chat_message function creates a new BPE internally, so every call adds memory usage that is never released.
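A rough sketch of the pattern in question (hypothetical repro using the crate's cl100k_base() constructor; my real code path goes through the async_openai helper mentioned above):

```rust
use tiktoken_rs::cl100k_base;

fn main() {
    // Each iteration constructs a fresh BPE from scratch (~20 MB of
    // allocations per the report above). Watch resident memory in an
    // activity monitor or a heap profiler while this runs.
    for i in 0..10 {
        let bpe = cl100k_base().unwrap();
        let tokens = bpe.encode_with_special_tokens("hello world");
        println!("iteration {i}: {} tokens", tokens.len());
    }
}
```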

zurawiki (Owner) commented Oct 1, 2023

Interesting find! Do you have a code snippet we can reproduce this issue with?

zurawiki (Owner) commented:

Hi @venual, I'm following up to see if this is still an issue.

Sytten commented Nov 13, 2023

We are also seeing this in production and are investigating, @zurawiki.

In any case, there should be a way to re-use the BPE.
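Something along these lines would let callers build it once and share it (just a sketch using once_cell; not an API the crate exposes today as far as I know):

```rust
use once_cell::sync::Lazy;
use std::sync::Mutex;
use tiktoken_rs::{cl100k_base, CoreBPE};

// Build the cl100k BPE a single time and reuse it for every call,
// instead of reconstructing (and re-allocating) it on each request.
static CL100K: Lazy<Mutex<CoreBPE>> =
    Lazy::new(|| Mutex::new(cl100k_base().expect("failed to load cl100k_base")));

fn count_tokens(text: &str) -> usize {
    CL100K.lock().unwrap().encode_with_special_tokens(text).len()
}
```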

Sytten commented Nov 13, 2023

We did some investigation and the lib does a lot of allocations, especially around regexes (fancy_regex) in _encode_native. Given a 1-2 MB input, it can easily balloon to 50 MB of RAM usage. Just looking at the code, I see a couple of quick wins (for example, the transformation of messages from OpenAI types to your types clones them instead of taking a reference). It would certainly be better to use a single BPE instance too, but I don't think that is the primary memory problem.
We use the lib to estimate the cost of user-provided requests before sending them to OpenAI.

zurawiki (Owner) commented:

Thanks for the analysis @Sytten.

Is there a specific code snippet we can use to build a regression test to make sure memory stays under a reasonable limit? How are you testing memory usage?

For follow-ups, it looks like we should:

  • Fix fancy_regex's memory usage in _encode_native
  • Reduce uses of clones in the transformation of OpenAI types (see the borrowing sketch after this list)
  • Only use a single BPE instance
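On the second point, the rough idea would be to borrow the incoming messages during conversion instead of cloning them (hypothetical sketch; ChatMessage and PromptMessage below are stand-ins, not the real async_openai or tiktoken-rs types):

```rust
// Hypothetical stand-ins for the OpenAI-side and tiktoken-rs-side message types.
struct ChatMessage {
    role: String,
    content: String,
}

struct PromptMessage<'a> {
    role: &'a str,
    content: &'a str,
}

// Convert by borrowing instead of cloning every String.
fn to_prompt_messages(msgs: &[ChatMessage]) -> Vec<PromptMessage<'_>> {
    msgs.iter()
        .map(|m| PromptMessage { role: &m.role, content: &m.content })
        .collect()
}
```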

Sytten commented Nov 14, 2023

We mainly did manual tests with a code path that calls tiktoken::num_tokens_from_messages.
I used dhat (https://docs.rs/dhat/latest/dhat/), and you can build automated memory tests with it.
What we have seen is that memory usage grows "exponentially" with the amount of data you feed in, so you will easily see RAM usage spike in an activity monitor when you feed it 300 KB. Even though in theory you should not send the OpenAI service that much data, I would still use that as a benchmark to diff memory usage.
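For an automated check, dhat's testing mode can assert a ceiling on peak heap usage, something like this (a sketch; the input size and the 20 MB budget are placeholders):

```rust
// Route all allocations through dhat so its heap stats see them.
#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;

#[test]
fn num_tokens_memory_stays_bounded() {
    let _profiler = dhat::Profiler::builder().testing().build();

    // Placeholder workload: tokenize roughly 300 KB of text once.
    let input = "lorem ipsum ".repeat(25_000);
    let bpe = tiktoken_rs::cl100k_base().unwrap();
    let _tokens = bpe.encode_with_special_tokens(&input);

    // Fail the test if peak heap usage exceeds the agreed budget.
    let stats = dhat::HeapStats::get();
    dhat::assert!(stats.max_bytes < 20 * 1024 * 1024);
}
```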

zurawiki added the help wanted (Extra attention is needed) label Nov 15, 2023
zurawiki (Owner) commented:

So from an initial analysis (see the dhat-heap file below), I can confirm that time is spent in _encode_native, specifically around matching the fancy_regex pattern.

dhat-heap.json

There are performance notes observing that the regex takes a fair bit of CPU. I noticed that the regex used in the original tiktoken library uses the negative lookahead operator (?!, which can create performance problems.

"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+",

We could try to switch to the Rust regex crate, since we don't necessarily have the threading issues that the Python interop has, but we would need to use a regex without look-around.
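As a quick sanity check of that constraint (small snippet, assuming both fancy_regex and regex are available as dependencies):

```rust
fn main() {
    // fancy_regex accepts the negative lookahead branch used by tiktoken...
    assert!(fancy_regex::Regex::new(r"\s+(?!\S)").is_ok());
    // ...but the plain regex crate has no look-around support, so this
    // branch of the pattern would need to be reformulated before switching.
    assert!(regex::Regex::new(r"\s+(?!\S)").is_err());
}
```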

Note that, in case you want to profile memory usage, I pushed a commit with an example that can be run with:

cargo run --example num_tokens_memory --features dhat-heap --release
