BPE memory leak #39

Open · venual opened this issue Sep 29, 2023 · 7 comments
Labels: help wanted (Extra attention is needed)

Comments


venual commented Sep 29, 2023

I don't know if I'm using it wrong, but creating a new BPE allocates around 20 MB of memory and never releases it. On top of that, the async_openai::get_max_tokens_chat_message function creates a new BPE internally, so every call adds memory usage that is never released.
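A rough sketch of the pattern in question (hypothetical repro using the crate's cl100k_base() constructor; my real code path goes through the async_openai helper mentioned above):

```rust
use tiktoken_rs::cl100k_base;

fn main() {
    // Each iteration constructs a fresh BPE from scratch (~20 MB of
    // allocations per the report above). Watch resident memory in an
    // activity monitor or a heap profiler while this runs.
    for i in 0..10 {
        let bpe = cl100k_base().unwrap();
        let tokens = bpe.encode_with_special_tokens("hello world");
        println!("iteration {i}: {} tokens", tokens.len());
    }
}
```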

zurawiki (Owner) commented Oct 1, 2023

Interesting find! Do you have a code snippet we can reproduce this issue with?

zurawiki (Owner) commented:

Hi @venual, I'm following up to see if this is still an issue.

Sytten commented Nov 13, 2023

We are also seeing this in production and are investigating, @zurawiki.

In any case, there should be a way to re-use the BPE.
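Something along these lines would let callers build it once and share it (just a sketch using once_cell; not an API the crate exposes today as far as I know):

```rust
use once_cell::sync::Lazy;
use std::sync::Mutex;
use tiktoken_rs::{cl100k_base, CoreBPE};

// Build the cl100k BPE a single time and reuse it for every call,
// instead of reconstructing (and re-allocating) it on each request.
static CL100K: Lazy<Mutex<CoreBPE>> =
    Lazy::new(|| Mutex::new(cl100k_base().expect("failed to load cl100k_base")));

fn count_tokens(text: &str) -> usize {
    CL100K.lock().unwrap().encode_with_special_tokens(text).len()
}
```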

Sytten commented Nov 13, 2023

We did some investigation and the lib does a lot of allocations, especially around regexes (fancy_regex) in _encode_native. Given a 1-2 MB input, it can easily balloon to 50 MB of RAM usage. Just looking at the code, I see a couple of quick wins (for example, the transformation of messages from OpenAI types to your types clones them instead of taking a reference). It would certainly be better to use a single BPE instance too, but I don't think that is the primary memory problem.
We use the lib to estimate the cost of user-provided requests before sending them to OpenAI.

zurawiki (Owner) commented:

Thanks for the analysis @Sytten.

Is there a specific code snippet we can use to build a regression test to make sure memory stays under a reasonable limit? How are you testing memory usage?

For follow-ups, it looks like we should:

  • Fix fancy_regex's memory usage in _encode_native
  • Reduce uses of clones in the transformation of OpenAI types (see the borrowing sketch after this list)
  • Only use a single BPE instance
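On the second point, the rough idea would be to borrow the incoming messages during conversion instead of cloning them (hypothetical sketch; ChatMessage and PromptMessage below are stand-ins, not the real async_openai or tiktoken-rs types):

```rust
// Hypothetical stand-ins for the OpenAI-side and tiktoken-rs-side message types.
struct ChatMessage {
    role: String,
    content: String,
}

struct PromptMessage<'a> {
    role: &'a str,
    content: &'a str,
}

// Convert by borrowing instead of cloning every String.
fn to_prompt_messages(msgs: &[ChatMessage]) -> Vec<PromptMessage<'_>> {
    msgs.iter()
        .map(|m| PromptMessage { role: &m.role, content: &m.content })
        .collect()
}
```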

Sytten commented Nov 14, 2023

We mainly did manual tests with a code path that calls tiktoken::num_tokens_from_messages.
I used dhat (https://docs.rs/dhat/latest/dhat/), and you can build automated memory tests with it.
What we have seen is that memory usage grows "exponentially" with the amount of data you feed in, so you will easily see RAM usage spike in an activity monitor when you feed it 300 KB. Even though in theory you should not send the OpenAI service that much data, I would still use that as a benchmark to diff memory usage.
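For an automated check, dhat's testing mode can assert a ceiling on peak heap usage, something like this (a sketch; the input size and the 20 MB budget are placeholders):

```rust
// Route all allocations through dhat so its heap stats see them.
#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;

#[test]
fn num_tokens_memory_stays_bounded() {
    let _profiler = dhat::Profiler::builder().testing().build();

    // Placeholder workload: tokenize roughly 300 KB of text once.
    let input = "lorem ipsum ".repeat(25_000);
    let bpe = tiktoken_rs::cl100k_base().unwrap();
    let _tokens = bpe.encode_with_special_tokens(&input);

    // Fail the test if peak heap usage exceeds the agreed budget.
    let stats = dhat::HeapStats::get();
    dhat::assert!(stats.max_bytes < 20 * 1024 * 1024);
}
```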

zurawiki added the help wanted (Extra attention is needed) label Nov 15, 2023
zurawiki (Owner) commented:

So from an initial analysis (see the dhat-heap file below), I can confirm that time is spent in _encode_native, specifically around matching the fancy_regex pattern.

dhat-heap.json

There are performance notes observing that the regex takes a fair bit of CPU. I noticed that the regex used in the original tiktoken library uses the negative lookahead operator (?!, which can create performance problems.

"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+",

We could try to switch to the Rust regex crate, since we don't necessarily have the threading issues that the Python interop has, but we would need to use a regex without look-around.
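As a quick sanity check of that constraint (small snippet, assuming both fancy_regex and regex are available as dependencies):

```rust
fn main() {
    // fancy_regex accepts the negative lookahead branch used by tiktoken...
    assert!(fancy_regex::Regex::new(r"\s+(?!\S)").is_ok());
    // ...but the plain regex crate has no look-around support, so this
    // branch of the pattern would need to be reformulated before switching.
    assert!(regex::Regex::new(r"\s+(?!\S)").is_err());
}
```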

Note that, in case you want to profile memory usage, I pushed a commit with an example that can be run with:

cargo run --example num_tokens_memory --features dhat-heap --release
