splitText function is too long #140

Open
kevin-btc opened this issue Jul 4, 2024 · 2 comments

@kevin-btc
Contributor

I've noticed that the splitText function is slow. A single call takes about 150 to 300 milliseconds, and when it runs over a whole list of transcripts in the frontend, it takes far too long and noticeably slows down the app.

We need the splitText function to work faster, even with a big list of transcripts, to keep the app running smoothly.

As a quick fix, I've switched to using TokenTextSplitter from the langchain library, which is a lot faster for my needs. But this is just a temporary solution, and it would be great to have a more permanent fix in the polyfire-js library.

@lowczarc
Member

lowczarc commented Jul 5, 2024

I had similar problems in the API part a while ago.

A big optimization is to call encode once, do the splitting directly on the tokens, and then decode the resulting chunks.
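A minimal sketch of that idea, not the actual polyfire-js code. The `Tokenizer` interface, the `charTokenizer` stand-in (one token per UTF-16 code unit) and the `splitByTokens` name are all hypothetical, introduced only for illustration; a real version would wrap whatever encode/decode functions the library already uses.

```typescript
// Hypothetical tokenizer shape (assumption, not the polyfire-js API).
interface Tokenizer {
  encode(text: string): number[];
  decode(tokens: number[]): string;
}

// Toy stand-in for illustration only: one token per UTF-16 code unit.
const charTokenizer: Tokenizer = {
  encode: (text) => text.split("").map((ch) => ch.charCodeAt(0)),
  decode: (tokens) => tokens.map((t) => String.fromCharCode(t)).join(""),
};

// Encode the whole text once, slice the token array into fixed-size
// windows, and decode each window back to a string.
function splitByTokens(
  text: string,
  chunkSize: number,
  tokenizer: Tokenizer,
): string[] {
  if (chunkSize <= 0) {
    throw new Error("chunkSize must be positive");
  }
  const tokens = tokenizer.encode(text); // the only encode call
  const chunks: string[] = [];
  for (let start = 0; start < tokens.length; start += chunkSize) {
    chunks.push(tokenizer.decode(tokens.slice(start, start + chunkSize)));
  }
  return chunks;
}

// Usage with the toy tokenizer:
const tokenChunks = splitByTokens("abcdefgh", 3, charTokenizer);
// tokenChunks is ["abc", "def", "gh"]
```

With a real byte-level tokenizer, a cut at an arbitrary token boundary may land in the middle of a multi-byte character, so a production version would probably need to handle that case.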

But even outside of that, I don't know if we still need an algorithm that complex. Right now it tries as much as possible to cut between paragraphs first, lines second, sentences third, etc., while keeping the chunks as even as possible.
I feel like it's something we needed during the autodoc era but isn't really relevant anymore.

Maybe we could just do the same thing as in the API and cut at the chunkSize limit, or at least enforce only a sentence rule: split at every full stop, encode, merge sentences until they fill a chunk, and decode every chunk.
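The sentence rule could look roughly like the sketch below. It is only an illustration under assumptions: `TokenCodec`, `charCodec` (a toy codec with one token per UTF-16 code unit) and `splitBySentences` are hypothetical names, and keeping an over-long sentence as its own oversized chunk is a choice made here, not something decided in this thread.

```typescript
// Hypothetical encode/decode shape (assumption, not the polyfire-js API).
interface TokenCodec {
  encode(text: string): number[];
  decode(tokens: number[]): string;
}

// Toy stand-in for illustration only: one token per UTF-16 code unit.
const charCodec: TokenCodec = {
  encode: (text) => text.split("").map((ch) => ch.charCodeAt(0)),
  decode: (tokens) => tokens.map((t) => String.fromCharCode(t)).join(""),
};

function splitBySentences(
  text: string,
  chunkSize: number,
  codec: TokenCodec,
): string[] {
  // Split after every run of full stops, keeping the stops and any
  // trailing whitespace with the sentence; the last alternative catches
  // a final sentence that has no full stop.
  const sentences = text.match(/[^.]*\.+\s*|[^.]+$/g) ?? [];

  const chunks: string[] = [];
  let buffer: number[] = [];

  for (const sentence of sentences) {
    const tokens = codec.encode(sentence); // each sentence is encoded once
    // Flush the buffer if adding this sentence would exceed the limit.
    if (buffer.length > 0 && buffer.length + tokens.length > chunkSize) {
      chunks.push(codec.decode(buffer));
      buffer = [];
    }
    // Assumption: a sentence longer than chunkSize stays whole and
    // becomes its own oversized chunk.
    buffer = buffer.concat(tokens);
  }
  if (buffer.length > 0) {
    chunks.push(codec.decode(buffer));
  }
  return chunks;
}

// Usage with the toy codec:
const sentenceChunks = splitBySentences("One. Two. Three.", 10, charCodec);
// sentenceChunks is ["One. Two. ", "Three."]
```

Because every sentence keeps its trailing whitespace, joining the chunks gives back the original text.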

@victorforissier
Contributor

victorforissier commented Jul 5, 2024 via email
