[feat] Distribute NLTK tokenizers used in the core package #829
Comments
NLTK also poses problems for the AWS Lambda runtime. I also agree that the tokenizer itself is quite large: it is roughly 15% of the total allowed dependency size (~250 MB) when deploying Lambdas with layers.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 14 days.
Instead of distributing, we're taking a different approach: removing the nltk requirement from the core package in favor of other, lighter tokenizers. Some validators may still require nltk, but it will not be necessary for all. @AlejandroEsquivel should have more details on this soon.
Description
Since we now require nltk and the punkt tokenizer during the validation loop (for chunking during streaming), we should either download and distribute the punkt tokenizer with the library or find a way to include it during the install phase. From what I can see, the only way to perform a post-install flow is to switch back to setuptools instead of Poetry, and even that may not work for all distribution methods.
Why is this needed
Currently we download the tokenizer at runtime if it doesn't already exist, which can cause issues in certain environments like Kubernetes. See #821
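The guarded runtime download described above can be sketched roughly as follows (the function name is ours for illustration, not an actual API in the library). The failure mode is the `nltk.download` call, which needs network access and a writable data directory:

```python
def ensure_punkt() -> None:
    """Download the punkt tokenizer only if it is missing locally.

    In network-restricted or read-only environments (e.g. some
    Kubernetes pods or Lambda runtimes), the download step fails,
    which is the problem this issue describes.
    """
    import nltk  # imported lazily to keep the sketch self-contained

    try:
        nltk.data.find("tokenizers/punkt")
    except LookupError:
        nltk.download("punkt")
```
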
Implementation details
The simplest path would be to download the tokenizer during our deploy script and include it in the distributable.
The downside to this approach is that the tokenizer is ~38 MB.
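As a sketch of that deploy-script step: fetch punkt into a directory inside the package tree so it ships with the wheel and no network access is needed at runtime. The target directory name here is illustrative, not the project's actual layout:

```python
def bundle_punkt(target_dir: str = "guardrails/_nltk_data") -> None:
    """Deploy-time step: download punkt into the package tree so the
    ~38 MB tokenizer is bundled into the built distribution."""
    import nltk  # imported lazily to keep the sketch self-contained

    nltk.download("punkt", download_dir=target_dir)
```

At runtime the library would then point nltk at the bundled directory (via `nltk.data.path` or the `NLTK_DATA` environment variable) instead of downloading.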
An alternative, as previously mentioned, is to abandon Poetry and switch back to setuptools. This should allow us to implement post-install functionality in setup.py, though we would need to verify this works across all the various ways the library can be installed.
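A setuptools post-install hook would look roughly like this (a sketch, assuming the switch back from Poetry). One known caveat worth noting: pip does not run custom install commands when installing from a prebuilt wheel, which is one reason this may not work for all distribution methods:

```python
from setuptools.command.install import install


class PostInstall(install):
    """Run the normal install steps, then fetch the punkt tokenizer."""

    def run(self):
        super().run()
        import nltk

        nltk.download("punkt")


# setup.py would then register the custom command, e.g.:
# setup(..., cmdclass={"install": PostInstall})
```
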
Another alternative is to find a smaller, installable tokenizer to perform chunking.
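To make the third option concrete, here is a naive regex-based sentence splitter standing in for punkt. This is only a sketch of the trade-off: it avoids the 38 MB download entirely but loses punkt's handling of abbreviations and other edge cases, so it is not a drop-in replacement:

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive chunker: split on sentence-ending punctuation
    followed by whitespace. No model data required."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


print(split_sentences("Hello there. How are you? Fine!"))
# → ['Hello there.', 'How are you?', 'Fine!']
```
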
End result
No nltk downloads are performed during runtime.