
[feat] Distribute NLTK tokenizers used in the core package #829

Open
CalebCourier opened this issue Jun 14, 2024 · 5 comments
Labels
enhancement New feature or request Stale

Comments

@CalebCourier
Collaborator

Description
Since we now require nltk and the punkt tokenizer in the validation loop for chunking during streaming, we should either download and distribute the punkt tokenizer with the library or find a way to include it during the install phase. From what I can see, the only way to perform a post-install flow is to switch back to setuptools instead of Poetry, but even that may not work for all distribution methods.

Why is this needed
Currently we download the tokenizer at runtime if it doesn't exist, which can cause issues in certain environments like Kubernetes. See #821
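The runtime behavior described above boils down to a check-then-download pattern. A minimal sketch (the function names `punkt_available` and `ensure_punkt` are illustrative, not Guardrails APIs):

```python
def punkt_available() -> bool:
    """Check whether the punkt tokenizer is already on NLTK's data path."""
    try:
        import nltk
        nltk.data.find("tokenizers/punkt")
        return True
    except (ImportError, LookupError):
        return False


def ensure_punkt() -> None:
    """Download punkt at runtime only when missing -- the behavior this issue wants to remove."""
    if not punkt_available():
        import nltk
        nltk.download("punkt", quiet=True)
```

The download call is what fails in locked-down environments (read-only filesystems, no outbound network), which is why Kubernetes and Lambda deployments hit this.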

Implementation details
The simplest path would be to download the tokenizer during our deploy script and include it in the distributable.

The downside to this approach is that the tokenizer is ~38 MB.
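The deploy-time step could look something like the sketch below; `bundle_punkt` and the target directory are hypothetical, and the resulting directory would then be listed as package data so it ships inside the wheel/sdist:

```python
import os


def bundle_punkt(target_dir: str, fetch: bool = True) -> str:
    """Deploy-time helper: create target_dir and optionally download punkt into it.

    The directory would then be declared as package data so the ~38 MB of
    tokenizer files ship with the distributable.
    """
    os.makedirs(target_dir, exist_ok=True)
    if fetch:
        import nltk  # assumed available in the deploy environment
        nltk.download("punkt", download_dir=target_dir, quiet=True)
    return os.path.abspath(target_dir)
```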

An alternative, as previously mentioned, is to abandon Poetry and switch back to setuptools. This should allow us to implement post-install functionality in setup.py, though we would need to verify this works across all the various ways the library can be installed.
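A hedged sketch of what that setup.py hook might look like, using the standard setuptools `cmdclass` override (class and package names are illustrative). Note the caveat the issue already hints at: installs from pre-built wheels never execute setup.py, so this hook would only fire for sdist installs.

```python
# setup.py -- illustrative post-install hook, not the project's actual config
from setuptools import setup
from setuptools.command.install import install


class PostInstall(install):
    """Run the normal install, then fetch the punkt tokenizer."""

    def run(self):
        install.run(self)
        import nltk
        nltk.download("punkt", quiet=True)


setup(
    name="guardrails-ai",
    cmdclass={"install": PostInstall},
)
```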

Another alternative is to find a smaller, installable tokenizer to perform chunking.

End result
No nltk downloads are performed during runtime.
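Whichever option is chosen, the library would point NLTK at the shipped data instead of downloading. NLTK supports this via its `nltk.data.path` search list (the helper name and bundle directory below are illustrative):

```python
def use_bundled_data(bundle_dir: str) -> bool:
    """Prepend a packaged data directory to NLTK's search path.

    Returns True once the path is registered, False when nltk is not installed.
    """
    try:
        import nltk
    except ImportError:
        return False
    if bundle_dir not in nltk.data.path:
        nltk.data.path.insert(0, bundle_dir)
    return True
```

Setting the `NLTK_DATA` environment variable to the bundled directory achieves the same thing without code changes.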

@CalebCourier CalebCourier added the enhancement New feature or request label Jun 14, 2024
@vprecup
Copy link

vprecup commented Jul 17, 2024

NLTK also poses problems for the AWS Lambda runtime. I also agree that the tokenizer itself is quite large: it is roughly 15% of the total allowed dependency size (~250 MB) when deploying Lambdas with layers.


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 14 days.


@CalebCourier
Collaborator Author

Instead of distributing the tokenizer, we're taking a different approach: removing the nltk requirement from the core package in favor of other, lighter tokenizers. Some validators may still require nltk, but it will no longer be necessary for all. @AlejandroEsquivel should have more details on this soon.


@github-actions github-actions bot added the Stale label Oct 20, 2024