[feat] Distribute NLTK tokenizers used in the core package #829
Comments
NLTK also poses problems for the AWS Lambda runtime. I also agree that the tokenizer itself is quite large: it is roughly 15% of the total allowed dependency size (~250 MB) when deploying Lambdas with layers.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 14 days.
Instead of distributing, we're taking a different approach: removing the nltk requirement from the core package in favor of other, lighter tokenizers. Some validators may still require nltk, but it will not be necessary for all. @AlejandroEsquivel should have more details on this soon.
Description
Since we now require nltk and the punkt tokenizer during the validation loop (for chunking during streaming), we should either download and distribute the punkt tokenizer with the library or find a way to include it during the install phase. From what I can see, the only way to perform a post-install flow is to switch back to setuptools instead of Poetry, and even that may not work for all distribution methods.
Why is this needed
Currently we download the tokenizer at runtime if it doesn't already exist, which can cause issues in certain environments like Kubernetes. See #821
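The guarded runtime download described above can be sketched roughly as follows (the function name is ours for illustration, not an actual API in the library). The failure mode is the `nltk.download` call, which needs network access and a writable data directory:

```python
def ensure_punkt() -> None:
    """Download the punkt tokenizer only if it is missing locally.

    In network-restricted or read-only environments (e.g. some
    Kubernetes pods or Lambda runtimes), the download step fails,
    which is the problem this issue describes.
    """
    import nltk  # imported lazily to keep the sketch self-contained

    try:
        nltk.data.find("tokenizers/punkt")
    except LookupError:
        nltk.download("punkt")
```
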
Implementation details
The simplest path would be to download the tokenizer during our deploy script and include it in the distributable.
The downside to this approach is that the tokenizer is ~38 MB.
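As a sketch of that deploy-script step: fetch punkt into a directory inside the package tree so it ships with the wheel and no network access is needed at runtime. The target directory name here is illustrative, not the project's actual layout:

```python
def bundle_punkt(target_dir: str = "guardrails/_nltk_data") -> None:
    """Deploy-time step: download punkt into the package tree so the
    ~38 MB tokenizer is bundled into the built distribution."""
    import nltk  # imported lazily to keep the sketch self-contained

    nltk.download("punkt", download_dir=target_dir)
```

At runtime the library would then point nltk at the bundled directory (via `nltk.data.path` or the `NLTK_DATA` environment variable) instead of downloading.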
An alternative, as previously mentioned, is to abandon Poetry and switch back to setuptools. This should allow us to implement post-install functionality in setup.py, though we would need to verify this works across all the various ways the library can be installed.
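A setuptools post-install hook would look roughly like this (a sketch, assuming the switch back from Poetry). One known caveat worth noting: pip does not run custom install commands when installing from a prebuilt wheel, which is one reason this may not work for all distribution methods:

```python
from setuptools.command.install import install


class PostInstall(install):
    """Run the normal install steps, then fetch the punkt tokenizer."""

    def run(self):
        super().run()
        import nltk

        nltk.download("punkt")


# setup.py would then register the custom command, e.g.:
# setup(..., cmdclass={"install": PostInstall})
```
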
Another alternative is to find a smaller, installable tokenizer to perform chunking.
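To make the third option concrete, here is a naive regex-based sentence splitter standing in for punkt. This is only a sketch of the trade-off: it avoids the 38 MB download entirely but loses punkt's handling of abbreviations and other edge cases, so it is not a drop-in replacement:

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive chunker: split on sentence-ending punctuation
    followed by whitespace. No model data required."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


print(split_sentences("Hello there. How are you? Fine!"))
# → ['Hello there.', 'How are you?', 'Fine!']
```
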
End result
No nltk downloads are performed during runtime.