feat: Add chunking function for sequence tagger training on sentences exceeding token limit #3520

Open: MattGPT-ai wants to merge 2 commits into master

MattGPT-ai
Contributor

Closes #3519

Adds a Sentence chunking function so that SequenceTagger can be trained on sentences that exceed the token limit.
Adds tests for this function.
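
For readers who want a rough picture of the idea, here is a minimal sketch of chunking a tagged sentence into pieces that fit under a limit. It is not the code from this PR: the function name `chunk_tagged_tokens`, its parameters, and the choice to count plain tokens (rather than subword tokens produced by a transformer tokenizer) are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def chunk_tagged_tokens(
    tokens: Sequence[str],
    tags: Sequence[str],
    max_tokens: int,
) -> List[Tuple[List[str], List[str]]]:
    """Split parallel token/tag sequences into consecutive chunks.

    Hypothetical helper: each chunk holds at most `max_tokens` tokens,
    and the tags are sliced with the same boundaries as the tokens.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    chunks: List[Tuple[List[str], List[str]]] = []
    # Walk the sequence in fixed-size windows; the last window may be shorter.
    for start in range(0, len(tokens), max_tokens):
        end = start + max_tokens
        chunks.append((list(tokens[start:end]), list(tags[start:end])))
    return chunks


if __name__ == "__main__":
    words = ["Berlin", "is", "in", "Germany", "."]
    labels = ["B-LOC", "O", "O", "B-LOC", "O"]
    for chunk_words, chunk_labels in chunk_tagged_tokens(words, labels, 2):
        print(chunk_words, chunk_labels)
```

One limitation of this naive version: a fixed-size cut can land in the middle of a multi-token entity, so a real implementation would likely need to adjust chunk boundaries or labels around spans.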

@MattGPT-ai MattGPT-ai force-pushed the GH-3519/add-sentence-chunking-method branch 5 times, most recently from de81c1f to 0b23ef6 Compare August 3, 2024 18:26
@MattGPT-ai MattGPT-ai force-pushed the GH-3519/add-sentence-chunking-method branch from b523769 to 7cf4d0f Compare August 9, 2024 17:35
@MattGPT-ai
Contributor Author

It looks like 100% of my tests passed, but the GitHub UI still says my checks failed.

@alanakbik
Collaborator

alanakbik commented Aug 9, 2024

We are getting a "System.IO.IOException: No space left on device" error for the unit tests, as they seem to be taking up too much disk space. I tried removing some of the dataset downloads in the tests in #3526, but it seems that is not enough to prevent this from happening.

@MattGPT-ai
Contributor Author

> We are getting a "System.IO.IOException: No space left on device" error for the unit tests, as they seem to be taking up too much disk space. I tried removing some of the dataset downloads in the tests in #3526, but it seems that is not enough to prevent this from happening.

Is it possible to download only portions of the datasets, say 100 samples or whatever is sufficient for unit testing?


Successfully merging this pull request may close these issues.

[Feature]: Allow sentences longer than the token limit for sequence tagger training
2 participants