Chunk file reads and tokenization for text to mds conversion #1240

Merged
irenedea merged 5 commits into mosaicml:main on May 28, 2024

Conversation

irenedea
Contributor

@irenedea irenedea commented May 25, 2024

Read files and tokenize them in 1MB chunks, instead of loading and tokenizing each file in one pass.

Addresses two issues:

  1. Tokenization is significantly slower on long sequences.
  2. Loading extremely large files into memory at once can cause OOMs.
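
A minimal sketch of the idea, not the code in this PR: read the file a fixed-size piece at a time, tokenize each piece, and cut the accumulated tokens into fixed-length samples. The names `read_chunks`, `tokenize_in_chunks`, `tokenize`, and `max_length` are hypothetical, and `tokenize` stands in for whatever tokenizer is used (any callable mapping a string to a list of token ids). The chunk size is counted in characters here, as an approximation of the 1MB chunks described above.

```python
from typing import Callable, Iterator, List

# Characters per read; an assumed stand-in for the "1MB chunks" described above.
CHUNK_SIZE = 1_000_000


def read_chunks(path: str, chunk_size: int = CHUNK_SIZE) -> Iterator[str]:
    """Yield successive pieces of a text file without loading it whole."""
    with open(path, "r", encoding="utf-8") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return
            yield chunk


def tokenize_in_chunks(
    path: str,
    tokenize: Callable[[str], List[int]],  # hypothetical tokenizer callable
    max_length: int,
    chunk_size: int = CHUNK_SIZE,
) -> Iterator[List[int]]:
    """Yield fixed-length token samples, tokenizing one chunk at a time."""
    buffer: List[int] = []
    for chunk in read_chunks(path, chunk_size):
        # Only one chunk of text is held in memory and tokenized per step.
        buffer.extend(tokenize(chunk))
        while len(buffer) >= max_length:
            yield buffer[:max_length]
            buffer = buffer[max_length:]
    # Leftover tokens shorter than max_length are dropped in this sketch.
```

Because the file is opened in text mode, a chunk never splits a multi-byte character, but it can split a word, so tokens at a chunk boundary may differ from those produced by tokenizing the whole file at once.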

Manual Tests

Convert files of specific sizes

With changes:
test-mds-conversion-500mb-rZDKAm: took 8 minutes
test-mds-conversion-baseline-5gb-HRj4NE (ignore the fact that "baseline" is in the name lol): took 1.5 hours

Without changes:
test-mds-conversion-baseline-500mb-QJy3gj: hung; stopped after 2 days

Training 1 epoch with sec filings small dataset

Confirmed that the token counts are the same.

With changes:
test-mds-conversion-mpt-7b-TRKKfd

Without changes:
test-mds-conversion-baseline-mpt-7b-XTf6c2

Training 1 epoch with 5.5MB file

Confirmed that the token counts are the same.

With changes:
test-mds-conversion-mpt-7b-5mb-UY6HtO

Without changes:
test-mds-conversion-baseline-mpt-7b-5mb-71friv

Collaborator

@mvpatel2000 mvpatel2000 left a comment

Why does doing it in chunks make it faster?

@irenedea irenedea enabled auto-merge (squash) May 28, 2024 16:47
@irenedea
Contributor Author

@mvpatel2000 Tokenizing large strings all at once is really slow. HF tokenizers are typically optimized for shorter strings: huggingface/transformers#25873 (comment)

@irenedea irenedea merged commit 43d149b into mosaicml:main May 28, 2024
9 checks passed
3 participants