Chunk file reads and tokenization for text to mds conversion #1240

irenedea · 2024-05-25T01:30:06Z

Read files and tokenize in 1MB chunks.

Addresses two issues:

Tokenization is significantly slower on long sequences.
Loading extremely large files into memory at once can cause out-of-memory errors (OOMs).
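
For readers unfamiliar with the pattern, here is a minimal sketch of reading a text file in fixed-size chunks and tokenizing each chunk separately, so the whole file is never held in memory and the tokenizer never sees one very long string. This is not the code from this PR: `read_chunks`, `tokenize_file`, and `char_tokenize` are hypothetical names, the chunk size is measured in characters rather than bytes, and the character-level tokenizer is only a stand-in for a real HF tokenizer.

```python
from typing import Callable, Iterator, List

# Assumption: roughly "1MB" expressed as a character count, for illustration only.
CHUNK_SIZE = 1_000_000


def read_chunks(path: str, chunk_size: int = CHUNK_SIZE) -> Iterator[str]:
    """Yield successive pieces of a text file, at most chunk_size characters each."""
    with open(path, "r", encoding="utf-8") as f:
        # iter() with a sentinel keeps calling f.read() until it returns "".
        yield from iter(lambda: f.read(chunk_size), "")


def tokenize_file(
    path: str,
    tokenize: Callable[[str], List[int]],
    chunk_size: int = CHUNK_SIZE,
) -> Iterator[int]:
    """Tokenize a file chunk by chunk, yielding token ids in file order."""
    for chunk in read_chunks(path, chunk_size):
        yield from tokenize(chunk)


def char_tokenize(text: str) -> List[int]:
    """Hypothetical stand-in tokenizer: one id per character (not a real HF tokenizer)."""
    return [ord(c) for c in text]
```

With a real subword tokenizer, the tokens produced around a chunk boundary may differ from what whole-file tokenization would produce, which is one reason to compare token counts as in the manual tests that follow.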

Manual Tests

Convert files of specific sizes

With changes:
test-mds-conversion-500mb-rZDKAm: took 8 minutes
test-mds-conversion-baseline-5gb-HRj4NE: took 1.5 hours (ignore the fact that "baseline" is in the name lol)

Without changes:
test-mds-conversion-baseline-500mb-QJy3gj: hung; stopped after 2 days

Training 1 epoch with sec filings small dataset

Confirmed that the token counts are the same.

With changes:
test-mds-conversion-mpt-7b-TRKKfd

Without changes:
test-mds-conversion-baseline-mpt-7b-XTf6c2

Training 1 epoch with 5.5MB file

Confirmed that the token counts are the same.

With changes:
test-mds-conversion-mpt-7b-5mb-UY6HtO

Without changes:
test-mds-conversion-baseline-mpt-7b-5mb-71friv

mvpatel2000

Why does doing it in chunks make it faster?

irenedea · 2024-05-28T16:57:40Z

@mvpatel2000 Tokenizing large strings all at once is really slow. HF tokenizers are typically optimized for shorter strings: huggingface/transformers#25873 (comment)

scripts/data_prep/convert_text_to_mds.py

irenedea added 4 commits May 25, 2024 01:28

Chunk file reads and tokenization

4ada4b5

Finish up tokens

a6d54b7

Update comment

e1afb46

Merge branch 'main' into chunks

3cffdeb

irenedea requested review from dakinggg, mvpatel2000, milocress and KuuCi May 27, 2024 06:08

mvpatel2000 approved these changes May 28, 2024

Merge branch 'main' into chunks

861d4a0

irenedea enabled auto-merge (squash) May 28, 2024 16:47

XiaohanZhangCMU reviewed May 28, 2024

scripts/data_prep/convert_text_to_mds.py (review thread resolved)

irenedea merged commit 43d149b into mosaicml:main May 28, 2024
9 checks passed
