
Update LitData version and restore previous LitData assertions in tests #1609

Merged
7 commits merged into main on Jul 22, 2024

Conversation

@awaelchli (Member) commented on Jul 22, 2024

Restores the assertions to what they were before LitData was updated in #1573. LitData 0.2.17 includes fixes for handling drop_last, which had caused the earlier changes to the assertions and to the number of samples returned.
This also addresses the comment from @carmocca in #1573 (comment).
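
(For context, a minimal arithmetic sketch of why drop_last changes the number of samples a test sees; the dataset size and world size below are made up for illustration and are not the values asserted in the test suite.)

```python
# Illustrative only: hypothetical sizes, not the values from the tests.
num_samples = 10   # items in the streaming dataset
world_size = 4     # simulated number of ranks reading the dataset

# With drop_last enabled, every rank gets the same floored share and the
# incomplete remainder is dropped, so fewer samples come back in total.
per_rank = num_samples // world_size            # 2
total_with_drop_last = per_rank * world_size    # 8

# Without drop_last, the remainder is kept and the total matches the
# dataset length again.
total_without_drop_last = num_samples           # 10

print(total_with_drop_last, total_without_drop_last)
```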

@awaelchli marked this pull request as ready for review on July 22, 2024 13:22
@awaelchli requested a review from lantiga as a code owner on July 22, 2024 13:22
@awaelchli requested a review from rasbt on July 22, 2024 13:22
@rasbt (Collaborator) left a comment

Thanks for this!

      chunk_bytes="200MB",
  )
  optimize(
      fn=partial(tokenize, split_dataset["val"]),
      inputs=list(range(len(split_dataset["val"]))),
      output_dir=self.data_path_val,
-     num_workers=(os.cpu_count() - 1),
+     num_workers=min(8, os.cpu_count() - 1),
A Collaborator commented:

That's a good change. A couple of people reported issues where everything hangs or becomes very slow, and it might have been due to too many workers and processes.

@awaelchli (Member, Author) replied:

Sure, but that's not why I did it. There just aren't that many documents in this dataset, so it's wasteful to create more workers than there are items to process; some workers would simply get nothing to do. For example, on the A100 machine we have 256 cores, and that's far too many for this dataset.
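
(To make the reasoning above concrete, a minimal sketch of a worker cap along those lines. `capped_num_workers` is a hypothetical helper; the actual diff only caps at 8, and the extra bound by the item count mirrors the argument in this comment rather than the merged code.)

```python
import os

def capped_num_workers(num_items: int, hard_cap: int = 8) -> int:
    """Spawn no more workers than the hard cap, the available CPU cores
    minus one, or the number of items there are to process."""
    cores = os.cpu_count() or 2
    return max(1, min(hard_cap, cores - 1, num_items))

# On a 256-core machine with only 5 documents to tokenize,
# this yields 5 workers instead of 255 mostly idle ones.
print(capped_num_workers(num_items=5))
```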

@awaelchli merged commit 12fb3cb into main on Jul 22, 2024
9 checks passed
@awaelchli deleted the udpate-lit-data-2 branch on July 22, 2024 14:28