
Update LitData version and restore previous LitData assertions in tests #1609

Merged
7 commits merged into main on Jul 22, 2024

Conversation

@awaelchli (Member) commented on Jul 22, 2024

Restores the assertions to what they were before LitData was updated in #1573. LitData 0.2.17 includes fixes for handling drop_last, which had caused the earlier changes to the assertions and to the number of samples returned.
This also addresses the comment from @carmocca in #1573 (comment).
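
(For context, a minimal arithmetic sketch of why drop_last changes the number of samples a test sees; the dataset size and world size below are made up for illustration and are not the values asserted in the test suite.)

```python
# Illustrative only: hypothetical sizes, not the values from the tests.
num_samples = 10   # items in the streaming dataset
world_size = 4     # simulated number of ranks reading the dataset

# With drop_last enabled, every rank gets the same floored share and the
# incomplete remainder is dropped, so fewer samples come back in total.
per_rank = num_samples // world_size            # 2
total_with_drop_last = per_rank * world_size    # 8

# Without drop_last, the remainder is kept and the total matches the
# dataset length again.
total_without_drop_last = num_samples           # 10

print(total_with_drop_last, total_without_drop_last)
```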

@awaelchli marked this pull request as ready for review on July 22, 2024 13:22
@awaelchli requested a review from lantiga as a code owner on July 22, 2024 13:22
@awaelchli requested a review from rasbt on July 22, 2024 13:22
@rasbt (Collaborator) left a comment

Thanks for this!

      chunk_bytes="200MB",
  )
  optimize(
      fn=partial(tokenize, split_dataset["val"]),
      inputs=list(range(len(split_dataset["val"]))),
      output_dir=self.data_path_val,
-     num_workers=(os.cpu_count() - 1),
+     num_workers=min(8, os.cpu_count() - 1),
A Collaborator commented:

That's a good change. A couple of people reported issues where everything hangs or becomes very slow, and it might have been due to too many workers and processes.

@awaelchli (Member, Author) replied:

Sure, but that's not why I did it. There just aren't that many documents in this dataset, so it's wasteful to create more workers than there are items to process; some workers would simply get nothing to do. For example, on the A100 machine we have 256 cores, and that's far too many for this dataset.
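
(To make the reasoning above concrete, a minimal sketch of a worker cap along those lines. `capped_num_workers` is a hypothetical helper; the actual diff only caps at 8, and the extra bound by the item count mirrors the argument in this comment rather than the merged code.)

```python
import os

def capped_num_workers(num_items: int, hard_cap: int = 8) -> int:
    """Spawn no more workers than the hard cap, the available CPU cores
    minus one, or the number of items there are to process."""
    cores = os.cpu_count() or 2
    return max(1, min(hard_cap, cores - 1, num_items))

# On a 256-core machine with only 5 documents to tokenize,
# this yields 5 workers instead of 255 mostly idle ones.
print(capped_num_workers(num_items=5))
```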

@awaelchli merged commit 12fb3cb into main on Jul 22, 2024
9 checks passed
@awaelchli deleted the udpate-lit-data-2 branch on July 22, 2024 14:28