-
Notifications
You must be signed in to change notification settings - Fork 974
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update LitData version and restore previous LitData assertions in tests #1609
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this!
chunk_bytes="200MB", | ||
) | ||
optimize( | ||
fn=partial(tokenize, split_dataset["val"]), | ||
inputs=list(range(len(split_dataset["val"]))), | ||
output_dir=self.data_path_val, | ||
num_workers=(os.cpu_count() - 1), | ||
num_workers=min(8, os.cpu_count() - 1), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good change. There were a couple of people who reported issues where everything hangs / gets very slow, and it might have been due to too many workers and processes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, but it's not why I did it. It's just that in this dataset, there aren't that many documents. So it's wasteful to create more workers than there are items to process, some workers would just not get anything to do. Like on the A100 we have 256 cores but that's too much for this dataset.
Restores the assertions to what they were before updating LitData in #1573. LitData 0.2.17 has fixes for handling drop_last, which was causing the previous changes in the assertions and number of samples returned.
This also addresses the comment from @carmocca #1573 (comment)