streaming multipack for pretraining dataset #959
Conversation
Here's a patch file I used to test a C4 pretraining dataset with TinyLlama. Multi-GPU doesn't currently work with this, since I think it needs a proper data collator to pad the samples to the same sequence length.
Would this streaming feature work with S3, GCS, or Azure Blob Storage?
This PR is ready for review and should resolve #1026. @mhenrichsen
Confirmed working on a single GPU. Currently fails on multi-GPU.
This pull request introduces support for multipacking in streaming pretraining datasets.
Due to the immense size of these datasets, traditional methods of loading them entirely into memory are not feasible.
The proposed solution aims to enhance efficiency and scalability.
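The core idea of multipacking a streaming dataset can be sketched as a greedy, single-pass bin packer over an iterator of tokenized samples, so nothing beyond the current bin ever has to live in memory. This is a minimal illustration of the technique, not the PR's actual implementation; `pack_stream` and its truncation behavior are assumptions.

```python
def pack_stream(token_lists, max_seq_len):
    """Greedily pack tokenized samples from a (possibly infinite) stream
    into bins of at most max_seq_len total tokens, yielding each bin as
    soon as it is full. Only one partial bin is held in memory at a time."""
    buffer, used = [], 0
    for tokens in token_lists:
        if len(tokens) > max_seq_len:
            # Truncate oversized samples so they fit a single bin (assumption).
            tokens = tokens[:max_seq_len]
        if used + len(tokens) > max_seq_len:
            yield buffer
            buffer, used = [], 0
        buffer.append(tokens)
        used += len(tokens)
    if buffer:
        # Flush the last, possibly partial, bin.
        yield buffer

# Example: pack a stream of variable-length samples into bins of 8 tokens.
stream = iter([[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]])
packs = list(pack_stream(stream, max_seq_len=8))
# packs == [[[1, 2, 3], [4, 5]], [[6, 7, 8, 9], [10]]]
```

Because the packer only consumes the iterator lazily, it composes with a Hugging Face dataset loaded with `streaming=True` without materializing the dataset.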
A guide for writing the config is needed.
This multipack implementation does not use BatchSamplerDataCollatorForSeq2Seq; it only uses DataCollatorForSeq2Seq, because of how the Hugging Face dataset `map` function works.
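The multi-GPU failure mentioned above comes down to batches containing samples of unequal length; a padding collator like DataCollatorForSeq2Seq fixes that by right-padding every sample to the longest one in the batch. Here is a minimal, dependency-free sketch of that padding behavior; `pad_batch` and `pad_token_id=0` are assumptions for illustration, not the actual transformers collator.

```python
def pad_batch(batch, pad_token_id=0):
    """Right-pad a batch of token-id lists to the length of the longest
    sample, mimicking what a Seq2Seq data collator does for input_ids
    so that the batch can be stacked into a rectangular tensor."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_token_id] * (max_len - len(seq)) for seq in batch]

# Samples of lengths 2 and 3 become a rectangular 2x3 batch.
padded = pad_batch([[1, 2], [3, 4, 5]])
# padded == [[1, 2, 0], [3, 4, 5]]
```

With uniform lengths per batch, gradient all-reduce across GPUs no longer sees ragged shapes, which is why a proper collator is a prerequisite for the multi-GPU case.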