Support streaming from cloud storage for downloading training data #585
Comments
I just created PR #765 for loading datasets from cloud storage (S3, GCS). This is not the same as
|
The new PR is a good use case as well. It just needs streaming enabled to stream the data in. |
I'm working on this. I propose the following addition:
→
So a new key to What do you think? |
@fmv1992, hey, thanks for the comment. This sounds great. Could you let us know what the drawbacks of using Mosaic's streaming dataset method are? |
The only drawback I can see is the introduction of a new optional dependency (with everything that comes with it). Other than that, their implementation looks pretty solid. The alternative would be to implement this ourselves, which can get quite big depending on the features you want. At the very least, one needs to download a few files in parallel (or use a "view" of a file that supports partial reads), rotate them out after use, and keep track of that rotation. All in all, using a third-party library seems the most efficient solution here. I'm eager to hear your thoughts and I'm open to alternatives as well. |
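To make the "implement it ourselves" alternative concrete, here is a minimal sketch of the three pieces mentioned above: parallel downloads, rotation after use, and tracking of that rotation. It uses only the standard library and is not how MosaicML's streaming works internally. The names `stream_shards`, `fetch`, and `fake_fetch` are hypothetical, and `fetch` stands in for whatever S3/GCS client call would actually download a shard.

```python
from collections import deque
from concurrent.futures import Future, ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Deque, Iterator, List, Tuple


def stream_shards(
    shard_names: List[str],
    fetch: Callable[[str, Path], None],  # hypothetical downloader: (remote name, local path)
    cache_dir: Path,
    prefetch: int = 2,
) -> Iterator[Path]:
    """Yield local shard paths in order, downloading ahead and deleting after use."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    remaining = deque(shard_names)  # shards not yet scheduled
    in_flight: Deque[Tuple[Path, Future]] = deque()  # scheduled, not yet consumed

    with ThreadPoolExecutor(max_workers=prefetch) as pool:

        def top_up() -> None:
            # Keep up to `prefetch` downloads running in parallel.
            while remaining and len(in_flight) < prefetch:
                name = remaining.popleft()
                dest = cache_dir / name
                in_flight.append((dest, pool.submit(fetch, name, dest)))

        top_up()
        while in_flight:
            path, future = in_flight.popleft()
            future.result()  # wait for this shard; re-raises download errors
            top_up()  # start the next download while the caller reads this shard
            try:
                yield path
            finally:
                # Rotation: drop the shard once the caller is done with it, so at
                # most `prefetch + 1` shards are on disk at any time.
                path.unlink(missing_ok=True)


def fake_fetch(name: str, dest: Path) -> None:
    """Stand-in for a real cloud download, used only for illustration."""
    dest.write_text(f"contents of {name}")
```

Even this toy version leaves out retries, shuffling, multi-worker coordination, and resumption, which is where the size of a home-grown implementation would come from.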
What about pretraining |
I like the idea, but this is my first contribution to this repo. I would feel more comfortable keeping this as small in size and scope as possible. I think supporting pretraining later will be easy once this PR is merged and we have agreed on the details. |
Mosaic's streaming sounds like a solid option. The dependency should be fine, as the base packages it requires do not seem to clash with the current packages. Do you want to outline the changes you plan to make first, so we can run through them, or would you prefer making a PR directly instead? |
@NanoCode012 , this is a sketch of what I'm doing: Any criticism is welcome. If I move the import to inside
|
I think the section you're editing is for One part I recall about Mosaic streaming was that it required conversion to its DatasetFormat: https://docs.mosaicml.com/projects/streaming/en/stable/preparing_datasets/dataset_format.html#introduction How would you deal with this? Or does this expect that the user has already converted the dataset? IMO, the current dataset cloud implementation is quite bare. I'm open to replacing it with this streaming method if it's cleaner. Re: import. I don't think we have a specific preference. Maybe just a simple utility, |
I thought it was better to add the PR directly: #1525 . Let me know what you think. (I suggest we move further discussion to that PR). |
🔖 Feature description
Streaming data straight from cloud storage while a model is training is a great feature to have, because it will inevitably be cheaper to stream than to rent large clusters and download a large dataset. This becomes especially important when running multi-node.
The idea is that you can store your data in S3 or other cloud storage and stream batches of examples directly as you train. This enables deterministic training and makes it easier to recover from a hardware failure.
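As a rough illustration of why a fixed streaming order helps with determinism and recovery, the sketch below derives the sample order from a seed and an epoch number, so a run that crashed after consuming some samples can resume at the same point. This is an assumption-laden toy, not the MosaicML streaming implementation; `sample_order` and `iterate_from` are hypothetical names.

```python
import random
from typing import Iterator, List


def sample_order(num_samples: int, seed: int, epoch: int) -> List[int]:
    """Return a shuffled sample order that depends only on (seed, epoch)."""
    rng = random.Random(f"{seed}-{epoch}")  # string seeds are reproducible across runs
    order = list(range(num_samples))
    rng.shuffle(order)
    return order


def iterate_from(
    num_samples: int, seed: int, epoch: int, samples_seen: int, batch_size: int
) -> Iterator[List[int]]:
    """Yield batches of sample indices, skipping the first `samples_seen` samples."""
    order = sample_order(num_samples, seed, epoch)
    for start in range(samples_seen, num_samples, batch_size):
        yield order[start:start + batch_size]


# A fresh run and a run resumed after the first batch see the same remaining batches.
full_run = list(iterate_from(10, seed=0, epoch=0, samples_seen=0, batch_size=4))
resumed = list(iterate_from(10, seed=0, epoch=0, samples_seen=4, batch_size=4))
```

Only `seed`, `epoch`, and `samples_seen` need to be stored in a checkpoint for the resumed run to pick up where the failed one stopped.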
✔️ Solution
Integrate with MosaicML's streaming library.
https://github.com/mosaicml/streaming
❓ Alternatives
No response
📝 Additional Context
No response