Support streaming from cloud storage for downloading training data #585

Open
5 tasks done
casper-hansen opened this issue Sep 15, 2023 · 11 comments
Labels
enhancement New feature or request

Comments

@casper-hansen
Collaborator

⚠️ Please check that this feature request hasn't been suggested before.

  • I searched previous Ideas in Discussions and didn't find any similar feature requests.
  • I searched previous Issues and didn't find any similar feature requests.

🔖 Feature description

Streaming data straight from cloud storage while training is in progress would be a valuable feature: streaming should generally be cheaper than renting a large cluster and first downloading a large dataset onto it. This becomes especially important for multi-node runs.

The idea is that you store your data in S3 or another cloud storage service and stream batches of examples directly while training. This enables deterministic training and makes it easier to recover from a hardware failure.
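
To illustrate the determinism and recovery point, here is a minimal standard-library sketch. It is not MosaicML's implementation: `stream_samples`, the in-memory `shards` list, and the `skip` counter are all hypothetical stand-ins for remote shards and checkpointed progress.

```python
import random
from typing import Iterator, List, Tuple


def stream_samples(
    shards: List[List[str]], seed: int, epoch: int, skip: int = 0
) -> Iterator[Tuple[int, str]]:
    """Yield (position, sample) in an order fixed by (seed, epoch).

    `skip` is the number of samples already consumed before a restart.
    """
    order = list(range(len(shards)))
    # A string seed gives the same shuffle in every process and on every run.
    random.Random(f"{seed}-{epoch}").shuffle(order)
    position = 0
    for shard_index in order:
        # In a real setup, this is where one shard would be fetched from S3/GCS.
        # Simplification: skipped shards are still visited here.
        for sample in shards[shard_index]:
            if position >= skip:
                yield position, sample
            position += 1
```

Because the order depends only on `seed` and `epoch`, a job that crashed after consuming `n` samples can restart with `skip=n` and see exactly the samples it would have seen otherwise.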

✔️ Solution

Integrate with MosaicML's streaming library.
https://github.com/mosaicml/streaming

❓ Alternatives

No response

📝 Additional Context

No response

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this feature has not been requested yet.
  • I have provided enough information for the maintainers to understand and evaluate this request.
@casper-hansen casper-hansen added the enhancement New feature or request label Sep 15, 2023
@casper-hansen casper-hansen changed the title from "Support streaming cloud storage for downloading training data" to "Support streaming from cloud storage for downloading training data" Oct 10, 2023
@NanoCode012
Collaborator

I just created PR #765 for loading a dataset from cloud storage (S3, GCS). This is not the same as streaming, since it downloads the entire dataset, but I wanted to share it in case it fits anyone's use case.

Streaming is definitely on our TODO radar as well!

@casper-hansen
Collaborator Author

The new PR covers a good use case as well. We just need streaming enabled on top of it to stream in data.

@fmv1992

fmv1992 commented Apr 10, 2024

I'm working on this.

I propose the following addition:

    # loading from s3 or gcs
    # s3 creds will be loaded from the system default and gcs only supports public access
    - path: s3://path_to_ds # Accepts folder with arrow/parquet or file path like above. Supports s3, gcs.

    # loading from s3 or gcs
    # s3 creds will be loaded from the system default and gcs only supports public access
    - path: s3://path_to_ds # Accepts folder with arrow/parquet or file path like above. Supports s3, gcs.
      streaming: true

So a new key, streaming, on each datasets entry next to path, with a default value of false. If set to true, the loader shall use StreamingDataset.
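
A rough sketch of how the default could be handled, assuming each datasets entry is available as a plain dict after the YAML is parsed. The function names `wants_streaming` and `choose_loader` are hypothetical and are not part of axolotl.

```python
from typing import Any, Dict


def wants_streaming(dataset_cfg: Dict[str, Any]) -> bool:
    """Return True only when the entry explicitly enables streaming."""
    # A missing key falls back to False, so existing configs keep working.
    return bool(dataset_cfg.get("streaming", False))


def choose_loader(dataset_cfg: Dict[str, Any]) -> str:
    """Name the code path an entry would take (labels only, no loading)."""
    return "streaming" if wants_streaming(dataset_cfg) else "download"
```

With this shape, an entry that omits the key behaves exactly as it does today, and only entries with `streaming: true` take the new path.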

What do you think?

@NanoCode012
Collaborator

@fmv1992, hey, thanks for the comment. This sounds great. Would you be able to let us know what’s the drawback to using mosaic’s streaming dataset method?

@fmv1992

fmv1992 commented Apr 11, 2024

@NanoCode012 ,

Would you be able to let us know what’s the drawback to using mosaic’s streaming dataset method?

The only drawback I can see is the introduction of a new optional dependency (with everything that comes with it). Other than that their implementation looks pretty solid.

The alternative would be for us to implement this ourselves, and the scope can grow quickly depending on the features you want. At the very least one needs to download a few files in parallel (or use a "view" of a file that supports partial reads), rotate them out after use, and keep track of that rotation. All in all, using a third-party library seems the most efficient solution here.
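
To give a feel for the minimum a home-grown version would need, here is a small sketch of parallel prefetching with a bounded window. It uses only the standard library; `iter_shards` and the `fetch` callback are hypothetical, and `fetch` stands in for whatever would actually download one file from cloud storage.

```python
from collections import deque
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Deque, Iterator, List, Sequence


def iter_shards(
    shard_names: Sequence[str],
    fetch: Callable[[str], List[str]],
    prefetch: int = 2,
) -> Iterator[List[str]]:
    """Yield shard contents in order, keeping at most `prefetch` fetches pending."""
    names = iter(shard_names)
    pending: Deque[Future] = deque()
    with ThreadPoolExecutor(max_workers=prefetch) as pool:
        # Fill the initial window of parallel downloads.
        for name in names:
            pending.append(pool.submit(fetch, name))
            if len(pending) >= prefetch:
                break
        while pending:
            # Block until the oldest shard is ready, then drop it from the window.
            shard = pending.popleft().result()
            # Rotation: start the next download to replace the one just taken.
            next_name = next(names, None)
            if next_name is not None:
                pending.append(pool.submit(fetch, next_name))
            yield shard
```

Even this leaves out retries, on-disk caching, eviction of used files, and resumption, which is the part that tends to grow.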

I'm eager to hear your thoughts and I'm open to alternatives as well.

@ehartford
Collaborator

What about pretraining

@fmv1992

fmv1992 commented Apr 11, 2024

What about pretraining

I like the idea, but this is my first contribution to this repo, so I would feel more comfortable keeping it as small in size and scope as possible. I think supporting pretraining later will be easy once we have agreed on the details and the PR is merged.

@NanoCode012
Collaborator

Mosaic's streaming sounds like a solid option. The dependency should be fine, as the base packages it requires do not seem to clash with our current packages.

Do you want to outline the changes you plan to make first, so we can run through them, or would you prefer making a PR directly instead?

@fmv1992

fmv1992 commented Apr 11, 2024

@NanoCode012 , this is a sketch of what I'm doing:

[Image: git diff of the proposed change (not reproduced here)]

Any criticism is welcome.

If I move the import to inside load_streaming_dataset I prevent any import errors from the optional package. I've seen the alternative pattern of:

has_mosaic_streaming_support = False
try:
    from streaming import StreamingDataset
    has_mosaic_streaming_support = True
except ImportError:
    pass
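
For comparison, the first option (importing inside the function) could look roughly like the sketch below. The helper `import_optional` is hypothetical, and the `StreamingDataset(remote=..., local=...)` call is an assumption about how the library would be invoked, not code taken from the PR.

```python
import importlib
from types import ModuleType


def import_optional(module_name: str, feature: str) -> ModuleType:
    """Import an optional dependency at call time, with a readable error."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} requires the optional package '{module_name}'. "
            "Install it to use this feature."
        ) from exc


def load_streaming_dataset(remote: str, local: str):
    # Imported here so users who never enable streaming do not need the package.
    streaming = import_optional("streaming", "Dataset streaming")
    # Assumed usage of the library; adjust to whatever arguments the PR settles on.
    return streaming.StreamingDataset(remote=remote, local=local)
```

The trade-off is that a missing package only surfaces when streaming is actually requested, rather than at module import time.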

@NanoCode012
Collaborator

NanoCode012 commented Apr 11, 2024

I think the section you're editing is for pretraining_dataset. If that was your intention, ignore the next part. Otherwise, you would need to edit the current cloud sections: https://github.com/OpenAccess-AI-Collective/axolotl/blob/5ed29393e34cf57b24a20ac1bafa3a94272ac3f5/src/axolotl/utils/data/sft.py#L221-L250 and https://github.com/OpenAccess-AI-Collective/axolotl/blob/5ed29393e34cf57b24a20ac1bafa3a94272ac3f5/src/axolotl/utils/data/sft.py#L318-L333.


One part I recall about Mosaic streaming was that it required conversion to its DatasetFormat: https://docs.mosaicml.com/projects/streaming/en/stable/preparing_datasets/dataset_format.html#introduction

How would you deal with this? Or does this expect that the user has already converted it?

Imo, the current cloud dataset implementation is quite bare. I'm open to replacing it with this streaming method if that's cleaner.


Re: import. I don't think we have a specific preference. Maybe just add a simple utility, check_streaming_installed, along the lines of the existing check_mamba_ssm_installed function.
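
One possible shape for such a utility, as a sketch only: it checks whether the module can be found without importing it. `check_package_installed` is a hypothetical helper, and this is not necessarily how check_mamba_ssm_installed is written.

```python
import importlib.util


def check_package_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def check_streaming_installed() -> bool:
    # "streaming" is the import name used by `from streaming import StreamingDataset`.
    return check_package_installed("streaming")
```

Callers could then validate the config early and fail with a clear message before any data loading starts.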

@fmv1992

fmv1992 commented Apr 16, 2024

I thought it was better to open the PR directly: #1525. Let me know what you think.

(I suggest we move further discussion to that PR).
