Feat: Add dataset loading from S3, GCS #765
Conversation
Force-pushed from ec7ea57 to dc5fddf
    streaming=False,
    split=None,
)
ds = load_from_disk(config_dataset.path)
I don't believe this is equivalent, since the previous `load_dataset` call also has a `data_files` component to only load specific files from the path.
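The reviewer's concern can be illustrated with a stdlib-only sketch (directory layout and file names are hypothetical): `load_dataset` with `data_files` selects only the matching raw files under a path, roughly like a glob, whereas `load_from_disk` expects an Arrow dataset directory written by `save_to_disk` and fails on a directory of raw files.

```python
import glob
import os
import tempfile

# Hypothetical directory of raw dataset files (names are made up).
root = tempfile.mkdtemp()
for name in ("train.txt", "eval.txt", "README.md"):
    open(os.path.join(root, name), "w").close()

# load_dataset(..., data_files="*.txt") would pick up only the txt
# files; load_from_disk(root) instead looks for Arrow metadata written
# by save_to_disk(), which is why it raises here. The file selection
# that data_files performs is essentially a glob:
selected = sorted(glob.glob(os.path.join(root, "*.txt")))
print([os.path.basename(p) for p in selected])  # ['eval.txt', 'train.txt']
```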
This indeed broke my workflow: I have a directory with txt files inside; `load_dataset` knew how to handle that, but `load_from_disk` raises an exception.
@gordicaleksa , could you please open an Issue along with a sample config? It'll make it easier to remember this.
Very cool feature! Would it in theory be possible to extend the same code/approach also for model loading and checkpoint saving?
@jphme, thanks! That's my next plan. However, this approach relies partially on
How will we differentiate between downloading vs streaming from your cloud storage in the future when streaming is added? Could we just use the same path but differentiate between the two somehow?
Perhaps we might have a
Force-pushed from 3054512 to 51a530c
* Feat: Add dataset loading from S3, GCS
* chore: update docs
* chore: add more info on cloud loading
Closes #750

This allows loading datasets from S3/GCS.

Usage: `path: s3://path/to/data.jsonl` or `path: s3://path/to/parquet/dir/`. Use the same convention as a local dataset, passing `ds_type` where appropriate.

Test:

Note: Azure is commented out until we figure out how to pass creds securely.
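Based on the path examples above, a dataset config entry might look like the following sketch. The bucket/key, the `type` value, and the surrounding `datasets:` structure are assumptions following common dataset-config conventions, not taken from this PR:

```yaml
datasets:
  # hypothetical bucket and key; ds_type tells the loader the file format
  - path: s3://my-bucket/data/train.jsonl
    ds_type: json
    type: alpaca
```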