
Feat: Add dataset loading from S3, GCS #765

Merged: 3 commits merged into axolotl-ai-cloud:main from feat/cloud-loading on Nov 16, 2023

Conversation

NanoCode012 (Collaborator):

Closes #750

This allows loading from S3/GCS.

Usage: path: s3://path/to/data.jsonl or path: s3://path/to/parquet/dir/. Use the same convention as for a local dataset, passing ds_type where appropriate (see the config sketch below).
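
As a concrete illustration, a minimal config sketch using the standard axolotl datasets block (bucket name, ds_type value, and prompt type here are hypothetical, not from the PR):

    datasets:
      - path: s3://my-bucket/data/train.jsonl
        ds_type: json   # hypothetical; set to match the file format, as with local datasets
        type: alpaca    # hypothetical prompt format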

Test:

  • S3
  • GCS

Note: Azure is commented out until we figure out how to pass credentials securely.

The first review thread is attached to this hunk of the change:

        streaming=False,
        split=None,
    )
    ds = load_from_disk(config_dataset.path)
Collaborator:

I don't believe this is equivalent, since the previous load_dataset call also had a data_files argument to load only specific files from the path.
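
To make the concern concrete, a rough sketch of the two behaviors using the Hugging Face datasets API (file and directory names are illustrative, not the PR's actual arguments):

    from datasets import load_dataset, load_from_disk

    # Before: an explicit data_files argument restricts loading to specific
    # files under the path.
    ds = load_dataset(
        "json",
        data_files="path/to/data/train.jsonl",  # illustrative
        streaming=False,
        split=None,
    )

    # After: load_from_disk takes the path as a whole. There is no data_files
    # filter, and it expects the Arrow layout written by Dataset.save_to_disk().
    ds = load_from_disk("path/to/data")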

Contributor:

This indeed broke my workflow: I have a directory with txt files inside; load_dataset knew how to handle that, but load_from_disk raises an exception.
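
For instance (directory name hypothetical), the two loaders treat a folder of plain-text files very differently:

    from datasets import load_dataset, load_from_disk

    # load_dataset can use the "text" builder to read a directory of .txt files.
    ds = load_dataset("text", data_dir="my_txt_corpus")

    # load_from_disk only understands datasets written by save_to_disk(), so
    # the same directory raises (a FileNotFoundError about the path not being
    # a Dataset or DatasetDict directory).
    ds = load_from_disk("my_txt_corpus")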

NanoCode012 (Collaborator, Author):

@gordicaleksa, could you please open an issue along with a sample config? It'll make it easier to remember this.

jphme (Contributor) commented Nov 15, 2023:

Very cool feature! Would it, in theory, be possible to extend the same code/approach to model loading and checkpoint saving as well?

NanoCode012 (Collaborator, Author):

@jphme, thanks! That's my next plan. However, this approach relies partly on the datasets library's implementation for cloud loading. We would need a separate (maybe similar) pipeline for model saving.

casper-hansen (Collaborator):

How will we differentiate between downloading vs streaming from your cloud storage in the future when streaming is added? Could we just use the same path but differentiate between the two somehow?

NanoCode012 (Collaborator, Author):

> How will we differentiate between downloading vs streaming from your cloud storage in the future when streaming is added? Could we just use the same path but differentiate between the two somehow?

Perhaps we might have a stream: true parameter? Also, if I recall correctly, streaming would be done with the Streaming library, right?
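
Purely as a hypothetical sketch of that idea (no stream key exists in this PR):

    datasets:
      - path: s3://my-bucket/data.jsonl
        ds_type: json
        stream: true   # hypothetical flag from this discussion; not implemented here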

NanoCode012 merged commit 3cc67d2 into axolotl-ai-cloud:main on Nov 16, 2023
4 checks passed
NanoCode012 deleted the feat/cloud-loading branch on November 16, 2023 at 05:33
mkeoliya pushed a commit to mkeoliya/axolotl that referenced this pull request Dec 15, 2023
* Feat: Add dataset loading from S3, GCS

* chore: update docs

* chore: add more info on cloud loading
Closed by this pull request: Support Cloud Storage datasets (#750)