
Feat: Add dataset loading from S3, GCS #765

Merged: 3 commits merged into axolotl-ai-cloud:main from feat/cloud-loading on Nov 16, 2023

Conversation

NanoCode012 (Collaborator):

Closes #750

This allows loading from S3/GCS.

Usage: path: s3://path/to/data.jsonl or path: s3://path/to/parquet/dir/. Use the same convention as for a local dataset, passing ds_type where appropriate (see the config sketch below).
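
As a concrete illustration, a minimal config sketch using the standard axolotl datasets block (bucket name, ds_type value, and prompt type here are hypothetical, not from the PR):

    datasets:
      - path: s3://my-bucket/data/train.jsonl
        ds_type: json   # hypothetical; set to match the file format, as with local datasets
        type: alpaca    # hypothetical prompt format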

Test:

  • S3
  • GCS

Note: Azure is commented out until we figure out how to pass credentials securely.

The first review thread is attached to this hunk of the change:

        streaming=False,
        split=None,
    )
    ds = load_from_disk(config_dataset.path)
Collaborator:

I don't believe this is equivalent, since the previous load_dataset call also had a data_files argument to load only specific files from the path.
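
To make the concern concrete, a rough sketch of the two behaviors using the Hugging Face datasets API (file and directory names are illustrative, not the PR's actual arguments):

    from datasets import load_dataset, load_from_disk

    # Before: an explicit data_files argument restricts loading to specific
    # files under the path.
    ds = load_dataset(
        "json",
        data_files="path/to/data/train.jsonl",  # illustrative
        streaming=False,
        split=None,
    )

    # After: load_from_disk takes the path as a whole. There is no data_files
    # filter, and it expects the Arrow layout written by Dataset.save_to_disk().
    ds = load_from_disk("path/to/data")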

Contributor:

This indeed broke my workflow: I have a directory with txt files inside; load_dataset knew how to handle that, but load_from_disk raises an exception.
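
For instance (directory name hypothetical), the two loaders treat a folder of plain-text files very differently:

    from datasets import load_dataset, load_from_disk

    # load_dataset can use the "text" builder to read a directory of .txt files.
    ds = load_dataset("text", data_dir="my_txt_corpus")

    # load_from_disk only understands datasets written by save_to_disk(), so
    # the same directory raises (a FileNotFoundError about the path not being
    # a Dataset or DatasetDict directory).
    ds = load_from_disk("my_txt_corpus")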

NanoCode012 (Collaborator, Author):

@gordicaleksa, could you please open an issue along with a sample config? It'll make it easier to remember this.

jphme (Contributor) commented Nov 15, 2023:

Very cool feature! Would it, in theory, be possible to extend the same code/approach to model loading and checkpoint saving as well?

NanoCode012 (Collaborator, Author):

@jphme, thanks! That's my next plan. However, this approach relies partly on the datasets library's implementation for cloud loading. We would need a separate (maybe similar) pipeline for model saving.

casper-hansen (Collaborator):

How will we differentiate between downloading vs streaming from your cloud storage in the future when streaming is added? Could we just use the same path but differentiate between the two somehow?

NanoCode012 (Collaborator, Author):

> How will we differentiate between downloading vs streaming from your cloud storage in the future when streaming is added? Could we just use the same path but differentiate between the two somehow?

Perhaps we might have a stream: true parameter? Also, if I recall correctly, streaming would be done with the Streaming library, right?
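
Purely as a hypothetical sketch of that idea (no stream key exists in this PR):

    datasets:
      - path: s3://my-bucket/data.jsonl
        ds_type: json
        stream: true   # hypothetical flag from this discussion; not implemented here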

NanoCode012 merged commit 3cc67d2 into axolotl-ai-cloud:main on Nov 16, 2023
4 checks passed
NanoCode012 deleted the feat/cloud-loading branch on November 16, 2023 at 05:33
mkeoliya pushed a commit to mkeoliya/axolotl that referenced this pull request Dec 15, 2023
* Feat: Add dataset loading from S3, GCS

* chore: update docs

* chore: add more info on cloud loading
Closed by this pull request: Support Cloud Storage datasets (#750)