
Major overhaul of Streaming documentation #636

Merged · 36 commits · Apr 5, 2024
Conversation

snarayan21 (Collaborator):

Description of changes:

We're overhauling the Streaming documentation! The existing docs had rough edges, missing sections, and few diagrams, and were at times hard to understand. With this overhaul, Streaming should be much easier for customers to use, and the documentation should become a great reference for users.

Further additions will include:

  • Using Streaming with various distributed launchers
  • Some of the foundry tokenization/data scripts
  • Potential refresh of the vision/text how-to guides

The docs are now structured as below:

  • Homepage [edited]
  • Getting Started
  • Preparing Datasets
  • Dataset Configuration
    • Shard retrieval [partially new]
    • Shuffling [partially new]
    • Replication and Sampling [partially new]
    • Mixing Datasets [partially new]
  • Distributed Training
    • Requirements [partially new]
    • With Launchers [future PR]
    • Elastic determinism [new]
    • Fast resumption [new]
    • Performance tuning [partially new]
  • How-to Guides
    • Configure Cloud Storage Credentials [same] {@karan6181}
    • Text Data: Synthetic NLP [same] {@XiaohanZhangCMU}
    • Image Data: CIFAR-10 [same]
  • API Reference (from docstrings)

I've tagged people who should definitely look over/edit certain portions, but I'd also welcome feedback in all areas of the docs overhaul as well. Thanks!

Issue #, if available:

Merge Checklist:

Put an x (without spaces) in the boxes that apply. If you are unsure about any checklist item, please don't hesitate to ask. We are here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

  • I have read the contributor guidelines
  • This is a documentation change or typo fix. If so, skip the rest of this checklist.
  • I certify that the changes I am introducing will be backward compatible, and I have discussed concerns about this, if any, with the MosaicML team.
  • I have updated any necessary documentation, including README and API docs (if appropriate).

Tests

  • I ran pre-commit on my change. (check out the pre-commit section of prerequisites)
  • I have added tests that prove my fix is effective or that my feature works (if appropriate).
  • I ran the tests locally to make sure they pass. (check out testing)
  • I have added unit and/or integration tests as appropriate to ensure backward compatibility of the changes.


Resolved review threads: docs/source/index.md, docs/source/getting_started/main_concepts.md
@XiaohanZhangCMU (Member) left a comment:

LGTM

To use StreamingDataset, you must convert raw data into one of our supported serialized dataset formats. With massive datasets, the choice of serialization format is critical to the ultimate observed performance of the system. For deep learning models, we need extremely low-latency cold random access to individual samples to ensure that dataloading is not a bottleneck to training.

StreamingDataset is compatible with any data type, including **images**, **text**, **video**, and **multimodal** data. StreamingDataset supports the following formats:
* MDS (Mosaic Data Shard, most performant), through {class}`streaming.MDSWriter`
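The cold-random-access requirement above can be illustrated with a stdlib-only sketch of shard-index lookup. This is a hypothetical structure, loosely modeled on how an index records samples per shard, not Streaming's actual implementation:

```python
import bisect

# Hypothetical index metadata: number of samples in each shard file,
# as a dataset index (e.g. an index.json) might record it.
shard_sizes = [1000, 1000, 512]

# Precompute cumulative sample counts so lookup is a binary search.
cumulative = []
total = 0
for n in shard_sizes:
    total += n
    cumulative.append(total)

def locate(sample_id):
    """Map a global sample id to (shard index, offset within shard)."""
    shard = bisect.bisect_right(cumulative, sample_id)
    prev = cumulative[shard - 1] if shard > 0 else 0
    return shard, sample_id - prev
```

With this layout, fetching an arbitrary sample costs one O(log n) lookup plus a single read inside one shard file, which is what makes cold random access cheap.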
Contributor:

what does "most performant" mean?

it's possible to get marginally faster lookup with csv shards currently

snarayan21 (Collaborator, author):

most performant in terms of space and iteration time? is that not true?

@@ -0,0 +1,25 @@
# Dataset Format
Contributor:

utterly unmentioned is how the meta files work? what their entries look like in the index.json? how they can coexist, etc

suspect these docs are going to cause support traffic from people trying to 'bring their own' csv/jsonl/etc

snarayan21 (Collaborator, author) commented Apr 3, 2024:

@knighton Mind including information about the meta files in a separate PR? the old docs don't have any information on that, and since you're assigned to this page and know the most about the format, you'd be best equipped to add a bit on it. I included a "Metadata" section that covers the index.json file, but not the meta section yet.

snarayan21 (Collaborator, author):

Or, if you could provide a blurb about the meta files that I can include here, that would be great too.

| Numerical String | 'str_int' | `StrInt` | stores in UTF-8 |
| Numerical String | 'str_float' | `StrFloat` | stores in UTF-8 |
| Numerical String | 'str_decimal' | `StrDecimal` | stores in UTF-8 |
| Image | 'pil' | `PIL` | raw PIL image |
Contributor:

when does one prefer pil vs png vs jpeg? what does raw mean?

snarayan21 (Collaborator, author) commented Apr 3, 2024:

Whether the user prefers to use it or not is up to them lol, but "raw" meaning this class. Will add link.

snarayan21 (Collaborator, author):

Like it's just whether or not a user has png, jpeg, or uses the PIL image class directly

| Numpy Float | 'float16' | `Float16` | uses `numpy.float16` |
| Numpy Float | 'float32' | `Float32` | uses `numpy.float32` |
| Numpy Float | 'float64' | `Float64` | uses `numpy.float64` |
| Numerical String | 'str_int' | `StrInt` | stores in UTF-8 |
Contributor:

when would you use int vs int64 vs uint64 vs str_int? are they compatible with each other? why are they listed separately, so that even the existence of this choice users have available is non-obvious?

snarayan21 (Collaborator, author) commented Apr 3, 2024:

@knighton It would be great if you could answer those questions for me / suggest edits -- I'm not sure why we have str_int for example. And what do you mean by compatible with each other? int is for Python and int64/uint64 is for numpy
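One practical difference behind the int64-vs-str_int question: fixed-width encodings let a reader compute byte offsets directly, while variable-length string encodings need per-sample offsets recorded somewhere. A stdlib-only sketch (hypothetical, not Streaming's actual encoder):

```python
import struct

values = [7, -3, 2**40]

# Fixed width (int64-style): every sample is exactly 8 bytes, so
# sample i starts at byte 8*i and no per-sample offset table is needed.
fixed = b''.join(struct.pack('<q', v) for v in values)

def read_fixed(buf, i):
    return struct.unpack_from('<q', buf, 8 * i)[0]

# Variable width (str_int-style): UTF-8 digit strings of differing
# lengths, so random access needs recorded offsets alongside the data.
encoded = [str(v).encode('utf-8') for v in values]
offsets, pos = [], 0
for e in encoded:
    offsets.append((pos, pos + len(e)))
    pos += len(e)
variable = b''.join(encoded)

def read_var(buf, i):
    lo, hi = offsets[i]
    return int(buf[lo:hi].decode('utf-8'))
```

A str_int-style encoding can be smaller for short numbers (one byte for `7` vs eight), at the cost of bookkeeping; fixed-width types trade space for trivially computable offsets.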


5. An optional `compression` algorithm name (and level) if you would like to compress the shard files. This can reduce egress costs during training. StreamingDataset will uncompress shard files upon download during training. You can control whether to keep compressed shard files locally during training with the `keep_zip` flag -- more information [here](../dataset_configuration/shard_retrieval.md#Keeping-compressed-shards).

Supported compression algorithms:
Contributor:

there is perfectly fine code in the repo to visualize the different compression algos on (1) enc time, (2) enc size, and (3) dec time, and we should run the numbers on different kinds of shards and include the plots here, because results are what people care about

same for hashing iirc

need to note that the levels of different algos, when they exist, do not map in any uniform way to each other
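The kind of comparison being suggested can be sketched with stdlib codecs standing in for Streaming's actual compression algorithms (zlib/bz2/lzma here are illustrative, and the synthetic data is highly repetitive, so absolute numbers mean little):

```python
import time, zlib, bz2, lzma

def profile(name, compress, decompress, data):
    """Measure encode time, encoded size, and decode time for one codec."""
    t0 = time.perf_counter()
    blob = compress(data)
    t1 = time.perf_counter()
    out = decompress(blob)
    t2 = time.perf_counter()
    assert out == data  # round-trip must be lossless
    return name, t1 - t0, len(blob), t2 - t1

data = b'sample shard bytes ' * 10_000

results = [
    profile('zlib-1', lambda d: zlib.compress(d, 1), zlib.decompress, data),
    profile('zlib-9', lambda d: zlib.compress(d, 9), zlib.decompress, data),
    profile('bz2-9',  lambda d: bz2.compress(d, 9),  bz2.decompress,  data),
    profile('lzma',   lzma.compress,                 lzma.decompress, data),
]

for name, enc_t, size, dec_t in results:
    print(f'{name}: enc {enc_t:.4f}s, {size} bytes, dec {dec_t:.4f}s')
```

Note that "level 9" in one codec and "level 9" in another are unrelated scales, which is the point knighton raises below about levels not mapping uniformly across algorithms.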

snarayan21 (Collaborator, author):

> need to note that the levels of different algos, when they exist, do not map in any uniform way to each other

@knighton could you clarify this please? And add a suggestion if possible?

As for the visualization, that can go in a different PR as well, but yes that would be nice to add.

knighton (Contributor) left a comment:

@snarayan21
snarayan21 (Collaborator, author):
@knighton Thanks for the comments, I had some questions and could use your help with the needed edits. ty!

@snarayan21 snarayan21 requested a review from knighton April 4, 2024 16:01
knighton (Contributor) left a comment:

LGTMing per offline discussion

@snarayan21 snarayan21 merged commit 24e9182 into main Apr 5, 2024
8 checks passed
@snarayan21 snarayan21 deleted the saaketh/docs_update branch April 5, 2024 21:04
4 participants