
Major overhaul of Streaming documentation #636

Merged · 36 commits · Apr 5, 2024
Conversation

snarayan21 (Collaborator):

Description of changes:

We're overhauling the Streaming documentation! The existing docs had rough edges, missing sections, and few diagrams, and were at times hard to understand. With this overhaul, Streaming should be much easier for customers to use, and the documentation should become a great reference for users.

Further additions will include:

  • Using Streaming with various distributed launchers
  • Some of the foundry tokenization/data scripts
  • Potential refresh of the vision/text how-to guides

The docs are now structured as below:

  • Homepage [edited]
  • Getting Started
  • Preparing Datasets
  • Dataset Configuration
    • Shard retrieval [partially new]
    • Shuffling [partially new]
    • Replication and Sampling [partially new]
    • Mixing Datasets [partially new]
  • Distributed Training
    • Requirements [partially new]
    • With Launchers [future PR]
    • Elastic determinism [new]
    • Fast resumption [new]
    • Performance tuning [partially new]
  • How-to Guides
    • Configure Cloud Storage Credentials [same] {@karan6181}
    • Text Data: Synthetic NLP [same] {@XiaohanZhangCMU}
    • Image Data: CIFAR-10 [same]
  • API Reference (from docstrings)

I've tagged people who should definitely look over/edit certain portions, but I'd also welcome feedback in all areas of the docs overhaul as well. Thanks!

Issue #, if available:

Merge Checklist:

Put an x (without spaces) in the boxes that apply. If you are unsure about any checklist item, please don't hesitate to ask. We are here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

  • I have read the contributor guidelines
  • This is a documentation change or typo fix. If so, skip the rest of this checklist.
  • I certify that the changes I am introducing will be backward compatible, and I have discussed concerns about this, if any, with the MosaicML team.
  • I have updated any necessary documentation, including README and API docs (if appropriate).

Tests

  • I ran pre-commit on my change. (check out the pre-commit section of prerequisites)
  • I have added tests that prove my fix is effective or that my feature works (if appropriate).
  • I ran the tests locally to make sure they pass. (check out testing)
  • I have added unit and/or integration tests as appropriate to ensure backward compatibility of the changes.


Resolved review threads: docs/source/index.md, docs/source/getting_started/main_concepts.md
@XiaohanZhangCMU (Member) left a comment:

LGTM

To use StreamingDataset, you must convert raw data into one of our supported serialized dataset formats. With massive datasets, the choice of serialization format is critical to the ultimate observed performance of the system. For deep learning models, we need extremely low-latency cold random access to individual samples to ensure that dataloading is not a bottleneck to training.

StreamingDataset is compatible with any data type, including **images**, **text**, **video**, and **multimodal** data. StreamingDataset supports the following formats:
* MDS (Mosaic Data Shard, most performant), through {class}`streaming.MDSWriter`
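The cold-random-access requirement above can be illustrated with a stdlib-only sketch of shard-index lookup. This is a hypothetical structure, loosely modeled on how an index records samples per shard, not Streaming's actual implementation:

```python
import bisect

# Hypothetical index metadata: number of samples in each shard file,
# as a dataset index (e.g. an index.json) might record it.
shard_sizes = [1000, 1000, 512]

# Precompute cumulative sample counts so lookup is a binary search.
cumulative = []
total = 0
for n in shard_sizes:
    total += n
    cumulative.append(total)

def locate(sample_id):
    """Map a global sample id to (shard index, offset within shard)."""
    shard = bisect.bisect_right(cumulative, sample_id)
    prev = cumulative[shard - 1] if shard > 0 else 0
    return shard, sample_id - prev
```

With this layout, fetching an arbitrary sample costs one O(log n) lookup plus a single read inside one shard file, which is what makes cold random access cheap.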
Contributor:

what does "most performant" mean?

it's possible to get marginally faster lookup with csv shards currently

snarayan21 (Collaborator, author):

most performant in terms of space and iteration time? is that not true?

@@ -0,0 +1,25 @@
# Dataset Format
Contributor:

utterly unmentioned is how the meta files work? what their entries look like in the index.json? how they can coexist, etc

suspect these docs are going to cause support traffic from people trying to 'bring their own' csv/jsonl/etc

snarayan21 (Collaborator, author) commented Apr 3, 2024:

@knighton Mind including information about the meta files in a separate PR? the old docs don't have any information on that, and since you're assigned to this page and know the most about the format, you'd be best equipped to add a bit on it. I included a "Metadata" section that covers the index.json file, but not the meta section yet.

snarayan21 (Collaborator, author):

Or, if you could provide a blurb about the meta files that I can include here, that would be great too.

| Numerical String | 'str_int' | `StrInt` | stores in UTF-8 |
| Numerical String | 'str_float' | `StrFloat` | stores in UTF-8 |
| Numerical String | 'str_decimal' | `StrDecimal` | stores in UTF-8 |
| Image | 'pil' | `PIL` | raw PIL image |
Contributor:

when does one prefer pil vs png vs jpeg? what does raw mean?

snarayan21 (Collaborator, author) commented Apr 3, 2024:

Whether the user prefers to use it or not is up to them lol, but "raw" meaning this class. Will add link.

snarayan21 (Collaborator, author):

Like it's just whether or not a user has png, jpeg, or uses the PIL image class directly

| Numpy Float | 'float16' | `Float16` | uses `numpy.float16` |
| Numpy Float | 'float32' | `Float32` | uses `numpy.float32` |
| Numpy Float | 'float64' | `Float64` | uses `numpy.float64` |
| Numerical String | 'str_int' | `StrInt` | stores in UTF-8 |
Contributor:

when would you use int vs int64 vs uint64 vs str_int? are they compatible with each other? why are they listed separately, so that even the existence of this choice users have available is non-obvious?

snarayan21 (Collaborator, author) commented Apr 3, 2024:

@knighton It would be great if you could answer those questions for me / suggest edits -- I'm not sure why we have str_int for example. And what do you mean by compatible with each other? int is for Python and int64/uint64 is for numpy
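One practical difference behind the int64-vs-str_int question: fixed-width encodings let a reader compute byte offsets directly, while variable-length string encodings need per-sample offsets recorded somewhere. A stdlib-only sketch (hypothetical, not Streaming's actual encoder):

```python
import struct

values = [7, -3, 2**40]

# Fixed width (int64-style): every sample is exactly 8 bytes, so
# sample i starts at byte 8*i and no per-sample offset table is needed.
fixed = b''.join(struct.pack('<q', v) for v in values)

def read_fixed(buf, i):
    return struct.unpack_from('<q', buf, 8 * i)[0]

# Variable width (str_int-style): UTF-8 digit strings of differing
# lengths, so random access needs recorded offsets alongside the data.
encoded = [str(v).encode('utf-8') for v in values]
offsets, pos = [], 0
for e in encoded:
    offsets.append((pos, pos + len(e)))
    pos += len(e)
variable = b''.join(encoded)

def read_var(buf, i):
    lo, hi = offsets[i]
    return int(buf[lo:hi].decode('utf-8'))
```

A str_int-style encoding can be smaller for short numbers (one byte for `7` vs eight), at the cost of bookkeeping; fixed-width types trade space for trivially computable offsets.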


5. An optional `compression` algorithm name (and level) if you would like to compress the shard files. This can reduce egress costs during training. StreamingDataset will uncompress shard files upon download during training. You can control whether to keep compressed shard files locally during training with the `keep_zip` flag -- more information [here](../dataset_configuration/shard_retrieval.md#Keeping-compressed-shards).

Supported compression algorithms:
Contributor:

there is perfectly fine code in the repo to visualize the different compression algos on (1) enc time, (2) enc size, and (3) dec time, and we should run the numbers on different kinds of shards and include the plots here, because results are what people care about

same for hashing iirc

need to note that the levels of different algos, when they exist, do not map in any uniform way to each other
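The kind of comparison being suggested can be sketched with stdlib codecs standing in for Streaming's actual compression algorithms (zlib/bz2/lzma here are illustrative, and the synthetic data is highly repetitive, so absolute numbers mean little):

```python
import time, zlib, bz2, lzma

def profile(name, compress, decompress, data):
    """Measure encode time, encoded size, and decode time for one codec."""
    t0 = time.perf_counter()
    blob = compress(data)
    t1 = time.perf_counter()
    out = decompress(blob)
    t2 = time.perf_counter()
    assert out == data  # round-trip must be lossless
    return name, t1 - t0, len(blob), t2 - t1

data = b'sample shard bytes ' * 10_000

results = [
    profile('zlib-1', lambda d: zlib.compress(d, 1), zlib.decompress, data),
    profile('zlib-9', lambda d: zlib.compress(d, 9), zlib.decompress, data),
    profile('bz2-9',  lambda d: bz2.compress(d, 9),  bz2.decompress,  data),
    profile('lzma',   lzma.compress,                 lzma.decompress, data),
]

for name, enc_t, size, dec_t in results:
    print(f'{name}: enc {enc_t:.4f}s, {size} bytes, dec {dec_t:.4f}s')
```

Note that "level 9" in one codec and "level 9" in another are unrelated scales, which is the point knighton raises below about levels not mapping uniformly across algorithms.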

snarayan21 (Collaborator, author):

> need to note that the levels of different algos, when they exist, do not map in any uniform way to each other

@knighton could you clarify this please? And add a suggestion if possible?

As for the visualization, that can go in a different PR as well, but yes that would be nice to add.

knighton (Contributor) left a comment:

@snarayan21
snarayan21 (Collaborator, author):
@knighton Thanks for the comments, I had some questions and could use your help with the needed edits. ty!

@snarayan21 snarayan21 requested a review from knighton April 4, 2024 16:01
knighton (Contributor) left a comment:

LGTMing per offline discussion

@snarayan21 snarayan21 merged commit 24e9182 into main Apr 5, 2024
8 checks passed
@snarayan21 snarayan21 deleted the saaketh/docs_update branch April 5, 2024 21:04
4 participants