Major overhaul of Streaming documentation #636
Conversation
LGTM
To use StreamingDataset, one must convert raw data into one of our supported serialized dataset formats. With massive datasets, our serialization format choices are critical to the ultimate observed performance of the system. For deep learning models, we need extremely low-latency cold random access at individual-sample granularity to ensure that dataloading is not a bottleneck to training.

StreamingDataset is compatible with any data type, including **images**, **text**, **video**, and **multimodal** data. StreamingDataset supports the following formats:
* MDS (Mosaic Data Shard, most performant), through {class}`streaming.MDSWriter`
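The excerpt above says sample-level cold random access is the design goal. A minimal sketch of the idea behind such a shard format — samples concatenated into one blob plus a small offset index, so reading sample `i` is one seek rather than a scan. This is illustrative only (the helper names `write_shard`/`read_sample` are hypothetical, not the actual MDS wire format):

```python
import io
import json

def write_shard(samples: list[bytes]) -> tuple[bytes, list[int]]:
    """Concatenate samples; return the blob and the byte offset of each boundary."""
    buf = io.BytesIO()
    offsets = [0]
    for s in samples:
        buf.write(s)
        offsets.append(buf.tell())
    return buf.getvalue(), offsets

def read_sample(blob: bytes, offsets: list[int], i: int) -> bytes:
    """Random access: slice straight to sample i using the offset index."""
    return blob[offsets[i]:offsets[i + 1]]

samples = [json.dumps({"id": i}).encode("utf-8") for i in range(100)]
blob, offsets = write_shard(samples)
assert read_sample(blob, offsets, 42) == samples[42]
```

With the index kept in memory (or in a small header), access cost is independent of shard size, which is what keeps dataloading off the critical path.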
what does "most performant" mean?
it's possible to get marginally faster lookup with csv shards currently
most performant in terms of space and iteration time? is that not true?
@@ -0,0 +1,25 @@
# Dataset Format
utterly unmentioned is how the meta files work? what their entries look like in the index.json? how they can coexist, etc
suspect these docs are going to cause support traffic from people trying to 'bring their own' csv/jsonl/etc
@knighton Mind including information about the `meta` files in a separate PR? The old docs don't have any information on that, and since you're assigned to this page and know the most about the format, you'd be best equipped to add a bit on it. I included a "Metadata" section that covers the `index.json` file, but not the `meta` section yet.
Or, if you could provide a blurb about the `meta` files that I can include here, that would be great too.
| Numerical String | 'str_int' | `StrInt` | stores in UTF-8 |
| Numerical String | 'str_float' | `StrFloat` | stores in UTF-8 |
| Numerical String | 'str_decimal' | `StrDecimal` | stores in UTF-8 |
| Image | 'pil' | `PIL` | raw PIL image |
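For the numerical-string rows above, "stores in UTF-8" means the value is serialized as its decimal text and parsed back on read. A hedged sketch of that round trip (the `encode_str_int`/`decode_str_int` helpers are hypothetical names, not Streaming's API):

```python
def encode_str_int(x: int) -> bytes:
    # Decimal text, encoded as UTF-8 bytes.
    return str(x).encode("utf-8")

def decode_str_int(b: bytes) -> int:
    return int(b.decode("utf-8"))

assert decode_str_int(encode_str_int(-12345)) == -12345
# The same text-round-trip pattern applies to 'str_float' and 'str_decimal':
assert float(str(2.5).encode("utf-8").decode("utf-8")) == 2.5
```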
when does one prefer `pil` vs `png` vs `jpeg`? what does raw mean?
Whether the user prefers to use it or not is up to them lol, but "raw" meaning this class. Will add link.
Like it's just whether or not a user has png, jpeg, or uses the PIL image class directly
| Numpy Float | 'float16' | `Float16` | uses `numpy.float16` |
| Numpy Float | 'float32' | `Float32` | uses `numpy.float32` |
| Numpy Float | 'float64' | `Float64` | uses `numpy.float64` |
| Numerical String | 'str_int' | `StrInt` | stores in UTF-8 |
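A sketch of the size/precision tradeoff behind the fixed-width float rows, using the stdlib `struct` formats `'e'`/`'f'`/`'d'`, which are 16/32/64-bit IEEE floats (the same widths as `numpy.float16`/`float32`/`float64`) — illustrative, not Streaming's internals:

```python
import struct

assert len(struct.pack("e", 1.5)) == 2   # half precision
assert len(struct.pack("f", 1.5)) == 4   # single precision
assert len(struct.pack("d", 1.5)) == 8   # double precision

# Lower precision saves space but loses digits:
third = 1.0 / 3.0
assert struct.unpack("e", struct.pack("e", third))[0] != third  # float16 rounds
assert struct.unpack("d", struct.pack("d", third))[0] == third  # float64 round-trips
```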
when would you use `int` vs `int64` vs `uint64` vs `str_int`? are they compatible with each other? why are they listed separately, so that even the existence of this choice users have available is non-obvious?
@knighton It would be great if you could answer those questions for me / suggest edits -- I'm not sure why we have `str_int`, for example. And what do you mean by compatible with each other? `int` is for Python and `int64`/`uint64` is for numpy
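One plausible answer to the size part of the question above, as a rough illustration: a fixed-width int64 always costs 8 bytes, while a decimal-text encoding (like `str_int`) costs roughly one byte per digit, so text wins for small values and fixed-width wins for large ones. The helpers are hypothetical, not Streaming's internals:

```python
import struct

def int64_size(x: int) -> int:
    # Fixed-width little-endian signed 64-bit: always 8 bytes.
    return len(struct.pack("<q", x))

def str_int_size(x: int) -> int:
    # Decimal text in UTF-8: one byte per digit (plus sign).
    return len(str(x).encode("utf-8"))

assert int64_size(7) == 8 and str_int_size(7) == 1            # text wins
assert int64_size(10**15) == 8 and str_int_size(10**15) == 16  # fixed-width wins
```

Note also that `str_int` has no 64-bit range limit, which is another reason it might exist alongside the numpy widths.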
5. An optional `compression` algorithm name (and level) if you would like to compress the shard files. This can reduce egress costs during training. StreamingDataset will decompress shard files upon download during training. You can control whether to keep compressed shard files locally during training with the `keep_zip` flag -- more information [here](../dataset_configuration/shard_retrieval.md#Keeping-compressed-shards).

Supported compression algorithms:
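A quick, hedged illustration of the tradeoff the `compression` option controls, using three stdlib codecs as stand-ins (actual ratios and speeds depend on your data and on which algorithms Streaming supports):

```python
import bz2
import lzma
import zlib

payload = b"sample text that repeats, " * 1000

sizes = {
    "none": len(payload),
    "zlib": len(zlib.compress(payload, 6)),
    "bz2": len(bz2.compress(payload)),
    "lzma": len(lzma.compress(payload)),
}
# Highly repetitive data compresses dramatically with any of these codecs,
# which is where the egress-cost savings come from:
assert all(sizes[k] < sizes["none"] // 10 for k in ("zlib", "bz2", "lzma"))
```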
there is perfectly fine code in the repo to visualize the different compression algos on (1) enc time, (2) enc size, and (3) dec time, and we should run the numbers on different kinds of shards and include the plots here because results are what people care about
same for hashing iirc
need to note that the levels of different algos, when they exist, do not map in any uniform way to each other
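The point about levels can be sketched concretely: each codec's level scale is its own knob, and only within one codec does a higher level reliably trade time for size. A small stdlib illustration (an assumption-laden sketch, not a benchmark of Streaming's shard codecs):

```python
import lzma
import zlib

data = b"abcd" * 50_000

# Within zlib, a higher level gives output no larger on this input:
z1 = len(zlib.compress(data, 1))
z9 = len(zlib.compress(data, 9))
assert z9 <= z1

# Within lzma, likewise across its own preset scale:
x0 = len(lzma.compress(data, preset=0))
x9 = len(lzma.compress(data, preset=9))
assert x9 <= x0

# But zlib's 1..9 and lzma's 0..9 are unrelated scales: comparing the level
# numbers across codecs says nothing about relative output size or speed.
```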
> need to note that the levels of different algos, when they exist, do not map in any uniform way to each other
@knighton could you clarify this please? And add a suggestion if possible?
As for the visualization, that can go in a different PR as well, but yes that would be nice to add.
@knighton Thanks for the comments, I had some questions and could use your help with the needed edits. ty!
LGTMing per offline discussion
Description of changes:
We're overhauling the Streaming documentation! The existing documentation had some rough edges and missing parts, lacked diagrams, and was at times hard to understand. Hopefully, with this, Streaming is much easier to use for customers, and the documentation becomes a great reference for users.
Further additions will include:
The docs are now structured as below:
Distributed Training
I've tagged people who should definitely look over/edit certain portions, but I'd also welcome feedback in all areas of the docs overhaul as well. Thanks!
Issue #, if available:
Merge Checklist:
Put an `x` without space in the boxes that apply. If you are unsure about any checklist item, please don't hesitate to ask. We are here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

Tests

`pre-commit` on my change. (check out the `pre-commit` section of prerequisites)