
Expanded range shuffle #394

Merged 8 commits into mosaicml:main on Aug 30, 2023
Conversation

@snarayan21 (Collaborator) commented Aug 23, 2023

Description of changes:

Implements a new shuffle - the "expanded range" shuffle - where the range in which a shard's samples can appear is expanded, and samples are placed randomly within that range.

Helpful slides: https://docs.google.com/presentation/d/1UHijFFgA0IPUxiOVv4aevGclSc83HKfJ6ae3PZtzc0c/edit?usp=sharing
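To make the idea concrete, here is a minimal, self-contained sketch of an expanded-range shuffle over one canonical node. This is not the py1e implementation in this PR: the function name `expanded_range_shuffle`, the centered-window placement, and the random-sort-key trick are illustrative assumptions based on the description above.

```python
import numpy as np

def expanded_range_shuffle(shard_sizes: np.ndarray, shuffle_block_size: int,
                           seed: int = 0) -> np.ndarray:
    """Toy expanded-range shuffle over one canonical node.

    Each sample draws a random sort key from a window of at most
    ``shuffle_block_size`` samples centered on its shard's original span and
    clipped to the node boundaries; sorting by key gives the sample order.
    """
    rng = np.random.default_rng(seed)
    starts = np.concatenate(([0], np.cumsum(shard_sizes)[:-1]))
    node_size = int(shard_sizes.sum())
    keys = np.empty(node_size)
    for start, size in zip(starts, shard_sizes):
        center = start + size / 2
        lo = max(0.0, center - shuffle_block_size / 2)
        hi = min(float(node_size), center + shuffle_block_size / 2)
        keys[start:start + size] = rng.uniform(lo, hi, size)
    return np.argsort(keys)

# 10 shards of 100 samples each, shuffle block size 500 (the example below).
order = expanded_range_shuffle(np.full(10, 100), 500)
```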

Suppose we have 10 shards in a canonical node, each with 100 samples, and our shuffle block size is 500 samples. With py1e, each shard's samples are distributed over a window of at most 500 samples (the shuffle block size). However, these windows cannot cross canonical node boundaries, because we don't want overlap between samples from different canonical nodes. So shard 1 gets a window of 300 samples (the maximum window is 500 samples, but it is clipped at the start of the canonical node), shard 2 gets 400 samples (clipped the same way), shards 3-8 get the full 500-sample window, shard 9, like shard 2, gets 400 samples (clipped at the end of the canonical node), and shard 10 gets 300 samples.
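As a quick check, the window arithmetic from this walkthrough (assuming windows of at most SBS samples centered on each shard's original span and clipped at the canonical node boundaries) reproduces those sizes:

```python
samples_per_shard, num_shards, sbs = 100, 10, 500
node_size = samples_per_shard * num_shards  # 1000 samples in the canonical node
for i in range(num_shards):
    center = i * samples_per_shard + samples_per_shard / 2
    lo = max(0, center - sbs / 2)
    hi = min(node_size, center + sbs / 2)
    print(f'shard {i + 1}: window [{lo:.0f}, {hi:.0f}) -> {hi - lo:.0f} samples')
# Prints window sizes 300, 400, 500, 500, 500, 500, 500, 500, 400, 300,
# matching the walkthrough above.
```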

Within a canonical node, the total number of shards we need per shuffle block (assuming all shards are the same size) is SBS/(# samples per shard). This is the same for algorithms like py1b and py1br. However, with py1b and py1br, as training approaches the end of a canonical node, the predownload looks ahead into the next canonical node. Because the first shuffle block in the next canonical node is fully shuffled, the predownload will likely need to fetch many shards for that upcoming shuffle block at once. This results in a spike in downloading and requires a higher cache limit to store these shards without any negative impact on throughput.

In contrast, with py1e shuffling, as training approaches the end of a canonical node, the number of shards needed to finish the current canonical node approaches 0.5*(SBS/(# samples per shard)). Similarly, at the start of the next canonical node, the number of shards needed to fulfill the first few batches also starts at 0.5*(SBS/(# samples per shard)). This means we can maintain a small predownload that looks ahead into the next canonical node with a lower cache limit: in total we need 0.5*(SBS/(# samples per shard)) + 0.5*(SBS/(# samples per shard)) = SBS/(# samples per shard) shards, so the number of shards stored per node stays constant throughout training.
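A rough back-of-the-envelope version of that accounting, using the numbers from the example above (hypothetical variable names, not library code):

```python
samples_per_shard, sbs = 100, 500
shards_per_block = sbs / samples_per_shard      # 5 shards cover one shuffle block

# Near the end of a canonical node, py1e only needs shards whose clipped
# windows still overlap the remaining samples: about half a block's worth.
tail_of_current_node = 0.5 * shards_per_block   # ~2.5 shards
# Predownload into the next canonical node starts at the same half-block level.
head_of_next_node = 0.5 * shards_per_block      # ~2.5 shards

print(tail_of_current_node + head_of_next_node) # 5.0 -> constant shard footprint
```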

Additionally, downloading is more balanced, since the number of shards to download ramps up steadily from 0.5*(SBS/(# samples per shard)) at the beginning of a canonical node to SBS/(# samples per shard). This gives a smoother download curve than algorithms like py1b or even py1br, which both download all the shards needed for a shuffle block within the span of a few batches.

Issue #, if available:

https://mosaicml.atlassian.net/browse/STR-127

Merge Checklist:

Put an x without space in the boxes that apply. If you are unsure about any checklist, please don't hesitate to ask. We are here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

  • I have read the contributor guidelines
  • This is a documentation change or typo fix. If so, skip the rest of this checklist.
  • I certify that the changes I am introducing will be backward compatible, and I have discussed concerns about this, if any, with the MosaicML team.
  • I have updated any necessary documentation, including README and API docs (if appropriate).

Tests

  • I ran pre-commit on my change. (check out the pre-commit section of prerequisites)
  • I have added tests that prove my fix is effective or that my feature works (if appropriate).
  • I ran the tests locally to make sure they pass. (check out testing)
  • I have added unit and/or integration tests as appropriate to ensure backward compatibility of the changes.

@knighton (Contributor) left a comment:

some minor nits then LFG

@snarayan21 merged commit aa2a75e into mosaicml:main on Aug 30, 2023
6 checks passed