
Expanded range shuffle #394

Merged 8 commits into mosaicml:main on Aug 30, 2023
Conversation

@snarayan21 (Collaborator) commented Aug 23, 2023

Description of changes:

Implements a new shuffle - the "expanded range" shuffle - where the range in which a shard's samples can appear is expanded, and samples are placed randomly within that range.

Helpful slides: https://docs.google.com/presentation/d/1UHijFFgA0IPUxiOVv4aevGclSc83HKfJ6ae3PZtzc0c/edit?usp=sharing
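To make the idea concrete, here is a minimal, self-contained sketch of an expanded-range shuffle over one canonical node. This is not the py1e implementation in this PR: the function name `expanded_range_shuffle`, the centered-window placement, and the random-sort-key trick are illustrative assumptions based on the description above.

```python
import numpy as np

def expanded_range_shuffle(shard_sizes: np.ndarray, shuffle_block_size: int,
                           seed: int = 0) -> np.ndarray:
    """Toy expanded-range shuffle over one canonical node.

    Each sample draws a random sort key from a window of at most
    ``shuffle_block_size`` samples centered on its shard's original span and
    clipped to the node boundaries; sorting by key gives the sample order.
    """
    rng = np.random.default_rng(seed)
    starts = np.concatenate(([0], np.cumsum(shard_sizes)[:-1]))
    node_size = int(shard_sizes.sum())
    keys = np.empty(node_size)
    for start, size in zip(starts, shard_sizes):
        center = start + size / 2
        lo = max(0.0, center - shuffle_block_size / 2)
        hi = min(float(node_size), center + shuffle_block_size / 2)
        keys[start:start + size] = rng.uniform(lo, hi, size)
    return np.argsort(keys)

# 10 shards of 100 samples each, shuffle block size 500 (the example below).
order = expanded_range_shuffle(np.full(10, 100), 500)
```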

Suppose we have 10 shards in a canonical node, each with 100 samples, and our shuffle block size is 500 samples. With py1e, each shard's samples are distributed over a window of at most 500 samples (the shuffle block size). However, these windows cannot cross canonical node boundaries, because we don't want overlap between samples from different canonical nodes. So shard 1 gets a window of 300 samples (the maximum window is 500 samples, but it is clipped at the start of the canonical node), shard 2 gets 400 samples (clipped the same way), shards 3-8 get the full 500-sample window, shard 9, like shard 2, gets 400 samples (clipped at the end of the canonical node), and shard 10 gets 300 samples.
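As a quick check, the window arithmetic from this walkthrough (assuming windows of at most SBS samples centered on each shard's original span and clipped at the canonical node boundaries) reproduces those sizes:

```python
samples_per_shard, num_shards, sbs = 100, 10, 500
node_size = samples_per_shard * num_shards  # 1000 samples in the canonical node
for i in range(num_shards):
    center = i * samples_per_shard + samples_per_shard / 2
    lo = max(0, center - sbs / 2)
    hi = min(node_size, center + sbs / 2)
    print(f'shard {i + 1}: window [{lo:.0f}, {hi:.0f}) -> {hi - lo:.0f} samples')
# Prints window sizes 300, 400, 500, 500, 500, 500, 500, 500, 400, 300,
# matching the walkthrough above.
```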

Within a canonical node, the total number of shards we need per shuffle block (assuming all shards are the same size) is SBS/(# samples per shard). This is the same for algorithms like py1b and py1br. However, with py1b and py1br, as training approaches the end of a canonical node, the predownload looks ahead into the next canonical node. Because the first shuffle block in the next canonical node is fully shuffled, the predownload will likely need to fetch many shards for that upcoming shuffle block at once. This results in a spike in downloading and requires a higher cache limit to store these shards without any negative impact on throughput.

In contrast, with py1e shuffling, as training approaches the end of a canonical node, the number of shards needed to finish the current canonical node approaches 0.5*(SBS/(# samples per shard)). Similarly, at the start of the next canonical node, the number of shards needed to fulfill the first few batches also starts at 0.5*(SBS/(# samples per shard)). This means we can maintain a small predownload that looks ahead into the next canonical node with a lower cache limit: in total we need 0.5*(SBS/(# samples per shard)) + 0.5*(SBS/(# samples per shard)) = SBS/(# samples per shard) shards, so the number of shards stored per node stays constant throughout training.
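A rough back-of-the-envelope version of that accounting, using the numbers from the example above (hypothetical variable names, not library code):

```python
samples_per_shard, sbs = 100, 500
shards_per_block = sbs / samples_per_shard      # 5 shards cover one shuffle block

# Near the end of a canonical node, py1e only needs shards whose clipped
# windows still overlap the remaining samples: about half a block's worth.
tail_of_current_node = 0.5 * shards_per_block   # ~2.5 shards
# Predownload into the next canonical node starts at the same half-block level.
head_of_next_node = 0.5 * shards_per_block      # ~2.5 shards

print(tail_of_current_node + head_of_next_node) # 5.0 -> constant shard footprint
```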

Additionally, downloading is more balanced, since the number of shards to download ramps up steadily from 0.5*(SBS/(# samples per shard)) at the beginning of a canonical node to SBS/(# samples per shard). This gives a smoother download curve than algorithms like py1b or even py1br, which both download all the shards needed for a shuffle block within the span of a few batches.

Issue #, if available:

https://mosaicml.atlassian.net/browse/STR-127

Merge Checklist:

Put an x without space in the boxes that apply. If you are unsure about any checklist, please don't hesitate to ask. We are here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

  • I have read the contributor guidelines
  • This is a documentation change or typo fix. If so, skip the rest of this checklist.
  • I certify that the changes I am introducing will be backward compatible, and I have discussed concerns about this, if any, with the MosaicML team.
  • I have updated any necessary documentation, including README and API docs (if appropriate).

Tests

  • I ran pre-commit on my change. (check out the pre-commit section of prerequisites)
  • I have added tests that prove my fix is effective or that my feature works (if appropriate).
  • I ran the tests locally to make sure they pass. (check out testing)
  • I have added unit and/or integration tests as appropriate to ensure backward compatibility of the changes.

@knighton (Contributor) left a comment:

some minor nits then LFG

@snarayan21 merged commit aa2a75e into mosaicml:main on Aug 30, 2023
6 checks passed