Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make streaming use the correct number of unique samples with SP/TP #619

Merged
merged 4 commits into from
Mar 5, 2024

Conversation

snarayan21
Copy link
Collaborator

@snarayan21 snarayan21 commented Mar 5, 2024

Description of changes:

The TP/SP PR assumed that streaming would track/receive an increased number of samples when using SP/TP. For example, with replication = 2, Composer would previously track the # of elapsed samples without taking into account replication, which meant that even if 100 unique samples had been seen in training, since replication was 2, Composer would think that 200 samples had been seen, and pass that on to Streaming for resumption.

This issue was previously addressed by dividing the sample offset on resumption. With Composer now correctly calculating the number of unique samples elapsed during training, this fix is not needed on the Streaming side.

This change means we also need to track # unique samples using StreamingDataLoader.

Tested this with Shashank's help -- loss curves are overlapping on resumption.
Screenshot 2024-03-04 at 6 13 05 PM

Issue #, if available:

Merge Checklist:

Put an x without space in the boxes that apply. If you are unsure about any checklist, please don't hesitate to ask. We are here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

  • I have read the contributor guidelines
  • This is a documentation change or typo fix. If so, skip the rest of this checklist.
  • I certify that the changes I am introducing will be backward compatible, and I have discussed concerns about this, if any, with the MosaicML team.
  • I have updated any necessary documentation, including README and API docs (if appropriate).

Tests

  • I ran pre-commit on my change. (check out the pre-commit section of prerequisites)
  • I have added tests that prove my fix is effective or that my feature works (if appropriate).
  • I ran the tests locally to make sure it pass. (check out testing)
  • I have added unit and/or integration tests as appropriate to ensure backward compatibility of the changes.

Copy link
Collaborator

@karan6181 karan6181 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to valid this change using a unit test?

@snarayan21
Copy link
Collaborator Author

@karan6181 it'll only come up during resumption, so Shashank and I tested it on a real run. Image is above

@karan6181
Copy link
Collaborator

@karan6181 it'll only come up during resumption, so Shashank and I tested it on a real run. Image is above

Sync'd up offline. The plan is to add the test as part of integration testing.

@karan6181 karan6181 merged commit 74261f7 into main Mar 5, 2024
6 checks passed
@karan6181 karan6181 deleted the saaketh/tp_fix branch March 5, 2024 23:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants