What does this PR do?
We don't actually use eval state: currently we set the eval dataloader state only if the dataset is a StreamingDataset, and we always set the current sample to 0 (i.e., we start from the beginning). This is unnecessary and causes errors when resuming training on a different number of GPUs. Since we iterate through the entire eval loader every time we run eval, there is no state that needs to be preserved for eval datasets. This PR removes that functionality completely.
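For reference, a minimal sketch of the behavior being removed; the helper name and state-dict key are illustrative assumptions, not the exact deleted code:

```python
# Hypothetical sketch of the removed logic: eval dataloader state was loaded
# only for StreamingDataset, with the sample offset always forced back to 0.
from streaming import StreamingDataset

def maybe_load_eval_state(eval_dataset, state_dict):
    if isinstance(eval_dataset, StreamingDataset):
        # Force eval to "resume" from the beginning -- a no-op in practice,
        # since every evaluation iterates the full eval loader anyway.
        state_dict['sample_in_epoch'] = 0  # key name is an assumption
        eval_dataset.load_state_dict(state_dict)
```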
Specifically, this caused errors when resuming on a different number of GPUs with `device_eval_batch_size` set. When going from 24 to 32 GPUs, with both runs specifying a `device_eval_batch_size` of 2, Streaming would try to resume using a global eval batch size of 64 samples, but the state dict would also show that 24 devices had been used in the initial run. To resume deterministically, Streaming attempts to partition the current global batch size over the initial number of devices, and 64 samples are not evenly divisible across 24 devices, throwing an error.
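A small worked example of the failing arithmetic; the variable names are illustrative, not Streaming's actual code:

```python
# Illustrative reproduction of the divisibility failure described above.
initial_num_devices = 24       # devices recorded in the checkpoint's state dict
resumed_num_devices = 32       # devices in the resumed run
device_eval_batch_size = 2

# The global eval batch size is computed from the *current* run...
global_eval_batch_size = resumed_num_devices * device_eval_batch_size  # 64

# ...but deterministic resumption partitions it over the *initial* device count.
assert global_eval_batch_size % initial_num_devices == 0, (
    f'{global_eval_batch_size} samples cannot be evenly partitioned over '
    f'{initial_num_devices} devices'  # 64 % 24 == 16, so this raises
)
```

Test runs have succeeded: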
- `1b-dense-fsdp-shardgradop-lion8b-fullckpt-resume-s44ELm`: resumes from an old checkpoint correctly
- `1b-dense-fsdp-shardgradop-lion8b-fullckpt-start-HDVoIC`: saves a new checkpoint with this PR
- `1b-dense-fsdp-shardgradop-lion8b-fullckpt-resume-VPH2X6`: resumes from the new checkpoint correctly

What issue(s) does this change relate to?
Before submitting
- Did you run `pre-commit` on your change? (see the `pre-commit` section of prerequisites)