-
Notifications
You must be signed in to change notification settings - Fork 412
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Explicitly print checkpoint downloading exception #3131
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bigning can we add a unit test in test_checkpoint.py
where download_file is mocked to raises an Exception? This will enforce that our try catch is working as expected consistently.
i'm not sure I understand -- shouldn't this rank already raise the error? |
@mvpatel2000 , pytorch captured this error, and then communicate this error, but unfortunuately, it failed to communicate and raised NCCL error. SO it never prints any information about the real error. This PR just capture the error, print the infromation, and raise it again. So program still crash with nccl error, but at least it prints the real error before crash. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lgtm
@j316chuck do we really need unit test for this PR, it just add a logging. it doesn't change anything. It's not trivial to set up a test for this. I have to basically write a mockObjectStore, mockDistCPObjectStoreReader, a mockPlan. Then inside mockDistCPObjectStoreReader, i have to create a comlex self.storage_data to work with the mockPlan to make them pass this https://github.com/mosaicml/composer/blob/dev/composer/utils/checkpoint.py#L236-L237. Or do you have a better idea to test this logging ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, I'm fine with skipping test
Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>
@bigning thanks for investigating. For this task, it seems like this is too high lift, let's skip this for now. I think us moving towards Mocks at the design phase will allow us to simulate ExceptionHandling better in the long run and should be the way we go in the future fwiw. |
* a * up * up * up * up * Update composer/utils/checkpoint.py Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com> --------- Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>
What does this PR do?
PyTorch will hold any exception raised during this read_data, torch code pointer: https://github.com/pytorch/pytorch/blob/main/torch/distributed/checkpoint/utils.py#L246-L255
PyTorch only raise the exception after communicate the exception across ranks https://github.com/pytorch/pytorch/blob/main/torch/distributed/checkpoint/utils.py#L251, but issue is that communication iteself might fail because comms mis-match across ranks. In this case, it raised NCCL error, and hide the real read_data exceptioin.
This PR print the exception immediately. It still raise NCCL error, but at least, it prints the checkpoint downloading error.
test
on 8-gpu checkpoint load test, i manually raise exception on rank 1 in the for-loop, i can reproduce the
c10::DistBackendError
. With this PR, i can see the logging of the exception.