Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explicitly print checkpoint downloading exception #3131

Merged
merged 7 commits into from
Mar 21, 2024

Conversation

bigning
Copy link
Contributor

@bigning bigning commented Mar 20, 2024

What does this PR do?

PyTorch will hold any exception raised during this read_data, torch code pointer: https://github.com/pytorch/pytorch/blob/main/torch/distributed/checkpoint/utils.py#L246-L255

PyTorch only raise the exception after communicate the exception across ranks https://github.com/pytorch/pytorch/blob/main/torch/distributed/checkpoint/utils.py#L251, but issue is that communication iteself might fail because comms mis-match across ranks. In this case, it raised NCCL error, and hide the real read_data exceptioin.

This PR print the exception immediately. It still raise NCCL error, but at least, it prints the checkpoint downloading error.

test

on 8-gpu checkpoint load test, i manually raise exception on rank 1 in the for-loop, i can reproduce the c10::DistBackendError. With this PR, i can see the logging of the exception.

Copy link
Contributor

@j316chuck j316chuck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bigning can we add a unit test in test_checkpoint.py where download_file is mocked to raises an Exception? This will enforce that our try catch is working as expected consistently.

@mvpatel2000
Copy link
Contributor

i'm not sure I understand -- shouldn't this rank already raise the error?

@bigning
Copy link
Contributor Author

bigning commented Mar 20, 2024

i'm not sure I understand -- shouldn't this rank already raise the error?

@mvpatel2000 , pytorch captured this error, and then communicate this error, but unfortunuately, it failed to communicate and raised NCCL error. SO it never prints any information about the real error.

This PR just capture the error, print the infromation, and raise it again. So program still crash with nccl error, but at least it prints the real error before crash.

Copy link
Contributor

@eracah eracah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm

@bigning
Copy link
Contributor Author

bigning commented Mar 21, 2024

@bigning can we add a unit test in test_checkpoint.py where download_file is mocked to raises an Exception? This will enforce that our try catch is working as expected consistently.

@j316chuck do we really need unit test for this PR, it just add a logging. it doesn't change anything. It's not trivial to set up a test for this. I have to basically write a mockObjectStore, mockDistCPObjectStoreReader, a mockPlan. Then inside mockDistCPObjectStoreReader, i have to create a comlex self.storage_data to work with the mockPlan to make them pass this https://github.com/mosaicml/composer/blob/dev/composer/utils/checkpoint.py#L236-L237.
Even after all above set-up, it's very easy to break the unit test if we update the read_data implementation.

Or do you have a better idea to test this logging ?

Copy link
Contributor

@mvpatel2000 mvpatel2000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I'm fine with skipping test

composer/utils/checkpoint.py Outdated Show resolved Hide resolved
Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>
@bigning bigning enabled auto-merge (squash) March 21, 2024 17:42
@j316chuck
Copy link
Contributor

j316chuck commented Mar 21, 2024

@bigning thanks for investigating. For this task, it seems like this is too high lift, let's skip this for now. I think us moving towards Mocks at the design phase will allow us to simulate ExceptionHandling better in the long run and should be the way we go in the future fwiw.

@bigning bigning merged commit cf031e2 into dev Mar 21, 2024
14 checks passed
@bigning bigning deleted the explicitly_print_downloading_error branch March 21, 2024 18:31
j316chuck pushed a commit that referenced this pull request May 16, 2024
* a

* up

* up

* up

* up

* Update composer/utils/checkpoint.py

Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>

---------

Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants