Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Composer object store download retry #3140

Merged
merged 17 commits into from
Mar 22, 2024
Merged

Composer object store download retry #3140

merged 17 commits into from
Mar 22, 2024

Conversation

bigning
Copy link
Contributor

@bigning bigning commented Mar 22, 2024

What does this PR do?

add object_store checkpoint downloading retry

test

bigning-debug-cp-loading-2-J6z0wB ConnectionError
in the logging, there is one rank succeeded at 2nd attempt.
image

another run: bigning-debug-cp-loading-2-9Lq8am ChunkedEncodingError
image

@bigning bigning changed the title Composer retry [wip] Composer retry Mar 22, 2024
@bigning bigning changed the title [wip] Composer retry Composer object store download retry Mar 22, 2024
Copy link
Contributor

@b-chu b-chu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some failing tests but overall looks good. Let's clean up the unused import and pyright ignores. This would retry with every exception though, not just the ones that aren't retried from the object store. So you would 5x the number of retries that were already happening.

@bigning
Copy link
Contributor Author

bigning commented Mar 22, 2024

So you would 5x the number of retries that were already happening.

@b-chu , for the retries that already happening, if the oci internal retry works, then we don't do extra retry. Only if the internal retry fails, we do extra retry (i think that's kind of what we want. :) ).

btw, the test error is unrelated, it's fixed in this PR #3142

Copy link
Contributor

@b-chu b-chu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of minor comments

composer/utils/checkpoint.py Outdated Show resolved Hide resolved
composer/utils/checkpoint.py Outdated Show resolved Hide resolved
composer/utils/checkpoint.py Outdated Show resolved Hide resolved
@bigning bigning requested a review from b-chu March 22, 2024 17:57
Copy link
Contributor

@b-chu b-chu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the cleanup!

@bigning bigning merged commit 1740040 into dev Mar 22, 2024
14 checks passed
@bigning bigning deleted the composer_retry branch March 22, 2024 19:51
j316chuck pushed a commit that referenced this pull request May 16, 2024
* retry

* up

* up

* up

* fix

* up

* a

* up

* up

* up

* up

* lint

* up

* up

* lint
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants