Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[checkpoint v2] add remote uploader class #3303

Merged
merged 15 commits into from
May 31, 2024
Merged

Conversation

bigning
Copy link
Contributor

@bigning bigning commented May 17, 2024

What does this PR do?

Add a RemoteUploader class
Compared to RemoteUploaderDownloader, it

  1. removes the download API, which is not used
  2. uses modern ProcessPoolExecutor which provides high-level interface for asynchronously executing callables
  3. greatly simplify the logic, it doesn't use any queues.
  4. it's neither a logger nor a callback, so it's very light weight and standalone component. Instead, it's caller's responsibility to wait the upload to finish
  5. much easier to maintain and add more features, since it's simpler

test

unit test

pytest tests/utils/test_remote_uploader.py

e2e test

see PR #3320

@bigning bigning marked this pull request as ready for review May 29, 2024 18:21
@bigning bigning changed the title [WIP] add remote uploader [checkpoint v2] add remote uploader May 29, 2024
@bigning bigning changed the title [checkpoint v2] add remote uploader [checkpoint v2] add remote uploader class May 29, 2024
Copy link
Contributor

@mvpatel2000 mvpatel2000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid race conditions with symlink and checkpoints, we do num_workers=1. Should we just remove multi worker functionality (remove public arg)?

Otherwise, @eracah does the redesign have a way to fix this?

@eracah
Copy link
Contributor

eracah commented May 29, 2024

To avoid race conditions with symlink and checkpoints, we do num_workers=1. Should we just remove multi worker functionality (remove public arg)?

Otherwise, @eracah does the redesign have a way to fix this?

it will fix this because we will upload the symlink synchronously after the file(s) in the upload_worker

@bigning
Copy link
Contributor Author

bigning commented May 29, 2024

To avoid race conditions with symlink and checkpoints, we do num_workers=1. Should we just remove multi worker functionality (remove public arg)?
Otherwise, @eracah does the redesign have a way to fix this?

it will fix this because we will upload the symlink synchronously after the file(s) in the upload_worker

This version just does exact same thing as existing remote_uploader_downloader.

Copy link
Contributor

@eracah eracah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good start, @bigning ! I think it could be a lot simpler if we move to a design where on each upload_file_async call a new executor.submit call is called. This would make it even simpler because we wouldn't need a file queue or an is_closed_event. See the design doc implementation section for more details

composer/utils/remote_uploader.py Outdated Show resolved Hide resolved
composer/utils/remote_uploader.py Outdated Show resolved Hide resolved
composer/utils/remote_uploader.py Outdated Show resolved Hide resolved
composer/utils/remote_uploader.py Outdated Show resolved Hide resolved
@eracah
Copy link
Contributor

eracah commented May 29, 2024

To avoid race conditions with symlink and checkpoints, we do num_workers=1. Should we just remove multi worker functionality (remove public arg)?
Otherwise, @eracah does the redesign have a way to fix this?

it will fix this because we will upload the symlink synchronously after the file(s) in the upload_worker

This version just does exact same thing as existing remote_uploader_downloader.

Right, but in the design doc, I designed a feature where the symlink gets uploaded synchronously after the file. Is that intended for a follow-up PR?

@bigning
Copy link
Contributor Author

bigning commented May 30, 2024

This is a good start, @bigning ! I think it could be a lot simpler if we move to a design where on each upload_file_async call a new executor.submit call is called. This would make it even simpler because we wouldn't need a file queue or an is_closed_event. See the design doc implementation section for more details

@eracah , that's actually my initial implementation (look at this), but it doesn't work unless we create a new object_store everytime we call upload_file. This is because, with mutli-process, it needs to pickle the input to the function, the object_store is not pickable. Does that make sense ?

@b-chu b-chu removed their request for review May 30, 2024 15:21
@b-chu
Copy link
Contributor

b-chu commented May 30, 2024

I'll leave the reviewing to others but adding myself to cc

@bigning
Copy link
Contributor Author

bigning commented May 30, 2024

To avoid race conditions with symlink and checkpoints, we do num_workers=1. Should we just remove multi worker functionality (remove public arg)?
Otherwise, @eracah does the redesign have a way to fix this?

it will fix this because we will upload the symlink synchronously after the file(s) in the upload_worker

This version just does exact same thing as existing remote_uploader_downloader.

Right, but in the design doc, I designed a feature where the symlink gets uploaded synchronously after the file. Is that intended for a follow-up PR?

discussed offline, this is a separate PR to add the new checkpoint saver

Copy link
Contributor

@eracah eracah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@bigning bigning enabled auto-merge (squash) May 31, 2024 00:19
@bigning bigning merged commit a6c4e43 into dev May 31, 2024
15 checks passed
@bigning bigning deleted the add_remote_file_uploader branch May 31, 2024 00:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants