Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support horovod #524

Merged
merged 19 commits into from
Apr 24, 2021
Merged

Support horovod #524

merged 19 commits into from
Apr 24, 2021

Conversation

zuston
Copy link
Member

@zuston zuston commented Apr 9, 2021

No description provided.

@zuston zuston marked this pull request as draft April 9, 2021 06:43
@zuston
Copy link
Member Author

zuston commented Apr 9, 2021

This PR is just to support static horovod by using gloo controller. Elastic horovod will support in later PR.

I will attach design doc later.

@zuston zuston requested a review from oliverhu April 9, 2021 09:57
@oliverhu
Copy link
Member

Thanks for the PR!! Looks good at a glance, not very sure how Horovod works elastically, looking forward to the design doc! Please also add some integration tests, and examples.

@zuston
Copy link
Member Author

zuston commented Apr 10, 2021

Attached local test case.
https://github.com/zuston/horovod-yarn

@zuston
Copy link
Member Author

zuston commented Apr 11, 2021

Thanks for the PR!! Looks good at a glance, not very sure how Horovod works elastically, looking forward to the design doc! Please also add some integration tests, and examples.

Example and proposal are all provided. Please check it.
More tests will be added in the few days. @oliverhu

@zuston zuston force-pushed the github-horovod branch 2 times, most recently from e8208e4 to 2c3c8af Compare April 12, 2021 11:08
@zuston zuston marked this pull request as ready for review April 12, 2021 11:09
@zuston zuston linked an issue Apr 12, 2021 that may be closed by this pull request
@oliverhu
Copy link
Member

@erwa @goyalankit could you also review this? We tried to resolve this before but didn't get a chance to finish.

# Horovod: save checkpoints only on worker 0 to prevent other workers from
# corrupting it.
if hvd.rank() == 0:
checkpoint.save(checkpoint_dir)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a new line below

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@erwa
Copy link
Contributor

erwa commented Apr 13, 2021

@erwa @goyalankit could you also review this? We tried to resolve this before but didn't get a chance to finish.

Saw you mentioned me. I don't think I'm the best to review given I haven't touched TonY code in almost two years lol. I'm not familiar with the new stages in TonY or the new stuff in Horovod (though it's cool to see we introduced new stages to support the Redis use case, which I remember well :).

One general comment, though, is that I see there is some refactoring work in this PR, too, such as the MLFrameworkRuntime refactoring and adding a lot of getters to TaskExecutor.java. I think the refactoring is nice (the MLFrameworkRuntime seems like a clean interface) but I think it should be done in a separate PR first to keep the PRs smaller and more discrete. Just a suggestion -- I'll let you guys who are actively working on/maintaining TonY make the call.

(P.S.: One small nit in the MLFrameworkRuntime interface: there's a typo: destory -> destroy)

@zuston zuston requested a review from oliverhu April 14, 2021 03:39
@zuston
Copy link
Member Author

zuston commented Apr 14, 2021

Thanks for your comments @erwa

One general comment, though, is that I see there is some refactoring work in this PR, too, such as the MLFrameworkRuntime refactoring and adding a lot of getters to TaskExecutor.java. I think the refactoring is nice (the MLFrameworkRuntime seems like a clean interface) but I think it should be done in a separate PR first to keep the PRs smaller and more discrete.

A good suggestion! I will refactor it in another separate PR.

(P.S.: One small nit in the MLFrameworkRuntime interface: there's a typo: destory -> destroy)

Fixed.

@zuston zuston requested a review from oliverhu April 21, 2021 04:05
Copy link
Member

@oliverhu oliverhu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left some comments, overall looks good! pls fix the ci

Comment on lines +169 to +170
private static boolean existPortFile(Path parentPath) throws IOException {
return getServerInfo(parentPath).getLeft() != -1 ? true : false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will there be a race condition on port allocation if there are a horovod job and a non-horovod job run in parallel?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't happen. The port files will be created in independent application folder.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not file conflict. I meant port conflict. We had this issue before that if you launch multiple TF jobs on the same node, there might be a race condition. The problem was, we reserve a port in Java for the job, and later release it before the job starts. However, the job might not start right away and would take a few minutes to set up. During that few minutes, another TF job could take up that port and causes port conflict.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It won't happen. In horovod driver, It is Python process(built-in horovod_driver.py) that selects an available port, which is not TonY reserved. Therefore, port race condition will not occur.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, makes sense

@zuston
Copy link
Member Author

zuston commented Apr 23, 2021

Fix all. please review again @oliverhu

@oliverhu
Copy link
Member

lgtm, thanks for the great work!!

@oliverhu oliverhu merged commit 00e88b0 into tony-framework:master Apr 24, 2021
@zuston zuston deleted the github-horovod branch July 10, 2021 01:41
zuston added a commit to zuston/TonY that referenced this pull request Feb 9, 2022
PR: tony-framework#524

* Support horovod

* Optimize horovod_driver code

* Optimize horovod driver test case

* Add horovod example

* Add some todo issue and remove test temporarily

* Add design proposal

* Add more tests

* Remove horovod_driver in test resources and optimize doc

* Start driver on task executor

* Add registerCallbackInfo rpc call and adopt it

* fix bug of build workerlist in horovod

* Update proposal

* Add horovod should pass test

* Fix some bugs

* Add file newline character

* Add more comment on code

* Fix test case failed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Horovod in TonY
4 participants