-
Notifications
You must be signed in to change notification settings - Fork 164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support horovod #524
Support horovod #524
Conversation
This PR is just to support static horovod by using gloo controller. Elastic horovod will support in later PR. I will attach design doc later. |
Thanks for the PR!! Looks good at a glance, not very sure how Horovod works elastically, looking forward to the design doc! Please also add some integration tests, and examples. |
Attached local test case. |
Example and proposal are all provided. Please check it. |
e8208e4
to
2c3c8af
Compare
@erwa @goyalankit could you also review this? We tried to resolve this before but didn't get a chance to finish. |
# Horovod: save checkpoints only on worker 0 to prevent other workers from | ||
# corrupting it. | ||
if hvd.rank() == 0: | ||
checkpoint.save(checkpoint_dir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a new line below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Saw you mentioned me. I don't think I'm the best to review given I haven't touched TonY code in almost two years lol. I'm not familiar with the new stages in TonY or the new stuff in Horovod (though it's cool to see we introduced new stages to support the Redis use case, which I remember well :). One general comment, though, is that I see there is some refactoring work in this PR, too, such as the MLFrameworkRuntime refactoring and adding a lot of getters to TaskExecutor.java. I think the refactoring is nice (the MLFrameworkRuntime seems like a clean interface) but I think it should be done in a separate PR first to keep the PRs smaller and more discrete. Just a suggestion -- I'll let you guys who are actively working on/maintaining TonY make the call. (P.S.: One small nit in the MLFrameworkRuntime interface: there's a typo: destory -> destroy) |
Thanks for your comments @erwa
A good suggestion! I will refactor it in another separate PR.
Fixed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left some comments, overall looks good! pls fix the ci
private static boolean existPortFile(Path parentPath) throws IOException { | ||
return getServerInfo(parentPath).getLeft() != -1 ? true : false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will there be a race condition on port allocation if there are a horovod job and a non-horovod job run in parallel?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't happen. The port files will be created in independent application folder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not file conflict. I meant port conflict. We had this issue before that if you launch multiple TF jobs on the same node, there might be a race condition. The problem was, we reserve a port in Java for the job, and later release it before the job starts. However, the job might not start right away and would take a few minutes to set up. During that few minutes, another TF job could take up that port and causes port conflict.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It won't happen. In horovod driver, It is Python process(built-in horovod_driver.py) that selects an available port, which is not TonY reserved. Therefore, port race condition will not occur.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, makes sense
tony-core/src/main/java/com/linkedin/tony/runtime/HorovodRuntime.java
Outdated
Show resolved
Hide resolved
Fix all. please review again @oliverhu |
lgtm, thanks for the great work!! |
PR: tony-framework#524 * Support horovod * Optimize horovod_driver code * Optimize horovod driver test case * Add horovod example * Add some todo issue and remove test temporarily * Add design proposal * Add more tests * Remove horovod_driver in test resources and optimize doc * Start driver on task executor * Add registerCallbackInfo rpc call and adopt it * fix bug of build workerlist in horovod * Update proposal * Add horovod should pass test * Fix some bugs * Add file newline character * Add more comment on code * Fix test case failed
No description provided.