consensus protocol #39

Open
wants to merge 6 commits into
base: main

Sergei-Lebedev

No description provided.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 13, 2021
@facebook-github-bot

@kingchc has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


@kingchc

kingchc commented Nov 1, 2021

resolved conflict (cc @Sergei-Lebedev)


@kingchc kingchc left a comment


One general comment: I think we also need to remove TORCH_UCC_CHECK at https://github.com/facebookresearch/torch_ucc/blob/main/src/torch_ucc_comm.cpp#L186 so that the application does not error out when UCC returns a timeout. I'm not sure how the CI tests get past it.
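
To illustrate the point above, here is a minimal sketch of status handling that treats a timeout as a recoverable outcome instead of throwing on every non-OK status. The `Status` and `Action` enums and the `classify` function are hypothetical stand-ins, not part of torch_ucc or UCC; the real code would work with UCC's own status codes such as UCC_ERR_TIMED_OUT.

```cpp
#include <stdexcept>

// Hypothetical stand-in for the status codes returned by the library.
enum class Status { Ok, InProgress, TimedOut, Error };

// Hypothetical description of what the caller should do next.
enum class Action { Done, KeepPolling, RunTimeoutDiagnostics };

// Map a status to an action. Only a hard error throws; a timeout is
// reported to the caller so that it can run diagnostics first.
Action classify(Status st) {
  switch (st) {
    case Status::Ok:
      return Action::Done;
    case Status::InProgress:
      return Action::KeepPolling;
    case Status::TimedOut:
      return Action::RunTimeoutDiagnostics;
    case Status::Error:
      break;
  }
  throw std::runtime_error("collective failed");
}
```

With a blanket check macro, the `TimedOut` case would throw before any diagnostics could run; separating the cases is one way to avoid that.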

@kingchc

kingchc commented Nov 3, 2021

When I tested the PR internally with some GPU training jobs (32 GPUs), I saw strange state values at random.
For example (I print the value as well, but it looks random):

[Rank 5][ProcessGroupUCC-1][COMM_CHECK][ERROR] on rank 21: Unknown state 32707

Still debugging this; it looks like a garbage value coming from somewhere.
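
The thread does not identify the cause of the unknown value. As a debugging aid only, one common way to tell "never written" apart from "written with garbage" is to fill the receive buffer with a sentinel before posting the receive and to range-check the value afterwards. The sketch below assumes hypothetical state values; the actual `torch_ucc_rank_state_t` in the PR may be defined differently.

```cpp
#include <cstdint>

// Hypothetical state values, chosen for illustration only.
enum RankState : int32_t {
  STATE_UNSET = -1,          // sentinel stored before any receive is posted
  STATE_IDLE = 0,
  STATE_IN_COLLECTIVE = 1,
  STATE_TIMED_OUT = 2,
};

// Receive buffer that starts out holding the sentinel.
struct StateBuffer {
  int32_t raw = STATE_UNSET;
};

enum class Verdict { NotReceived, Valid, Garbage };

// Classify the buffer contents after the receive was expected to complete.
Verdict inspect(const StateBuffer& buf) {
  if (buf.raw == STATE_UNSET) {
    return Verdict::NotReceived;  // the buffer was never overwritten
  }
  if (buf.raw >= STATE_IDLE && buf.raw <= STATE_TIMED_OUT) {
    return Verdict::Valid;
  }
  return Verdict::Garbage;  // overwritten with an out-of-range value
}
```

A value such as 32707 would be reported as `Garbage`, which narrows the search to whatever wrote into the buffer rather than to a receive that never completed.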

@Sergei-Lebedev Sergei-Lebedev marked this pull request as ready for review November 9, 2021 12:59

params.datatype = ucp_dt_make_contig(sizeof(torch_ucc_rank_state_t));
params.user_data = (void*)state;
params.cb.send = [](void* request, ucs_status_t status, void* user_data) {
torch_ucc_rank_state_t * state = (torch_ucc_rank_state_t*)user_data;

nit:
Is this cast required? We don't use state at all here, since the next line does delete user_data. Or did you intend to delete state?
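
For context on the nit, here is a minimal sketch of a completion callback that casts `user_data` back to its real type before releasing it, so that the typed pointer is the one being deleted. The `RankState` struct, the global counter, and the `on_send_complete` signature are hypothetical and simplified; they do not reproduce the UCX callback signature from the snippet.

```cpp
#include <memory>

int g_live_states = 0;  // counts live objects so the cleanup can be observed

// Hypothetical payload type standing in for the state object.
struct RankState {
  int value = 0;
  RankState() { ++g_live_states; }
  ~RankState() { --g_live_states; }
};

// Hypothetical callback with a "void* user_data" parameter.
void on_send_complete(void* user_data) {
  // Recover the real type first, then let unique_ptr delete the typed
  // pointer. Applying delete directly to a void* does not run the
  // destructor, and compilers warn that it is undefined.
  std::unique_ptr<RankState> state(static_cast<RankState*>(user_data));
  // state goes out of scope here and the object is destroyed.
}
```

Usage would look like `on_send_complete(new RankState())`, after which the destructor has run exactly once.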

kingchc pushed a commit to kingchc/torch_ucc that referenced this pull request Nov 19, 2021
Summary: Pull Request resolved: facebookresearch#39

Differential Revision: D31765741

Pulled By: kingchc

fbshipit-source-id: 3932a76964b2d2a3412cb60796994e572041d690
kingchc pushed a commit to kingchc/torch_ucc that referenced this pull request Nov 19, 2021
Summary: Pull Request resolved: facebookresearch#39

Differential Revision: https://www.internalfb.com/diff/D31765741?entry_point=27

Pulled By: kingchc

fbshipit-source-id: eb987398fc22f12200273934513cbdc24b5bc93f

kingchc pushed a commit to kingchc/torch_ucc that referenced this pull request Dec 14, 2021
Summary: Pull Request resolved: facebookresearch#39

Reviewed By: srinivas212

Differential Revision: D31765741

Pulled By: kingchc

fbshipit-source-id: 71dceaa79bab323cc07547289932496f31976f3b
kingchc pushed a commit to kingchc/torch_ucc that referenced this pull request Dec 15, 2021
Summary: Pull Request resolved: facebookresearch#39

Reviewed By: srinivas212

Differential Revision: D31765741

Pulled By: kingchc

fbshipit-source-id: 140ac43108b8a9f97fdda622dc8910a0677afebd
Comment on lines +780 to 784
if (work->request_->status == UCC_ERR_TIMED_OUT) {
check_communicator_status(work->rank, work->comm_id, work->seq_num_,
work->eps);
}
work->finalize(eptr);

@Sergei-Lebedev - I just realized the CI tests are failing because I suggested letting the progress thread abort immediately in check_communicator_status, so the main thread is not able to catch the exception. I think moving work->finalize(eptr) before check_communicator_status should fix it. WDYT?

Suggested change
- if (work->request_->status == UCC_ERR_TIMED_OUT) {
-   check_communicator_status(work->rank, work->comm_id, work->seq_num_,
-                             work->eps);
- }
- work->finalize(eptr);
+ work->finalize(eptr);
+ if (work->request_->status == UCC_ERR_TIMED_OUT) {
+   check_communicator_status(work->rank, work->comm_id, work->seq_num_,
+                             work->eps);
+ }
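
The reasoning behind the reordering can be sketched with a simplified model: if the status check never returns, anything placed after it is skipped, so the error has to be recorded on the work object first. The `Work` struct, `check_status_and_abort`, and `complete` below are hypothetical, and the abort is modeled with a thrown exception so that the example can be exercised; the real progress thread may terminate differently.

```cpp
#include <exception>
#include <stdexcept>

// Hypothetical, simplified work object.
struct Work {
  bool timed_out = false;
  bool finalized = false;
  std::exception_ptr error;

  // Record the error so that a waiting thread can observe it later.
  void finalize(std::exception_ptr e) {
    error = e;
    finalized = true;
  }
};

// Stand-in for the status check; it never returns normally.
[[noreturn]] void check_status_and_abort() {
  throw std::runtime_error("progress thread aborted");
}

// Suggested order: finalize first, then run the check that may not return.
void complete(Work& w, std::exception_ptr eptr) {
  w.finalize(eptr);
  if (w.timed_out) {
    check_status_and_abort();
  }
}
```

With the original order, `finalize` would never run for a timed-out work item in this model, which matches the symptom described in the comment.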

kingchc pushed a commit to kingchc/torch_ucc that referenced this pull request Feb 4, 2022
Summary: Pull Request resolved: facebookresearch#39

Reviewed By: srinivas212

Differential Revision: D31765741

Pulled By: kingchc

fbshipit-source-id: b34a63a86d1de06ee214b882797e37c376f5ce03

3 participants