
event driven approach #237

Merged 1 commit into cisco-open:main on Sep 28, 2022
Conversation

myungjin (Contributor)

The flame library has a few inefficiencies.

  1. To check whether there is at least one peer in a channel, it busy-waits by calling the empty() function and sleeping for one second.

  2. Head-of-line blocking: to receive data from the peers of a channel, the implementation gets the list of peers, loops through each peer, and calls recv(). If the first peer is a straggler, every peer behind it is blocked.

  3. As the model size grows, merging the transmitted messages (a byte-array concatenation) takes a while. This merge runs inside a coroutine and is not suspended until it finishes, which delays heartbeat transmission because the heartbeat function is also a coroutine. The delayed heartbeat triggers a timeout on the other side of the gRPC channel, which breaks the training job.

The first two issues are addressed in an event-driven manner. The third is addressed with threading: the byte-array concatenation runs in a separate thread, so the message-handling asyncio coroutine releases the CPU quickly and heartbeat transmission stays timely (see the sketch below).
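For illustration only, here is a minimal sketch of the two patterns described above; it is not the actual flame implementation, and all names (PeerRegistry, add_peer, assemble, etc.) are hypothetical. An asyncio.Event replaces the empty()-plus-sleep polling loop, and the byte-array concatenation is pushed to a worker thread with run_in_executor so the event loop (and the heartbeat coroutine) stays responsive.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


class PeerRegistry:
    """Hypothetical sketch of an event-driven peer registry."""

    def __init__(self):
        self._peers = set()
        self._has_peer = asyncio.Event()
        self._executor = ThreadPoolExecutor(max_workers=1)

    def add_peer(self, peer_id: str) -> None:
        # Adding the first peer wakes any coroutine waiting below;
        # no busy-wait loop with empty() + sleep(1) is needed.
        self._peers.add(peer_id)
        self._has_peer.set()

    async def wait_for_peer(self) -> None:
        # Suspends until add_peer() fires the event.
        await self._has_peer.wait()

    async def assemble(self, chunks: list[bytes]) -> bytes:
        # Byte-array concatenation can be slow for large models, so run it
        # in a worker thread; the event loop keeps serving the heartbeat.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(self._executor, b"".join, chunks)
```

Similarly, head-of-line blocking can be avoided by awaiting all peers' recv tasks concurrently (for example with asyncio.wait and return_when=FIRST_COMPLETED) instead of looping over the peers sequentially.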

Also, graceful termination is enforced by ensuring that data transfer is complete (a short sketch follows).
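As a rough illustration of the graceful-termination idea (hypothetical names, not the flame API): in-flight send tasks are awaited before the backend is torn down.

```python
import asyncio


async def cleanup(pending_sends: set[asyncio.Task]) -> None:
    # Wait for in-flight data transfers to finish before shutting down,
    # so a terminating worker does not drop a partially sent model update.
    if pending_sends:
        await asyncio.gather(*pending_sends, return_exceptions=True)
```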

@RaviKhandavilli (Contributor) left a comment

LGTM

self.p2pbe._set_heart_beat(req.end_id)

await dck_task

async def _dummy_context_keeper(self, context):
-   """Sleep 1000 sec (an abitrary big value) to keep context alive."""
+   """Sleep 100000 sec (an abitrary big value) to keep context alive."""
Contributor

Check whether there is any other method; such long sleeps do not look good.

myungjin (Contributor, Author)

That is a good suggestion. I will address your comment.
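One way to avoid an arbitrary long sleep, sketched here with hypothetical names (ContextKeeper, release, etc.) rather than the flame codebase, is to wait on an asyncio.Event that is set when the context should be released, e.g. on peer disconnect or job termination.

```python
import asyncio


class ContextKeeper:
    """Hypothetical sketch: keep a context alive until explicitly released,
    instead of sleeping for an arbitrary large number of seconds."""

    def __init__(self):
        self._release = asyncio.Event()

    async def keep(self, context) -> None:
        # Blocks until release() is called; no arbitrary sleep duration needed.
        await self._release.wait()

    def release(self) -> None:
        # Called when the peer disconnects or the job terminates.
        self._release.set()
```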

lib/python/flame/common/constants.py (conversation resolved)
@RaviKhandavilli (Contributor) left a comment

LGTM

@myungjin merged commit 6ea5606 into cisco-open:main on Sep 28, 2022
@myungjin deleted the event_driven_approach branch on September 28, 2022, 16:25
Successfully merging this pull request may close these issues.

[BUG] Flame with grpc backend fails to perform weights aggregation for ResNet50 Model