
event driven approach #237

Merged 1 commit into cisco-open:main on Sep 28, 2022
Conversation

myungjin (Contributor)

The flame library has a few inefficiencies.

  1. To check whether there is at least one peer in a channel, it busy-waits by calling the empty() function and sleeping for one second.

  2. Head-of-line blocking: to receive data from the peers of a channel, the implementation gets the list of peers, loops through each peer, and calls recv(). If the first peer is a straggler, every peer behind it is blocked.

  3. As the model size grows, merging the transmitted messages (a byte-array concatenation) takes a while. This merge runs inside a coroutine and is not suspended until it finishes, which delays heartbeat transmission because the heartbeat function is also a coroutine. The delayed heartbeat triggers a timeout on the other side of the gRPC channel, which breaks the training job.

The first two issues are addressed in an event-driven manner. The third is addressed with threading: the byte-array concatenation runs in a separate thread, so the message-handling asyncio coroutine releases the CPU quickly and heartbeat transmission stays timely (see the sketch below).
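For illustration only, here is a minimal sketch of the two patterns described above; it is not the actual flame implementation, and all names (PeerRegistry, add_peer, assemble, etc.) are hypothetical. An asyncio.Event replaces the empty()-plus-sleep polling loop, and the byte-array concatenation is pushed to a worker thread with run_in_executor so the event loop (and the heartbeat coroutine) stays responsive.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


class PeerRegistry:
    """Hypothetical sketch of an event-driven peer registry."""

    def __init__(self):
        self._peers = set()
        self._has_peer = asyncio.Event()
        self._executor = ThreadPoolExecutor(max_workers=1)

    def add_peer(self, peer_id: str) -> None:
        # Adding the first peer wakes any coroutine waiting below;
        # no busy-wait loop with empty() + sleep(1) is needed.
        self._peers.add(peer_id)
        self._has_peer.set()

    async def wait_for_peer(self) -> None:
        # Suspends until add_peer() fires the event.
        await self._has_peer.wait()

    async def assemble(self, chunks: list[bytes]) -> bytes:
        # Byte-array concatenation can be slow for large models, so run it
        # in a worker thread; the event loop keeps serving the heartbeat.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(self._executor, b"".join, chunks)
```

Similarly, head-of-line blocking can be avoided by awaiting all peers' recv tasks concurrently (for example with asyncio.wait and return_when=FIRST_COMPLETED) instead of looping over the peers sequentially.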

Also, graceful termination is enforced by ensuring that data transfer is complete (a short sketch follows).
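As a rough illustration of the graceful-termination idea (hypothetical names, not the flame API): in-flight send tasks are awaited before the backend is torn down.

```python
import asyncio


async def cleanup(pending_sends: set[asyncio.Task]) -> None:
    # Wait for in-flight data transfers to finish before shutting down,
    # so a terminating worker does not drop a partially sent model update.
    if pending_sends:
        await asyncio.gather(*pending_sends, return_exceptions=True)
```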

@RaviKhandavilli (Contributor) left a comment

LGTM

self.p2pbe._set_heart_beat(req.end_id)

await dck_task

async def _dummy_context_keeper(self, context):
-   """Sleep 1000 sec (an abitrary big value) to keep context alive."""
+   """Sleep 100000 sec (an abitrary big value) to keep context alive."""
Contributor

Check whether there is any other method; such long sleeps do not look good.

myungjin (Contributor, Author)

That is a good suggestion. I will address your comment.
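One way to avoid an arbitrary long sleep, sketched here with hypothetical names (ContextKeeper, release, etc.) rather than the flame codebase, is to wait on an asyncio.Event that is set when the context should be released, e.g. on peer disconnect or job termination.

```python
import asyncio


class ContextKeeper:
    """Hypothetical sketch: keep a context alive until explicitly released,
    instead of sleeping for an arbitrary large number of seconds."""

    def __init__(self):
        self._release = asyncio.Event()

    async def keep(self, context) -> None:
        # Blocks until release() is called; no arbitrary sleep duration needed.
        await self._release.wait()

    def release(self) -> None:
        # Called when the peer disconnects or the job terminates.
        self._release.set()
```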

lib/python/flame/common/constants.py (conversation resolved)
@RaviKhandavilli (Contributor) left a comment

LGTM

@myungjin merged commit 6ea5606 into cisco-open:main on Sep 28, 2022
@myungjin deleted the event_driven_approach branch on September 28, 2022, 16:25
Successfully merging this pull request may close these issues.

[BUG] Flame with grpc backend fails to perform weights aggregation for ResNet50 Model