Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

event driven approach #237

Merged
merged 1 commit into from
Sep 28, 2022
Merged

Commits on Sep 28, 2022

  1. event driven approach

    The flame library presents some inefficiency.
    
    1) In order to check if there is at least one peer in a channel, it
    busy-waits by calling empty() function and sleep one second.
    
    2) head-of-line blocking problem: to recv data from peers of a
    channel, the implementation gets the list of peers, loops through
    each of peers and calls recv(). If the first peer is a straggler, it
    causes a head-of-line blocking problem.
    
    3) When model size increases, it takes a while to merge transmitted
    messages (it is byte array concatenation operation). This process
    takes place in a coroutine; and it isn't suspended until it is
    finished. This delays a heart beat transmission as the heart beat
    function is also a coroutine. The delayed heart beat triggers a
    timeout at the other side of grpc channel, which breaks a training job.
    
    The first two issues are addressed in an event-driven manner. The
    third problem is addressed by using threading. The byte array
    concatenation takes place in a separate thread. So, the asyncio coroutine
    of handling message will release cpu resource quickly. Therefore, the
    heart beat transmission function works in a timely manner.
    
    Also, graceful termination is enforced by ensuring data transfer is
    complete.
    myungjin committed Sep 28, 2022
    Configuration menu
    Copy the full SHA
    a3b3c46 View commit details
    Browse the repository at this point in the history