fix: recv_fifo for asyncfl #449

Merged: 1 commit, Sep 20, 2023

Conversation

myungjin
Contributor

Description

recv_fifo() takes first_k as one of its arguments; it specifies that the first first_k messages to arrive are received and returned.

To implement this, the asynchronous function _streamer_for_recv_fifo() was introduced. However, the function has a bug that is triggered in asyncfl. It adds the first first_k messages to an rx queue and re-adds the remaining messages to the head of each corresponding end id's rx queue. The remaining messages are then processed in the next recv_fifo() call. But because _streamer_for_recv_fifo() is asynchronous, there is no guarantee that the messages saved in the previous call will be processed first. This left some trainers blocked and unable to participate in future rounds.

The issue is fixed as follows. Since recv_fifo() always returns only first_k messages from the rx queue, there is no harm in adding the remaining messages to the queue as well. Removing all the code that handled the remaining messages also makes the code cleaner and easier to understand.
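
The idea behind the fix can be illustrated with a minimal sketch. This is not the project's code: the class FifoReceiver and the method deliver() are hypothetical names, and a single asyncio.Queue stands in for the rx queue. It only shows that when every message goes into one persistent FIFO queue and each call takes exactly first_k items, the leftover messages need no special handling and keep their arrival order for the next call.

```python
import asyncio


class FifoReceiver:
    """Hypothetical stand-in for a channel; not the project's actual class."""

    def __init__(self):
        # One shared rx queue that persists across recv_fifo() calls.
        self._rxq = asyncio.Queue()

    async def deliver(self, end_id, msg):
        # Every incoming message is simply appended; nothing is re-queued
        # at the head of a per-end queue.
        await self._rxq.put((end_id, msg))

    async def recv_fifo(self, first_k):
        # Take exactly first_k messages in arrival order. Anything left in
        # the queue stays there for the next call.
        received = []
        for _ in range(first_k):
            received.append(await self._rxq.get())
        return received


async def demo():
    rx = FifoReceiver()
    for end_id in ("t1", "t2", "t3"):
        await rx.deliver(end_id, f"update-from-{end_id}")
    first = await rx.recv_fifo(1)  # only the earliest message
    rest = await rx.recv_fifo(2)   # the leftovers, still in arrival order
    return first, rest


first, rest = asyncio.run(demo())
```

In this sketch the second call sees the messages from t2 and t3 in the order they arrived, without any code that saves and restores them between calls.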

Type of Change

  • Bug Fix
  • New Feature
  • Breaking Change
  • Refactor
  • Documentation
  • Other (please describe)

Checklist

  • I have read the contributing guidelines
  • Existing issues have been referenced (where applicable)
  • I have verified this change is not present in other open pull requests
  • Functionality is documented
  • All code style checks pass
  • New code contribution is covered by automated tests
  • All new and existing tests pass

@codecov-commenter

Codecov Report

Merging #449 (fa494d7) into main (e175d42) will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##             main     #449   +/-   ##
=======================================
  Coverage   14.82%   14.82%           
=======================================
  Files          48       48           
  Lines        2827     2827           
=======================================
  Hits          419      419           
  Misses       2379     2379           
  Partials       29       29           

@myungjin myungjin merged commit d99fb23 into cisco-open:main Sep 20, 2023
3 checks passed
@myungjin myungjin deleted the fix_recv_fifo branch September 20, 2023 02:18