
updated feddyn implementation pytorch #392

Merged · 1 commit merged into cisco-open:main on Apr 7, 2023

Conversation

GustavBaumgart (Collaborator)

The new implementation matches the algorithm implemented in the paper. It is no longer similar to the pseudocode they provide.

The new implementation includes an updated communication protocol where all trainers send the dataset size for that round to the aggregator at the beginning of every round. This allows the aggregator to compute adaptive hyperparameters for each trainer individually and then send alpha_adpt along with the global model to begin training for the round.
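
To make the adaptive-hyperparameter step concrete, here is a minimal, framework-agnostic sketch of how a per-trainer alpha_adpt could be derived from the reported dataset sizes. The function name and the scaling rule (alpha divided by the trainer's weighted data share, following the FedDyn authors' reference code) are assumptions for illustration, not flame's actual API.

# Hypothetical sketch: compute a per-trainer adaptive alpha from the dataset
# sizes that trainers report at the start of a round.
# Assumed rule: alpha_adpt_i = alpha / (n * p_i), where p_i is trainer i's
# share of the total number of samples and n is the number of trainers.
def compute_adaptive_alphas(alpha: float, dataset_sizes: dict) -> dict:
    total = sum(dataset_sizes.values())
    n = len(dataset_sizes)
    alphas = {}
    for trainer_id, size in dataset_sizes.items():
        p_i = size / total  # trainer's share of all samples
        alphas[trainer_id] = alpha / (n * p_i)
    return alphas

# Example: trainers with unbalanced data receive different alpha_adpt values.
sizes = {"trainer-1": 1000, "trainer-2": 4000}
print(compute_adaptive_alphas(0.01, sizes))
# {'trainer-1': 0.025, 'trainer-2': 0.00625}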

@GustavBaumgart GustavBaumgart requested review from jaemin-shin, lkurija1 and myungjin and removed request for jaemin-shin April 4, 2023 22:57
@codecov-commenter commented Apr 4, 2023

Codecov Report

Merging #392 (f22515b) into main (35a46de) will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##             main     #392   +/-   ##
=======================================
  Coverage   15.15%   15.15%           
=======================================
  Files          48       48           
  Lines        2778     2778           
=======================================
  Hits          421      421           
  Misses       2328     2328           
  Partials       29       29           


@myungjin myungjin (Contributor) left a comment

Left a couple of comments; some need an offline discussion.


PRE_TRAIN = 'pre_train'
DURING_TRAIN = 'during_train'
POST_TRAIN = 'post_train'
Contributor:

Neither DURING_TRAIN nor POST_TRAIN seems to be used in this PR. Also, the naming can be improved.
To be precise, the aggregator doesn't have a "training"-related state.

Also, TrainerState and AggregatorState are exactly the same. If you want to keep the same set of states (PRE_TRAIN, DURING_TRAIN, POST_TRAIN), why not have only one enum class?

How about the following?

class TrainState(Enum):
    PRE = "pre"
    DURING = "during"
    POST = "post"

Collaborator Author:

Yeah, I think I can use TrainState for both the trainer's and the aggregator's updates to the regularizer.
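
As a rough sketch of that idea, a single TrainState enum could be shared wherever the regularizer state is saved. The FedDynRegularizer class below is a simplified stand-in for illustration, not flame's implementation.

# Simplified sketch: one shared TrainState enum used by both trainer- and
# aggregator-side updates to the regularizer.
from enum import Enum

class TrainState(Enum):
    PRE = "pre"
    DURING = "during"
    POST = "post"

class FedDynRegularizer:
    def __init__(self) -> None:
        self.snapshots = {}

    def save_state(self, state: TrainState, **kwargs) -> None:
        # Trainer and aggregator record state under the same enum values,
        # e.g. a reference to the local model right after training.
        self.snapshots[state] = kwargs

reg = FedDynRegularizer()
reg.save_state(TrainState.POST, loc_model="local-model-placeholder")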

"aggregator": [
"distribute",
"aggregate",
"alpha"
Contributor:

This function tag is confusing. You may want to use a verb to indicate it's a function. alpha sounds like a variable/argument; for the optimizer, alpha is indeed an argument.

Collaborator Author:

I'm thinking "get_dataset_size"

Contributor:

How about getDatasetSize?

"trainer": [
"fetch",
"upload",
"dataset"
Contributor:

Same here. Please rename it with a verb.

@GustavBaumgart GustavBaumgart (Collaborator Author) Apr 6, 2023

"upload_dataset_size"?

Contributor:

How about uploadDatasetSize?

"kwargs": {}
},
"optimizer": {
"sort": "feddyn",
Contributor:

wrong indentation

Collaborator Author:

Got it!


import logging

from ....common.util import (MLFramework, get_ml_framework_in_use,
Contributor:

Let's use an absolute path instead of a relative path.

elif tag == TAG_ALPHA:
self._save_sizes(tag)

def _save_sizes(self, tag: str) -> None:
Contributor:

Perhaps name the function get_dataset_size()?

@myungjin myungjin (Contributor) left a comment

left a few comments

@@ -82,7 +82,7 @@ def _send_weights(self, tag: str) -> None:
channel.await_join()

self._update_weights()
- self.regularizer.save_state(TrainerState.POST_TRAIN, loc_model=self.model)
+ self.regularizer.save_state(TrainState.POST, loc_model=self.model)
Contributor:

Why is this needed in coord_syncfl? coord_syncfl is hierarchical FL, so I guess we can't use feddyn with coord_syncfl.

Collaborator Author:

At the time, I just saw it was already there, so I modified the line. Does the new PR remove it? I could also just remove it here.

Collaborator Author:

This might have happened because this was originally (and still is) in syncfl, even though it is really only used in FedDyn as of now.

Collaborator Author:

Decided to remove this ^

from ...message import MessageType
from ...tasklet import Loop, Tasklet

from ..top_aggregator import TopAggregator as BaseTopAggregator
Contributor:

Please use the absolute path instead of the relative path.

Collaborator Author:

Oh I see, will do this for everything in lib/python/flame/mode/horizontal/feddyn

@@ -156,7 +157,7 @@ def _send_weights(self, tag: str) -> None:
end = channel.one_end(VAL_CH_STATE_SEND)

self._update_weights()
- self.regularizer.save_state(TrainerState.POST_TRAIN, loc_model=self.model)
+ self.regularizer.save_state(TrainState.POST, loc_model=self.model)
Contributor:

Now that there is a dedicated trainer.py for feddyn, do we still need this here?

Collaborator Author:

I could move this to the FedDyn trainer only (making that file longer), since it is the only place I use it now. Previously, I thought I might leave it here in case another optimizer uses it as well, but it might not be worth it: if it is removed here, save_state can be removed from the other regularizers as well.
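
As a rough sketch of what that restructuring might look like, the generic syncfl trainer would stay regularizer-agnostic and only the FedDyn trainer would record the post-training snapshot. The sketch reuses the TrainState and FedDynRegularizer stand-ins from the earlier sketch; the class names are placeholders, not flame's actual classes.

# Hypothetical sketch: keep the regularizer bookkeeping in the FedDyn trainer
# only, instead of the shared syncfl _send_weights() shown in the diff above.
class SyncFLTrainer:
    """Stand-in for the generic syncfl trainer."""

    def __init__(self, model) -> None:
        self.model = model

    def _update_weights(self) -> None:
        ...  # copy the trained weights into self.model (elided)

    def _send_weights(self, tag: str) -> None:
        self._update_weights()
        # No regularizer call here any more.

class FedDynTrainer(SyncFLTrainer):
    """FedDyn-specific trainer that owns the regularizer bookkeeping."""

    def __init__(self, model) -> None:
        super().__init__(model)
        self.regularizer = FedDynRegularizer()

    def _send_weights(self, tag: str) -> None:
        super()._send_weights(tag)
        # Snapshot the local model after the weights are finalized; the
        # FedDyn regularizer needs it when building the next round's loss.
        self.regularizer.save_state(TrainState.POST, loc_model=self.model)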

@myungjin myungjin (Contributor) left a comment

lgtm

@GustavBaumgart GustavBaumgart merged commit 7b4c99a into cisco-open:main Apr 7, 2023
@jaemin-shin jaemin-shin (Collaborator) left a comment

Overall LGTM!

Some thoughts on sharing data sizes from trainers:

  • Could sharing individual trainers' dataset sizes with aggregators be regarded as a potential privacy leak? If so, what FedDyn requires for setting its alpha value appropriately incurs a tradeoff between accuracy and privacy, which we may want to consider when investigating FedDyn further.
  • How should trainers' dataset sizes be shared when there are multiple aggregators or hierarchical aggregators? I think we do not need to implement this in this PR, but evaluating it further in our experiments with different configurations could be tricky.


task_load_data = Tasklet("", self.load_data)

task_get_dataset = Tasklet("", self.get, TAG_GET_DATATSET_SIZE)
Collaborator:

Maybe we should name it "task_get_dataset_size" instead of "task_get_dataset", as the latter could be misleading?
