
updated feddyn implementation pytorch #392

Merged · 1 commit merged into cisco-open:main on Apr 7, 2023

Conversation

GustavBaumgart (Collaborator)

The new implementation matches the algorithm implemented in the paper. It is no longer similar to the pseudocode they provide.

The new implementation includes an updated communication protocol where all trainers send the dataset size for that round to the aggregator at the beginning of every round. This allows the aggregator to compute adaptive hyperparameters for each trainer individually and then send alpha_adpt along with the global model to begin training for the round.
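
To make the adaptive-hyperparameter step concrete, here is a minimal, framework-agnostic sketch of how a per-trainer alpha_adpt could be derived from the reported dataset sizes. The function name and the scaling rule (alpha divided by the trainer's weighted data share, following the FedDyn authors' reference code) are assumptions for illustration, not flame's actual API.

# Hypothetical sketch: compute a per-trainer adaptive alpha from the dataset
# sizes that trainers report at the start of a round.
# Assumed rule: alpha_adpt_i = alpha / (n * p_i), where p_i is trainer i's
# share of the total number of samples and n is the number of trainers.
def compute_adaptive_alphas(alpha: float, dataset_sizes: dict) -> dict:
    total = sum(dataset_sizes.values())
    n = len(dataset_sizes)
    alphas = {}
    for trainer_id, size in dataset_sizes.items():
        p_i = size / total  # trainer's share of all samples
        alphas[trainer_id] = alpha / (n * p_i)
    return alphas

# Example: trainers with unbalanced data receive different alpha_adpt values.
sizes = {"trainer-1": 1000, "trainer-2": 4000}
print(compute_adaptive_alphas(0.01, sizes))
# {'trainer-1': 0.025, 'trainer-2': 0.00625}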

@GustavBaumgart GustavBaumgart requested review from jaemin-shin, lkurija1 and myungjin and removed request for jaemin-shin April 4, 2023 22:57
@codecov-commenter commented Apr 4, 2023

Codecov Report

Merging #392 (f22515b) into main (35a46de) will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##             main     #392   +/-   ##
=======================================
  Coverage   15.15%   15.15%           
=======================================
  Files          48       48           
  Lines        2778     2778           
=======================================
  Hits          421      421           
  Misses       2328     2328           
  Partials       29       29           


@myungjin myungjin (Contributor) left a comment

Left a couple of comments; some need an offline discussion.


PRE_TRAIN = 'pre_train'
DURING_TRAIN = 'during_train'
POST_TRAIN = 'post_train'
Contributor:

Neither DURING_TRAIN nor POST_TRAIN seems to be used in this PR. Also, the naming can be improved.
To be precise, the aggregator doesn't have a "training"-related state.

Also, TrainerState and AggregatorState are exactly the same. If you want to keep the same set of states (PRE_TRAIN, DURING_TRAIN, POST_TRAIN), why not have only one enum class?

How about the following?

class TrainState(Enum):
    PRE = "pre"
    DURING = "during"
    POST = "post"

Collaborator Author:

Yeah, I think I can use TrainState for both the trainer's and the aggregator's updates to the regularizer.
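
As a rough sketch of that idea, a single TrainState enum could be shared wherever the regularizer state is saved. The FedDynRegularizer class below is a simplified stand-in for illustration, not flame's implementation.

# Simplified sketch: one shared TrainState enum used by both trainer- and
# aggregator-side updates to the regularizer.
from enum import Enum

class TrainState(Enum):
    PRE = "pre"
    DURING = "during"
    POST = "post"

class FedDynRegularizer:
    def __init__(self) -> None:
        self.snapshots = {}

    def save_state(self, state: TrainState, **kwargs) -> None:
        # Trainer and aggregator record state under the same enum values,
        # e.g. a reference to the local model right after training.
        self.snapshots[state] = kwargs

reg = FedDynRegularizer()
reg.save_state(TrainState.POST, loc_model="local-model-placeholder")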

"aggregator": [
"distribute",
"aggregate",
"alpha"
Contributor:

This function tag is confusing. You may want to use a verb to indicate it's a function. alpha sounds like a variable/argument; for the optimizer, alpha is indeed an argument.

Collaborator Author:

I'm thinking "get_dataset_size"

Contributor:

How about getDatasetSize?

"trainer": [
"fetch",
"upload",
"dataset"
Contributor:

Same here. Please rename it with a verb.

@GustavBaumgart GustavBaumgart (Collaborator Author) Apr 6, 2023

"upload_dataset_size"?

Contributor:

How about uploadDatasetSize?

"kwargs": {}
},
"optimizer": {
"sort": "feddyn",
Contributor:

wrong indentation

Collaborator Author:

Got it!


import logging

from ....common.util import (MLFramework, get_ml_framework_in_use,
Contributor:

Let's use an absolute path instead of a relative path.

elif tag == TAG_ALPHA:
self._save_sizes(tag)

def _save_sizes(self, tag: str) -> None:
Contributor:

Perhaps name the function get_dataset_size()?

@myungjin myungjin (Contributor) left a comment

left a few comments

@@ -82,7 +82,7 @@ def _send_weights(self, tag: str) -> None:
channel.await_join()

self._update_weights()
- self.regularizer.save_state(TrainerState.POST_TRAIN, loc_model=self.model)
+ self.regularizer.save_state(TrainState.POST, loc_model=self.model)
Contributor:

Why is this needed in coord_syncfl? coord_syncfl is hierarchical FL, so I guess we can't use feddyn with coord_syncfl.

Collaborator Author:

At the time, I just saw it was already there, so I modified the line. Does the new PR remove it? I could also just remove it here.

Collaborator Author:

This might have happened because this was originally (and still is) in syncfl, even though it is really only used in FedDyn as of now.

Collaborator Author:

Decided to remove this ^

from ...message import MessageType
from ...tasklet import Loop, Tasklet

from ..top_aggregator import TopAggregator as BaseTopAggregator
Contributor:

Please use the absolute path instead of the relative path.

Collaborator Author:

Oh I see, will do this for everything in lib/python/flame/mode/horizontal/feddyn

@@ -156,7 +157,7 @@ def _send_weights(self, tag: str) -> None:
end = channel.one_end(VAL_CH_STATE_SEND)

self._update_weights()
- self.regularizer.save_state(TrainerState.POST_TRAIN, loc_model=self.model)
+ self.regularizer.save_state(TrainState.POST, loc_model=self.model)
Contributor:

Now that there is a dedicated trainer.py for feddyn, do we still need this here?

Collaborator Author:

I could move this to the FedDyn trainer only (making that file longer), since it is the only place I use it now. Previously, I thought I might leave it here in case another optimizer uses it as well, but it might not be worth it: if it is removed here, save_state can be removed from the other regularizers as well.
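
As a rough sketch of what that restructuring might look like, the generic syncfl trainer would stay regularizer-agnostic and only the FedDyn trainer would record the post-training snapshot. The sketch reuses the TrainState and FedDynRegularizer stand-ins from the earlier sketch; the class names are placeholders, not flame's actual classes.

# Hypothetical sketch: keep the regularizer bookkeeping in the FedDyn trainer
# only, instead of the shared syncfl _send_weights() shown in the diff above.
class SyncFLTrainer:
    """Stand-in for the generic syncfl trainer."""

    def __init__(self, model) -> None:
        self.model = model

    def _update_weights(self) -> None:
        ...  # copy the trained weights into self.model (elided)

    def _send_weights(self, tag: str) -> None:
        self._update_weights()
        # No regularizer call here any more.

class FedDynTrainer(SyncFLTrainer):
    """FedDyn-specific trainer that owns the regularizer bookkeeping."""

    def __init__(self, model) -> None:
        super().__init__(model)
        self.regularizer = FedDynRegularizer()

    def _send_weights(self, tag: str) -> None:
        super()._send_weights(tag)
        # Snapshot the local model after the weights are finalized; the
        # FedDyn regularizer needs it when building the next round's loss.
        self.regularizer.save_state(TrainState.POST, loc_model=self.model)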

@myungjin myungjin (Contributor) left a comment

lgtm

@GustavBaumgart GustavBaumgart merged commit 7b4c99a into cisco-open:main Apr 7, 2023
@jaemin-shin jaemin-shin (Collaborator) left a comment

Overall LGTM!

Some thoughts on sharing data sizes from trainers:

  • Could sharing individual trainers' dataset sizes with aggregators be regarded as a potential privacy leak? If so, what FedDyn requires for setting its alpha value appropriately incurs a tradeoff between accuracy and privacy, which we may want to consider when investigating FedDyn further.
  • How should trainers' dataset sizes be shared when there are multiple aggregators or hierarchical aggregators? I think we do not need to implement this in this PR, but evaluating it further in our experiments with different configurations could be tricky.


task_load_data = Tasklet("", self.load_data)

task_get_dataset = Tasklet("", self.get, TAG_GET_DATATSET_SIZE)
Collaborator:

Maybe we should name it "task_get_dataset_size" instead of "task_get_dataset", as the latter could be misleading?
