
fix: distributed mode #344

Merged: 1 commit, Feb 23, 2023

Conversation

myungjin (Contributor)

Description

Distributed mode has a bug: deepcopy(self.weights) is called in _update_weights() before 'weights' is defined as a member variable. To address this issue, self.weights is now initialized in __init__().

Also, to run a distributed example locally, configuration files are revised.
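
The gist of the fix, as a minimal sketch (the class name and method bodies here are illustrative, not the actual flame code; only self.weights and _update_weights() come from this PR):

```python
import copy

class Trainer:  # illustrative stand-in for the distributed-mode class
    def __init__(self):
        # The fix: define the member variable up front, so any call to
        # _update_weights() finds an initialized attribute instead of
        # raising AttributeError on deepcopy(self.weights).
        self.weights = None

    def _update_weights(self):
        # Before the fix, this line could run before self.weights existed.
        return copy.deepcopy(self.weights)
```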

Type of Change

  • Bug Fix
  • New Feature
  • Breaking Change
  • Refactor
  • Documentation
  • Other (please describe)

Checklist

  • I have read the contributing guidelines
  • Existing issues have been referenced (where applicable)
  • I have verified this change is not present in other open pull requests
  • Functionality is documented
  • All code style checks pass
  • New code contribution is covered by automated tests
  • All new and existing tests pass

Distributed mode has a bug: deepcopy(self.weights) is called in
_update_weights() before 'weights' is defined as a member variable.
To address this issue, self.weights is initialized in __init__().

Also, to run a distributed example locally, configuration files are
revised.
@codecov-commenter

Codecov Report

Merging #344 (e1f9d14) into main (af56dae) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##             main     #344   +/-   ##
=======================================
  Coverage   15.04%   15.04%           
=======================================
  Files          48       48           
  Lines        2824     2824           
=======================================
  Hits          425      425           
  Misses       2381     2381           
  Partials       18       18           

@lkurija1 (Contributor) left a comment

lgtm
We might want to standardise which broker we use in our examples, and also decide whether or not to include mlflow.

@myungjin (Contributor, Author)

@elqurio Good point. I am thinking that examples in the library should use the dummy registry only. Setting up mlflow is not a focus of the library; that's related to infrastructure.

@myungjin myungjin merged commit 19c39eb into cisco-open:main Feb 23, 2023
@myungjin myungjin deleted the fix_dist branch February 23, 2023 17:33
openwithcode added a commit that referenced this pull request Mar 3, 2023
* optimizer compatibility with tensorflow and example for medmnist keras/pytorch (#320)

Tensorflow compatibility for new optimizers was added, which included fedavg, fedadam, fedadagrad, and fedyogi.

A shell script for testing all 8 possible combinations of optimizers and frameworks is included.
This allows the medmnist example to be run with keras (the folder structure was refactored to include a trainer and aggregator for keras).

The typo in fedavg.py has now been fixed.
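
For context, these optimizers build on FedAvg-style aggregation, which boils down to a data-weighted average of client weights; a framework-agnostic sketch (not the library's actual optimizer API):

```python
def fedavg(client_weights, sample_counts):
    """Simplified FedAvg sketch: data-weighted average of client weights.

    client_weights: per-client lists of arrays/tensors supporting * and +.
    sample_counts: number of training samples each client used.
    """
    total = sum(sample_counts)
    agg = [w * (sample_counts[0] / total) for w in client_weights[0]]
    for weights, count in zip(client_weights[1:], sample_counts[1:]):
        agg = [a + w * (count / total) for a, w in zip(agg, weights)]
    return agg
```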

* feat+fix: grpc support for hierarchical fl (#321)

Hierarchical fl didn't work with grpc as backend. This is because the
groupby field was not considered in the metaserver service and p2p
backend.

In addition, a middle aggregator hangs even after a job is
completed. This deadlock occurs because the p2p backend cleanup code is
called as a part of a channel cleanup. However, in a middle
aggregator, the p2p backend is responsible for tasks across all
channels. The p2p cleanup code couldn't finish because
a broadcast task in the other channel couldn't finish. This bug is
fixed here by moving the p2p backend cleanup code outside of the channel
cleanup code.
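
The shape of the fix can be sketched as follows (class, method, and attribute names are hypothetical; the point is the ordering, not the exact API):

```python
class MiddleAggregator:  # hypothetical names, illustrating the cleanup order
    def cleanup(self):
        # Per-channel cleanup no longer tears down the shared p2p backend;
        # doing so deadlocked while broadcast tasks on other channels were
        # still pending.
        for channel in self.channels:
            channel.cleanup()

        # The shared backend is cleaned up once, outside channel cleanup,
        # after all channels are done.
        self.p2p_backend.cleanup()
```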

* documentation for metaserver/mqtt local (#322)

Documentation for using metaserver allows users to run examples with a local broker.
It also covers local mqtt brokers.
This decreases the chances of any job ID collisions.

Modifications to the config.json for the mnist example were made in order to make it easier to switch to a local broker.
The readme now indicates how to do this for other examples.

Co-authored-by: vboxuser <vboxuser@Ubuntu.myguest.virtualbox.org>

* feat: asynchronous fl (#323)

Asynchronous FL is implemented for two-tier topology and three-tier
hierarchical topology.

The main algorithm is based on the following two papers:
- https://arxiv.org/pdf/2111.04877.pdf
- https://arxiv.org/pdf/2106.06639.pdf

Two examples for asynchronous fl are also added. One is for a two-tier
topology and the other for a three-tier hierarchical topology.

This implementation includes the core algorithm but doesn't include the
SecAgg algorithm (presented in the papers), which is out of the scope of
this change.
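
As a rough illustration of the aggregation style described in those papers, a staleness-discounted update can be sketched like this (a toy version; the papers define the actual weighting schemes and buffering):

```python
def async_update(global_weights, delta, staleness, base_rate=0.9):
    """Toy asynchronous-FL step: fold one trainer's delta into the global
    model, discounted by how stale the trainer's base version was."""
    rate = base_rate / (1.0 + staleness) ** 0.5  # polynomial staleness discount
    return [g + rate * d for g, d in zip(global_weights, delta)]
```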

* fix+refactor: asyncfl loss divergence (#330)

For asyncfl, a client (trainer) should send delta by subtracting local
weights from original global weights after training. In the current
implementation, the whole local weights were sent to a
server (aggregator). This causes loss divergence.

Supporting delta update requires refactoring of aggregators of
synchronous fl (horizontal/{top_aggregator.py, middle_aggregator.py})
as well as optimizers' do() function.

The changes here support delta update universally across all types of
modes (horizontal synchronous, asynchronous, and hybrid).
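
In sketch form, the delta update looks like this (names and sign convention are illustrative; the actual change touches the aggregators and optimizers noted above):

```python
def compute_delta(local_weights, global_weights):
    """Trainer side: send the difference from the original global weights
    instead of the raw local weights."""
    return [lw - gw for lw, gw in zip(local_weights, global_weights)]

def apply_delta(global_weights, avg_delta):
    """Aggregator side: fold the averaged delta back into the global model."""
    return [gw + d for gw, d in zip(global_weights, avg_delta)]
```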

* fix: conflict between integer tensor and float tensor (#335)

Model architectures can have integer tensors. Applying aggregation on
those tensors results in a type mismatch and throws a runtime error:
"RuntimeError: result type Float can't be cast to the desired output
type Long"

Integer tensors don't matter in back propagation. So, as a workaround
to the issue, we typecast to the original dtype when the original type
is different from the dtype of weighted tensors for aggregation. In
this way, we can keep the model architecture as is.

* refactor: config for hybrid example in library (#334)

To enable library-only execution for the hybrid example, its configuration
files are updated accordingly. The revised configuration has local
mqtt and p2p broker config, and the p2p broker is selected.

* misc: asynchronous hierarchical fl example (#340)

Since the Flame SDK supports asynchronous FL, we add an example of
asynchronous hierarchical FL for the control plane.

* chore: clean up examples folder (#336)

The examples folder at the top level directory has some outdated and
irrelevant files. Those are now removed from the folder.

* fix: workaround for hybrid mode with two p2p backends (#345)

Due to grpc/grpc#25364, when two p2p
backends (which rely on grpc and asyncio) are defined, the hybrid mode
example throws an exception: 'BlockingIOError: [Errno 35] Resource
temporarily unavailable'. The issue still appears unresolved. As a
temporary workaround, we use two different types of backends: mqtt for
one and p2p for the other. This means that when this example is
executed, both metaserver and an mqtt broker (e.g., mosquitto) must be
running on the local machine.

* fix: distributed mode (#344)

Distributed mode has a bug: deepcopy(self.weights) is called in
_update_weights() before 'weights' is defined as a member variable.
To address this issue, self.weights is initialized in __init__().

Also, to run a distributed example locally, configuration files are
revised.

* example/implementation for fedprox (#339)

This example is similar to the ones seen in the fedprox paper, although it currently does not simulate stragglers and uses a different dataset/architecture.

A few things were changed in order to provide a simple process for modifying trainers.
This includes a function in util.py and another class variable in the trainer containing information on the client-side regularizer.

Additionally, tests are automated (mu=1, 0.1, 0.01, 0.001, 0), so running the example generates or modifies existing files in order to provide the proper configuration for an experiment.
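
For reference, the client-side regularizer from the fedprox paper is the proximal term (mu / 2) * ||w - w_global||^2 added to the local loss; a PyTorch sketch (names are illustrative):

```python
import torch

def fedprox_term(model, global_weights, mu):
    """Proximal regularizer added to the local loss: (mu / 2) * ||w - w_global||^2."""
    prox = 0.0
    for w, gw in zip(model.parameters(), global_weights):
        prox = prox + torch.sum((w - gw) ** 2)
    return 0.5 * mu * prox
```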

* Create diagnose script (#348)

* Create diagnose script

* Make the script executable

---------

Co-authored-by: Alex Ungurean <aungurea@cisco.com>

* refactor+fix: configurable deployer / lib regularizer fix (#351)

The deployer's job template file is hard-coded, which makes it hard to use
a different template file at deployment time. Using a different
template file is useful when the underlying infrastructure is
different (e.g., k8s vs knative). To support that, the template folder and
file are fed as config variables.

Also, the deployer's config info is fed as a command argument, which is
cumbersome. So, the config parsing part is refactored such that the
info is fed as a configuration file.

During the testing of the deployer change, a bug in the library
was identified. The fix for it is added here too.

Finally, the local dns configuration in flame.sh is updated so that it
works correctly across different linux distributions (e.g.,
archlinux and ubuntu); flame.sh was tested under archlinux and ubuntu.

* Add missing merge fix

* Make sdk config backwards compatible. (#355)

---------

Co-authored-by: GustavBaumgart <98069699+GustavBaumgart@users.noreply.github.com>
Co-authored-by: Myungjin Lee <myungjin@users.noreply.github.com>
Co-authored-by: vboxuser <vboxuser@Ubuntu.myguest.virtualbox.org>
Co-authored-by: alexandruuBytex <56033021+alexandruuBytex@users.noreply.github.com>
Co-authored-by: Alex Ungurean <aungurea@cisco.com>
Co-authored-by: elqurio <119978637+elqurio@users.noreply.github.com>
openwithcode added a commit that referenced this pull request Mar 3, 2023

* fix: out of order message delivery (#354)

In order to minimize the blocking of asynchronous I/O tasks when a
large number of message chunks (e.g., when the model size is large) need
to be assembled, threading is used. This led to a bug wherein messages are
delivered to a receive queue in an out-of-order fashion. For example,
consider two back-to-back messages where the first message is larger than
the second. In this case, the second can be inserted before the first
because it has a higher chance of finishing assembly first.

The changes here fix the bug. To do so, we leverage another
queue to synchronize message delivery order. For that, a chunk manager
class is introduced. One chunk manager is created per backend, and
it maintains a list of chunk threads per end, each of which is
responsible for assembling messages sequentially and pushing them into
the receive queue of the end.
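
A minimal sketch of the per-end ordering mechanism (class and attribute names are illustrative, not the actual backend code):

```python
import queue
import threading

class ChunkThread(threading.Thread):
    """Assemble one end's messages sequentially, so they reach the receive
    queue in send order even when a later, smaller message would otherwise
    finish assembling first."""

    def __init__(self, recv_queue):
        super().__init__(daemon=True)
        self.pending = queue.Queue()  # (msg_id, chunk list) in arrival order
        self.recv_queue = recv_queue

    def run(self):
        while True:
            msg_id, chunks = self.pending.get()
            message = b"".join(chunks)              # assemble sequentially
            self.recv_queue.put((msg_id, message))  # in-order delivery
```

A chunk-manager-like object would own one such thread per end and route incoming chunks to it.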

* gpu and cpu compatibility pytorch (#350)

This modification will allow trainers/aggregators to work on different devices across different machines.

Basically, all weights are communicated by placing them on the CPU before serializing them.
Then, they are moved back to the device where they were previously.

All trainers now require a self.model attribute and all middle aggregators must only use the CPU (not enforced).
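
The device handling can be sketched like this (a hedged illustration; the library's actual serialization path may differ):

```python
import io
import torch

def serialize_for_send(model):
    """Move all weights to the CPU before serializing, so any receiver can
    load them regardless of its own device."""
    cpu_state = {k: v.cpu() for k, v in model.state_dict().items()}
    buffer = io.BytesIO()
    torch.save(cpu_state, buffer)
    return buffer.getvalue()

def load_after_receive(model, payload, device):
    """Load the CPU weights, then move the model back to the local device."""
    state = torch.load(io.BytesIO(payload), map_location="cpu")
    model.load_state_dict(state)
    model.to(device)
```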

lkurija1 mentioned this pull request Mar 3, 2023
myungjin added a commit to myungjin/flame that referenced this pull request Mar 3, 2023

myungjin added a commit to myungjin/flame that referenced this pull request Mar 3, 2023

myungjin added a commit to myungjin/flame that referenced this pull request Mar 3, 2023

myungjin added a commit that referenced this pull request Mar 6, 2023
* Add group associations to roles (#319)

* Sync up the generated code from openapi generator with what we have currently (#331)

* applied formatting (#341)

* pre-commit setup and dev reqs (#342)

* Refactor config handling with pydantic (#332)

* Make sdk config backwards compatible. (#355)

* Fix merge conflicts between development and main branch (#353)

* refactor: end-to-end refactoring (#360)

* refactor: end-to-end refactoring

The development branch is not yet fully tested; hence, it contains several
incompatibilities and bugs. The following issues are handled:

(1) func tag parsing in config.py (sdk): the config module has a small
bug in which the parsed func tags are not populated in the Channel class
instance.

(2) design and schema creation failure (control plane): the "name" field
in design and the "version" field in schema are not used all the time. But
they are specified as "required" fields, which causes errors during the
assertion check on these fields in the openapi code.

(3) hyperparameter update failure in mlflow (sdk): hyperparameters are
no longer passed as a dictionary, which is the format mlflow expects (see
the sketch after this list).

(4) library update for new examples - asyncfl and fedprox (sdk):
asyncfl and fedprox algorithms and examples were introduced outside
the development branch, which caused compatibility issues.

(5) control plane example update (control plane): all the example code
in the control plane is outdated because of configuration parsing
module changes.

(6) README file update in adult and mnist_non_orchestration_mode
examples (doc): these two examples are for non-orchestration
mode. They will be deprecated. So, a note is added to their README
file.
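
For item (3), the fix amounts to handing mlflow a plain dict again; a sketch with illustrative values:

```python
import mlflow

hyperparameters = {"rounds": 5, "epochs": 1, "learning_rate": 0.01}  # illustrative
# mlflow.log_params() expects a plain dict of hyperparameters.
mlflow.log_params(hyperparameters)
```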

* Update lib/python/flame/registry/mlflow.py

Co-authored-by: elqurio <119978637+elqurio@users.noreply.github.com>

---------

Co-authored-by: openwithcode <123649857+openwithcode@users.noreply.github.com>
Co-authored-by: elqurio <119978637+elqurio@users.noreply.github.com>

---------

Co-authored-by: elqurio <119978637+elqurio@users.noreply.github.com>
Co-authored-by: GustavBaumgart <98069699+GustavBaumgart@users.noreply.github.com>
Co-authored-by: Myungjin Lee <myungjin@users.noreply.github.com>
Co-authored-by: vboxuser <vboxuser@Ubuntu.myguest.virtualbox.org>
Co-authored-by: alexandruuBytex <56033021+alexandruuBytex@users.noreply.github.com>
Co-authored-by: Alex Ungurean <aungurea@cisco.com>