Add FedMLB baseline #2340

Merged
merged 91 commits into from
Oct 10, 2023
@alessiomora alessiomora commented Sep 11, 2023

Issue

Implementation of FedMLB for the SoR initiative.

Description

Implementation of FedMLB for the SoR initiative.

Related issues/PRs

Issue #2048

@jafermarq jafermarq added the summer-of-reproducibility About a baseline for Summer of Reproducibility label Sep 11, 2023
@jafermarq jafermarq left a comment

with previous review.

baselines/fedmlb/pyproject.toml (outdated, resolved)
alessiomora and others added 10 commits September 27, 2023 11:56
Co-authored-by: Javier <jafermarq@users.noreply.github.com>
Co-authored-by: Javier <jafermarq@users.noreply.github.com>
Co-authored-by: Javier <jafermarq@users.noreply.github.com>
Co-authored-by: Javier <jafermarq@users.noreply.github.com>
Co-authored-by: Javier <jafermarq@users.noreply.github.com>
Co-authored-by: Javier <jafermarq@users.noreply.github.com>
```bash
python -m fedmlb.dataset_preparation dataset_config.alpha_dirichlet=0.6 total_clients=500
```
Note that, to reproduce those settings, we leverage the `.txt` files
contained in the `client_data` folder in this project. Such files store

Any chance we could remove the client_data files from the directory? (They are ~7MB in total.) Is there an obvious way of constructing those files via a not-too-complex script? We can naturally ask people to git-clone them from the original repo you mention below, but that might not always be reliable.

@alessiomora alessiomora Sep 29, 2023

Those files could be constructed via a script that uses a Dirichlet distribution, just as in the original paper. The files you mention contain the IDs of the images assigned to each client: a Dirichlet distribution with a specific concentration parameter determines how many images of each label a client receives, and that number of images is then selected at random from the pool of images with that label. However, if you re-run such a script, you would not be able to reproduce those specific per-client dataset compositions unless you know the seed used for the pseudo-random number generation (and are probably running on the same machine).

For this reason, for reproducibility purposes, I decided to compose the clients' datasets exactly as they were crafted in the original paper.

So, in principle, I can produce a script that generates the composition of datasets (basically the .txt files) that follows a Dirichlet distribution of labels among clients with a certain concentration parameter (but datasets would be different from the original code), or I can find a better way of storing the data contained in the files under client_data.

In the original code, you can find those generation scripts here.
For now, I've deleted some unused .txt files from the folder.
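
For reference, the kind of generation script discussed in this thread could look roughly like the sketch below. It is only an illustration under stated assumptions, not the original paper's code: `dirichlet_partition`, its parameters, and the toy label array are all hypothetical, and image IDs are drawn without replacement across clients, which may differ from the original implementation.

```python
import numpy as np


def dirichlet_partition(labels, num_clients, samples_per_client, alpha, seed):
    """Return one array of example IDs per client (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    num_classes = int(labels.max()) + 1
    # Shuffled pool of still-unassigned example IDs for each label.
    pools = [
        list(rng.permutation(np.flatnonzero(labels == c)))
        for c in range(num_classes)
    ]
    clients = []
    for _ in range(num_clients):
        # Label mix for this client, drawn from Dirichlet(alpha, ..., alpha).
        proportions = rng.dirichlet(alpha * np.ones(num_classes))
        # Turn the mix into an integer number of images per label.
        counts = rng.multinomial(samples_per_client, proportions)
        ids = []
        for c, n in enumerate(counts):
            take = min(int(n), len(pools[c]))  # a label pool may run dry
            ids.extend(pools[c][:take])
            del pools[c][:take]
        clients.append(np.array(sorted(ids), dtype=int))
    return clients


# Toy data (assumption): 10 labels with 100 images each.
labels = np.repeat(np.arange(10), 100)
a = dirichlet_partition(labels, num_clients=5, samples_per_client=50, alpha=0.3, seed=0)
b = dirichlet_partition(labels, num_clients=5, samples_per_client=50, alpha=0.3, seed=0)
```

With the same seed, `a` and `b` hold identical per-client ID lists on the same NumPy version; a different seed gives a different composition, which is why the stored `.txt` files are needed to match the original paper exactly. Each array could be written to a `.txt` file with `np.savetxt(path, ids, fmt="%d")`.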

Ok. I'll think about this and discuss with the others in the team. Now the files are 4MB (down from 7MB) so that's nice to see. There are other baselines that also have some not-so-small files as part of their proposed PR, so I'll update this thread once I figure out what's the best way to deal with these. Maybe keeping them is fine. Let's see...

@jafermarq jafermarq left a comment

Hi @alessiomora ,

Just a small comment for the pyproject.toml. I also enabled the tests but a small formatting issue was flagged.

baselines/fedmlb/pyproject.toml (outdated, resolved)
jafermarq previously approved these changes Oct 10, 2023
@jafermarq jafermarq left a comment

Looks great!

@jafermarq jafermarq changed the title FedMLB Add FedMLB baseline Oct 10, 2023
@jafermarq jafermarq merged commit 4f9ce5c into adap:main Oct 10, 2023
26 checks passed
Labels
summer-of-reproducibility About a baseline for Summer of Reproducibility
3 participants