
[BUG] Flame with grpc backend fails to perform weights aggregation for ResNet50 Model #233

Closed
gauravgadewar36 opened this issue Sep 20, 2022 · 12 comments · Fixed by #237

@gauravgadewar36

Describe the bug
When the job is deployed with the p2p backend, it fails for a larger model like ResNet50 (deployed with 2 trainers and 1 aggregator) but works for smaller ones like MobileNetV2 (14 MB). Model aggregation fails with a live check timeout exception.

For some reason, even after setting the log level to DEBUG, the trainer pods do not output any debug logs.

To Reproduce
Steps to reproduce the behavior:

1] Change the model in the medmnist example to ResNet50.
2] Deploy the job with the backend set to p2p (a sketch of the model change follows this list).
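
For step 1, a minimal sketch of the model swap, assuming the example's trainer is PyTorch-based (as the traceback later in this thread suggests) and that the model is built in trainer/main.py; the function name and class count below are illustrative, not part of the original example:

# Hedged sketch: replace the medmnist example's model with ResNet50.
# PathMNIST has 9 classes; adjust num_classes for a different MedMNIST split.
import torch.nn as nn
from torchvision.models import resnet50

def build_model(num_classes: int = 9) -> nn.Module:
    model = resnet50(weights=None)  # train from scratch; older torchvision uses pretrained=False
    # Swap the ImageNet classifier head for one sized to the dataset.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

For step 2, the backend (mqtt vs. p2p) is selected in the job's configuration rather than in the trainer code, so no further Python change should be needed for that step.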

Expected behavior
The aggregator should have performed weights aggregation successfully after receiving model weights from the trainers.

Additional context
Error Logs below:

2022-09-16 17:57:27,719 | p2p.py:515 | DEBUG | Thread-2 | _check | live check timeout occured for 9548bcbf7e3cda24321f90bb6a1f3d59abdbf100
2022-09-16 17:57:27,719 | top_aggregator.py:114 | DEBUG | MainThread | _aggregate_weights | No data received from 9548bcbf7e3cda24321f90bb6a1f3d59abdbf100
2022-09-16 17:57:27,720 | fedavg.py:50 | DEBUG | MainThread | do | calling fedavg
2022-09-16 17:57:27,720 | top_aggregator.py:134 | DEBUG | MainThread | _aggregate_weights | failed model aggregation
2022-09-16 17:57:27,981 | p2p.py:223 | DEBUG | Thread-2 | _register_channel | meta_resp from heart beat: status: SUCCESS
2022-09-16 17:57:28,722 | top_aggregator.py:179 | DEBUG | MainThread | run_analysis | running analyzer plugins
2022-09-16 17:57:28,722 | top_aggregator.py:193 | DEBUG | MainThread | save_metrics | saving metrics: {}
2022-09-16 17:57:28,722 | top_aggregator.py:200 | DEBUG | MainThread | increment_round | Incrementing current round: 1

aggregator.logs.log
trainer.logs.log
medmnist.zip

@myungjin myungjin self-assigned this Sep 20, 2022
myungjin commented Sep 21, 2022

@gauravgadewar36 BTW, what is your environment (e.g., OS, CPU, memory, etc.)?

gauravgadewar36 commented Sep 21, 2022

@myungjin It is deployed on a Kubernetes cluster with one master and 5 worker nodes.
Each node runs Ubuntu 20.04 and has 20 CPU cores with 65 GB of memory.

@myungjin

@gauravgadewar36 While debugging the issue, I found out that the trainer code throws an exception.

ERROR: Uncaught exception:
Traceback (most recent call last):
  File "/Users/myungjle/Downloads/medmnist/trainer/main.py", line 194, in <module>
    t.run()
  File "/Users/myungjle/.pyenv/versions/3.9.6/lib/python3.9/site-packages/flame/mode/horizontal/trainer.py", line 179, in run
    self.composer.run()
  File "/Users/myungjle/.pyenv/versions/3.9.6/lib/python3.9/site-packages/flame/mode/composer.py", line 100, in run
    tasklet.do()
  File "/Users/myungjle/.pyenv/versions/3.9.6/lib/python3.9/site-packages/flame/mode/tasklet.py", line 128, in do
    self.func(*self.args)
  File "/Users/myungjle/Downloads/medmnist/trainer/main.py", line 101, in load_data
    train_dataset = PathMNISTDataset(split='train',
  File "/Users/myungjle/Downloads/medmnist/trainer/main.py", line 28, in __init__
    self.imgs = npz_file['train_images']
  File "/Users/myungjle/.pyenv/versions/3.9.6/lib/python3.9/site-packages/numpy/lib/npyio.py", line 249, in __getitem__
    raise KeyError("%s is not a file in the archive" % key)
KeyError: 'train_images is not a file in the archive'
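
This KeyError suggests the npz archive the trainer loaded does not contain the expected arrays (for example, an incomplete or corrupted download). A quick, hedged way to check what the archive actually holds (the file name below is illustrative; point it at whatever npz the PathMNISTDataset class loads):

# Hedged sketch: list the arrays stored in the MedMNIST npz archive to
# confirm that 'train_images' is actually present.
import numpy as np

npz_file = np.load("pathmnist.npz")
print(npz_file.files)  # expected to include 'train_images', 'train_labels', etc.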

Can you provide a running example with the mqtt backend specified? If you can provide one, I can change the backend to p2p and test the issue further.

Can you also pull https://github.com/myungjin/flame/tree/event_driven_approach again (I further optimized some other code since Friday) and try again?

@gauravgadewar36

@myungjin you can use the same example for the mqtt backend.
I ran the same example with 2 trainer pods and the ResNet50 model; it works for mqtt but hangs for p2p.
I will pull the latest changes and try running on them.

@myungjin

> @myungjin you can use the same example for the mqtt backend.
> I ran the same example with 2 trainer pods and the ResNet50 model; it works for mqtt but hangs for p2p.
> I will pull the latest changes and try running on them.

The error I got is coming from the PathMNISTDataset class. Haven't you seen the above error before?

gauravgadewar36 commented Sep 27, 2022

@myungjin No, I have not seen that error.
For me, the trainer pods never output any debug logs; maybe I will see it in the latest version.

@myungjin

> @myungjin No, I have not seen that error.
> For me, the trainer pods never output any debug logs; maybe I will see it in the latest version.

Okay. Thanks. Please run the latest branch version and let me know the outcome.

@myungjin

@gauravgadewar36 I can reproduce the bug. I will continue to debug.

@gauravgadewar36

@myungjin Apart from setting LOG_LEVEL to DEBUG in job-agent.yaml.mustache, do I need to do anything else to get trainer pod debug logs?

@myungjin

@gauravgadewar36 I fixed this issue. I updated https://github.com/myungjin/flame/tree/event_driven_approach. Can you try it again?

@gauravgadewar36

@myungjin Will try and let you know the status.
Thanks !!

@gauravgadewar36

@myungjin It is working properly.

@myungjin myungjin added the bug Something isn't working label Sep 28, 2022
@myungjin myungjin linked a pull request Sep 28, 2022 that will close this issue