
[BUG] Flame with grpc backend fails to perform weights aggregation for ResNet50 Model #233

Closed
gauravgadewar36 opened this issue Sep 20, 2022 · 12 comments · Fixed by #237

@gauravgadewar36

Describe the bug
When the job is deployed with the p2p backend, it fails for a larger model like ResNet50 (deployed with 2 trainers and 1 aggregator) but works for smaller ones like MobileNetV2 (14 MB). Model aggregation fails with a live check timeout exception.

For some reason, even after setting the log level to DEBUG, the trainer pods do not output any debug logs.

To Reproduce
Steps to reproduce the behavior:

1] Change the model in the medmnist example to ResNet50.
2] Deploy the job with the backend set to p2p (a sketch of the model change follows this list).
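
For step 1, a minimal sketch of the model swap, assuming the example's trainer is PyTorch-based (as the traceback later in this thread suggests) and that the model is built in trainer/main.py; the function name and class count below are illustrative, not part of the original example:

# Hedged sketch: replace the medmnist example's model with ResNet50.
# PathMNIST has 9 classes; adjust num_classes for a different MedMNIST split.
import torch.nn as nn
from torchvision.models import resnet50

def build_model(num_classes: int = 9) -> nn.Module:
    model = resnet50(weights=None)  # train from scratch; older torchvision uses pretrained=False
    # Swap the ImageNet classifier head for one sized to the dataset.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

For step 2, the backend (mqtt vs. p2p) is selected in the job's configuration rather than in the trainer code, so no further Python change should be needed for that step.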

Expected behavior
The aggregator should have performed weights aggregation successfully after receiving model weights from the trainers.

Additional context
Error Logs below:

2022-09-16 17:57:27,719 | p2p.py:515 | DEBUG | Thread-2 | _check | live check timeout occured for 9548bcbf7e3cda24321f90bb6a1f3d59abdbf100
2022-09-16 17:57:27,719 | top_aggregator.py:114 | DEBUG | MainThread | _aggregate_weights | No data received from 9548bcbf7e3cda24321f90bb6a1f3d59abdbf100
2022-09-16 17:57:27,720 | fedavg.py:50 | DEBUG | MainThread | do | calling fedavg
2022-09-16 17:57:27,720 | top_aggregator.py:134 | DEBUG | MainThread | _aggregate_weights | failed model aggregation
2022-09-16 17:57:27,981 | p2p.py:223 | DEBUG | Thread-2 | _register_channel | meta_resp from heart beat: status: SUCCESS
2022-09-16 17:57:28,722 | top_aggregator.py:179 | DEBUG | MainThread | run_analysis | running analyzer plugins
2022-09-16 17:57:28,722 | top_aggregator.py:193 | DEBUG | MainThread | save_metrics | saving metrics: {}
2022-09-16 17:57:28,722 | top_aggregator.py:200 | DEBUG | MainThread | increment_round | Incrementing current round: 1

aggregator.logs.log
trainer.logs.log
medmnist.zip

@myungjin myungjin self-assigned this Sep 20, 2022
myungjin commented Sep 21, 2022

@gauravgadewar36 BTW, what is your environment (e.g., OS, CPU, memory, etc.)?

gauravgadewar36 commented Sep 21, 2022

@myungjin It is deployed on a Kubernetes cluster with one master and 5 worker nodes.
Each node runs Ubuntu 20.04 and has 20 CPU cores with 65 GB of memory.

@myungjin

@gauravgadewar36 While debugging the issue, I found out that the trainer code throws an exception.

ERROR: Uncaught exception:
Traceback (most recent call last):
  File "/Users/myungjle/Downloads/medmnist/trainer/main.py", line 194, in <module>
    t.run()
  File "/Users/myungjle/.pyenv/versions/3.9.6/lib/python3.9/site-packages/flame/mode/horizontal/trainer.py", line 179, in run
    self.composer.run()
  File "/Users/myungjle/.pyenv/versions/3.9.6/lib/python3.9/site-packages/flame/mode/composer.py", line 100, in run
    tasklet.do()
  File "/Users/myungjle/.pyenv/versions/3.9.6/lib/python3.9/site-packages/flame/mode/tasklet.py", line 128, in do
    self.func(*self.args)
  File "/Users/myungjle/Downloads/medmnist/trainer/main.py", line 101, in load_data
    train_dataset = PathMNISTDataset(split='train',
  File "/Users/myungjle/Downloads/medmnist/trainer/main.py", line 28, in __init__
    self.imgs = npz_file['train_images']
  File "/Users/myungjle/.pyenv/versions/3.9.6/lib/python3.9/site-packages/numpy/lib/npyio.py", line 249, in __getitem__
    raise KeyError("%s is not a file in the archive" % key)
KeyError: 'train_images is not a file in the archive'
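
This KeyError suggests the npz archive the trainer loaded does not contain the expected arrays (for example, an incomplete or corrupted download). A quick, hedged way to check what the archive actually holds (the file name below is illustrative; point it at whatever npz the PathMNISTDataset class loads):

# Hedged sketch: list the arrays stored in the MedMNIST npz archive to
# confirm that 'train_images' is actually present.
import numpy as np

npz_file = np.load("pathmnist.npz")
print(npz_file.files)  # expected to include 'train_images', 'train_labels', etc.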

Can you provide a running example with the mqtt backend specified? If you can provide one, I can change the backend to p2p and test the issue further.

Can you also pull https://github.com/myungjin/flame/tree/event_driven_approach again (I further optimized some other code since Friday) and try again?

@gauravgadewar36

@myungjin you can use the same example for the mqtt backend.
I ran the same example with 2 trainer pods and the ResNet50 model; it works for mqtt but hangs for p2p.
I will pull the latest changes and try running on them.

@myungjin

> @myungjin you can use the same example for the mqtt backend.
> I ran the same example with 2 trainer pods and the ResNet50 model; it works for mqtt but hangs for p2p.
> I will pull the latest changes and try running on them.

The error I got is coming from the PathMNISTDataset class. Haven't you seen the above error before?

gauravgadewar36 commented Sep 27, 2022

@myungjin No, I have not seen that error.
For me, the trainer pods never output any debug logs; maybe I will see it in the latest version.

@myungjin

> @myungjin No, I have not seen that error.
> For me, the trainer pods never output any debug logs; maybe I will see it in the latest version.

Okay. Thanks. Please run the latest branch version and let me know the outcome.

@myungjin

@gauravgadewar36 I can reproduce the bug. I will continue to debug.

@gauravgadewar36

@myungjin Apart from setting LOG_LEVEL to DEBUG in job-agent.yaml.mustache, do I need to do anything else to get trainer pod debug logs?

@myungjin

@gauravgadewar36 I fixed this issue. I updated https://github.com/myungjin/flame/tree/event_driven_approach. Can you try it again?

@gauravgadewar36

@myungjin Will try and let you know the status.
Thanks !!

@gauravgadewar36

@myungjin It is working properly.

@myungjin myungjin added the bug Something isn't working label Sep 28, 2022
@myungjin myungjin linked a pull request Sep 28, 2022 that will close this issue