[BUG] Flame with grpc backend fails to perform weights aggregation for ResNet50 Model #233
@gauravgadewar36 BTW, what are your environments (e.g., OS, CPU, memory, etc.)?
@myungjin It is deployed on a Kubernetes cluster with one master and 5 worker nodes.
@gauravgadewar36 While debugging the issue, I found out that the trainer code throws an exception.
Can you provide a running example with the mqtt backend specified? If you can provide one, I can change the backend to p2p and test the issue further. Can you also pull https://github.com/myungjin/flame/tree/event_driven_approach again (I further optimized some other code since Friday) and try again?
@myungjin You can use the same example for the mqtt backend.
The error I got is coming from the PathMNISTDataset class. Haven't you seen the above error before?
@myungjin No, I have not seen that error.
Okay. Thanks. Please run the latest branch version and let me know the outcome.
@gauravgadewar36 I can reproduce the bug. I will continue to debug.
@myungjin Apart from setting LOG_LEVEL to DEBUG in job-agent.yaml.mustache, do I need to do anything else to get trainer pod debug logs?
@gauravgadewar36 I fixed this issue. I updated https://github.com/myungjin/flame/tree/event_driven_approach. Can you try it again?
@myungjin Will try and let you know the status.
@myungjin It is working properly.
Describe the bug
When a job is deployed with the p2p backend, it fails for larger models like ResNet50 (deployed with 2 trainers and 1 aggregator) but works for smaller ones like MobileNetV2 (14 MB). Model aggregation fails with the exception "live check timeout occurred".
For some reason, even after setting the log level to DEBUG, trainer pods do not output debug logs.
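A back-of-the-envelope payload estimate illustrates the size gap between the two models. This is an illustration only, not flame code; the parameter counts are approximate figures for the standard torchvision variants:

```python
# Rough serialized-weight payload estimate, assuming float32 parameters.
# Approximate parameter counts (torchvision): ResNet50 ~25.6M, MobileNetV2 ~3.5M.
def payload_mb(num_params: int, bytes_per_param: int = 4) -> float:
    """Return the approximate weight payload size in MiB."""
    return num_params * bytes_per_param / (1024 * 1024)

resnet50_mb = payload_mb(25_600_000)    # roughly 98 MiB
mobilenetv2_mb = payload_mb(3_500_000)  # roughly 13 MiB
```

The ResNet50 payload is roughly 7x larger, which is consistent with the smaller model succeeding while the larger one hits the live-check timeout.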
To Reproduce
Steps to reproduce the behavior:
1] Change the model in the medmnist example to ResNet50.
2] Deploy the job with the p2p backend.
Expected behavior
The aggregator should have performed weights aggregation successfully after receiving model weights from the trainers.
Additional context
Error Logs below:
2022-09-16 17:57:27,719 | p2p.py:515 | DEBUG | Thread-2 | _check | live check timeout occured for 9548bcbf7e3cda24321f90bb6a1f3d59abdbf100
2022-09-16 17:57:27,719 | top_aggregator.py:114 | DEBUG | MainThread | _aggregate_weights | No data received from 9548bcbf7e3cda24321f90bb6a1f3d59abdbf100
2022-09-16 17:57:27,720 | fedavg.py:50 | DEBUG | MainThread | do | calling fedavg
2022-09-16 17:57:27,720 | top_aggregator.py:134 | DEBUG | MainThread | _aggregate_weights | failed model aggregation
2022-09-16 17:57:27,981 | p2p.py:223 | DEBUG | Thread-2 | _register_channel | meta_resp from heart beat: status: SUCCESS
2022-09-16 17:57:28,722 | top_aggregator.py:179 | DEBUG | MainThread | run_analysis | running analyzer plugins
2022-09-16 17:57:28,722 | top_aggregator.py:193 | DEBUG | MainThread | save_metrics | saving metrics: {}
2022-09-16 17:57:28,722 | top_aggregator.py:200 | DEBUG | MainThread | increment_round | Incrementing current round: 1
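For context, gRPC channels default to a 4 MB maximum message size, which a weight payload in the ~100 MB range would far exceed. Whether flame's grpc backend exposes these channel options is an assumption on my part; below is a generic grpcio-style sketch of raising the limits (the actual channel creation is commented out, since the target address is hypothetical):

```python
# Hypothetical sketch: raising gRPC's default 4 MB message size cap.
# The option names are from the public grpcio API; whether flame's
# grpc/p2p backend lets callers set them is an assumption.
MAX_MSG_BYTES = 256 * 1024 * 1024  # 256 MiB of headroom for large models

GRPC_OPTIONS = [
    ("grpc.max_send_message_length", MAX_MSG_BYTES),
    ("grpc.max_receive_message_length", MAX_MSG_BYTES),
]

# import grpc
# channel = grpc.insecure_channel("aggregator:50051", options=GRPC_OPTIONS)
```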
aggregator.logs.log
trainer.logs.log
medmnist.zip