
Error during training #25

Closed
theodupuis opened this issue Aug 19, 2021 · 7 comments

Comments

@theodupuis

Bug: during the training phase, the following error is raised:
File "/anaconda/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py", line 161, in scale
assert outputs.is_cuda or outputs.device.type == 'xla'
AssertionError
Exception ignored in: <function tqdm.__del__ at 0x7f9ba338de50>
Traceback (most recent call last):
File "/anaconda/lib/python3.8/site-packages/tqdm/std.py", line 1145, in __del__
File "/anaconda/lib/python3.8/site-packages/tqdm/std.py", line 1299, in close
File "/anaconda/lib/python3.8/site-packages/tqdm/std.py", line 1492, in display
File "/anaconda/lib/python3.8/site-packages/tqdm/std.py", line 1148, in __str__
File "/anaconda/lib/python3.8/site-packages/tqdm/std.py", line 1450, in format_dict
TypeError: cannot unpack non-iterable NoneType object

Environment
Environment set up from source, not Docker.
Cmd: nndet_train 1000 --sweep

It seems the issue is related to TensorMetric not being moved to the CUDA device, the same issue as addressed in Lightning-AI/pytorch-lightning#2274.
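For context, a minimal repro sketch of the assertion itself (not from the original run; it assumes a CUDA-capable machine and a PyTorch version comparable to the one in the traceback): torch.cuda.amp.GradScaler.scale only accepts tensors on a CUDA (or XLA) device, so a loss or metric tensor that ended up on the CPU triggers exactly the AssertionError above.

import torch

# On a machine with CUDA available, GradScaler is enabled and scale() asserts
# that the tensor it receives lives on a CUDA (or XLA) device.
scaler = torch.cuda.amp.GradScaler()

cpu_loss = torch.tensor(1.0, requires_grad=True)  # tensor that stayed on the CPU

try:
    scaler.scale(cpu_loss)
except AssertionError as err:
    # Same "assert outputs.is_cuda or outputs.device.type == 'xla'" as in the traceback.
    print("AssertionError from GradScaler.scale:", err)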

@theodupuis
Author

This has been addressed in recent commits.

@mibaumgartner
Collaborator

Hi @theodupuis,

I'll look into a better solution for this one.

The temporary fix is to set move_metrics_to_cpu=False, but I'm not really happy with that. If you encounter any memory leaks, set it back to True and downgrade lightning for now.
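For reference, move_metrics_to_cpu is a PyTorch Lightning Trainer flag; nnDetection builds its Trainer internally, so the following is only a minimal Lightning sketch of where such a flag would go (the gpus/precision values are assumptions), not the actual nnDetection code path:

import pytorch_lightning as pl

# Minimal illustrative Trainer; the relevant part is move_metrics_to_cpu.
# False keeps metrics on the GPU (so no CPU tensor reaches GradScaler.scale),
# True lowers GPU memory use at the cost of the issue described above.
trainer = pl.Trainer(
    gpus=1,
    precision=16,
    move_metrics_to_cpu=False,
)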

Best,
Michael

@theodupuis
Author

theodupuis commented Aug 19, 2021 via email

@theodupuis
Author

theodupuis commented Aug 20, 2021 via email

@mibaumgartner
Collaborator

Hi,

the training time of nnDetection should be roughly the same for most data sets (there are some exceptions): about 2 days with the mixed-precision 3D-conv speedup and about 4 days without it. Your time sounds quite slow, though. Generally speaking, there are two possible reasons:

  1. PyTorch < 1.9 did not provide the training speedup for mixed-precision 3D convs in its pip-installable version, so it was necessary to build it from source. I haven't tested PyTorch 1.9 yet. (The Docker build of nnDetection also provides the speedup.)

  2. There is a bottleneck in your configuration/setup. This can be identified as follows (see also the sketch after this list):
    - Check the GPU utilization: it should be high most of the time. If it isn't, there is either a CPU or an IO bottleneck; if it is high, the missing PyTorch speedup is the cause.
    - Check the CPU utilization: if it is high (and the GPU utilization isn't), more CPU threads are needed for augmentation (can be adjusted via det_num_threads and depends on your CPU).
    - If both GPU and CPU utilization are low, it is an IO bottleneck; it is quite hard to do anything about this (a typical SSD with ~500 MB/s read speed ran fine for my experiments).
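A minimal monitoring sketch for these checks, assuming nvidia-smi is on the PATH and psutil is installed (neither is shipped with nnDetection); run it in a second terminal while training:

import subprocess
import time

import psutil  # third-party: pip install psutil

def gpu_util_percent() -> int:
    # Current utilization of the first GPU, as reported by nvidia-smi.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"]
    )
    return int(out.decode().strip().splitlines()[0])

while True:
    cpu = psutil.cpu_percent(interval=1.0)  # averaged over one second
    gpu = gpu_util_percent()
    print(f"CPU {cpu:5.1f} %  |  GPU {gpu:3d} %")
    time.sleep(4)

If the CPU turns out to be the bottleneck, det_num_threads is set as an environment variable for the training call, e.g. det_num_threads=12 nndet_train 1000 --sweep (the exact value depends on your CPU).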

Best,
Michael

@theodupuis
Author

theodupuis commented Aug 20, 2021 via email

@mibaumgartner
Collaborator

Hi @theodupuis,

during the training process only a single model is trained. The empirical parameters refer to several postprocessing parameters (e.g. the IoU threshold for NMS and the IoU threshold for Weighted Box Clustering) which do not require additional models (it is not a classical AutoML approach where models are trained several times). Those parameters are optimized by empirically trying them on the validation data.
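As a simplified illustration of that empirical sweep (a sketch only, not nnDetection's actual code; the prediction layout and the metric below are placeholders), each candidate threshold is applied to the validation predictions and the best-scoring value is kept:

import torch
from torchvision.ops import nms

def val_metric(boxes, scores):
    # Placeholder metric; nnDetection scores its sweep with proper detection metrics.
    return scores.mean().item() if len(scores) else 0.0

# One dummy validation image: boxes in [x1, y1, x2, y2] format plus confidences.
xy = torch.rand(20, 2) * 80
boxes = torch.cat([xy, xy + torch.rand(20, 2) * 20 + 1], dim=1)
scores = torch.rand(20)
val_predictions = [(boxes, scores)]

best_iou, best_score = None, float("-inf")
for iou_thresh in (0.3, 0.4, 0.5, 0.6, 0.7):  # candidate NMS IoU thresholds
    per_image = []
    for b, s in val_predictions:
        keep = nms(b, s, iou_threshold=iou_thresh)
        per_image.append(val_metric(b[keep], s[keep]))
    score = sum(per_image) / len(per_image)
    if score > best_score:
        best_iou, best_score = iou_thresh, score

print("selected NMS IoU threshold:", best_iou)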

Best,
Michael
