
Error during training #25

Closed
theodupuis opened this issue Aug 19, 2021 · 7 comments

Comments

@theodupuis

Bug: during the training phase, the following error is raised:
File "/anaconda/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py", line 161, in scale
assert outputs.is_cuda or outputs.device.type == 'xla'
AssertionError
Exception ignored in: <function tqdm.__del__ at 0x7f9ba338de50>
Traceback (most recent call last):
File "/anaconda/lib/python3.8/site-packages/tqdm/std.py", line 1145, in __del__
File "/anaconda/lib/python3.8/site-packages/tqdm/std.py", line 1299, in close
File "/anaconda/lib/python3.8/site-packages/tqdm/std.py", line 1492, in display
File "/anaconda/lib/python3.8/site-packages/tqdm/std.py", line 1148, in __str__
File "/anaconda/lib/python3.8/site-packages/tqdm/std.py", line 1450, in format_dict
TypeError: cannot unpack non-iterable NoneType object

Environment
Environment set up from source, not Docker.
Cmd: nndet_train 1000 --sweep

It seems the issue is related to TensorMetric not being moved to the CUDA device, the same issue as addressed in Lightning-AI/pytorch-lightning#2274.
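For context, a minimal repro sketch of the assertion itself (not from the original run; it assumes a CUDA-capable machine and a PyTorch version comparable to the one in the traceback): torch.cuda.amp.GradScaler.scale only accepts tensors on a CUDA (or XLA) device, so a loss or metric tensor that ended up on the CPU triggers exactly the AssertionError above.

import torch

# On a machine with CUDA available, GradScaler is enabled and scale() asserts
# that the tensor it receives lives on a CUDA (or XLA) device.
scaler = torch.cuda.amp.GradScaler()

cpu_loss = torch.tensor(1.0, requires_grad=True)  # tensor that stayed on the CPU

try:
    scaler.scale(cpu_loss)
except AssertionError as err:
    # Same "assert outputs.is_cuda or outputs.device.type == 'xla'" as in the traceback.
    print("AssertionError from GradScaler.scale:", err)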

@theodupuis
Author

This has been addressed in recent commits.

@mibaumgartner
Collaborator

Hi @theodupuis,

I'll look into a better solution for this one.

The temporary fix is to set move_metrics_to_cpu=False, but I'm not really happy with that. If you encounter any memory leaks, set it back to True and downgrade lightning for now.
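For reference, move_metrics_to_cpu is a PyTorch Lightning Trainer flag; nnDetection builds its Trainer internally, so the following is only a minimal Lightning sketch of where such a flag would go (the gpus/precision values are assumptions), not the actual nnDetection code path:

import pytorch_lightning as pl

# Minimal illustrative Trainer; the relevant part is move_metrics_to_cpu.
# False keeps metrics on the GPU (so no CPU tensor reaches GradScaler.scale),
# True lowers GPU memory use at the cost of the issue described above.
trainer = pl.Trainer(
    gpus=1,
    precision=16,
    move_metrics_to_cpu=False,
)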

Best,
Michael

@theodupuis
Author

theodupuis commented Aug 19, 2021 via email

@theodupuis
Author

theodupuis commented Aug 20, 2021 via email

@mibaumgartner
Collaborator

Hi,

the training time of nnDetection should be roughly the same for most data sets (there are some exceptions): about 2 days with the mixed-precision 3D-conv speedup and about 4 days without it. Your time sounds quite slow, though. Generally speaking, there are two possible reasons:

  1. PyTorch < 1.9 did not provide the training speedup for mixed-precision 3D convs in its pip-installable version, so it was necessary to build it from source. I haven't tested PyTorch 1.9 yet. (The Docker build of nnDetection also provides the speedup.)

  2. There is a bottleneck in your configuration/setup. This can be identified as follows (see also the sketch after this list):
    - Check the GPU utilization: it should be high most of the time. If it isn't, there is either a CPU or an IO bottleneck; if it is high, the missing PyTorch speedup is the cause.
    - Check the CPU utilization: if it is high (and the GPU utilization isn't), more CPU threads are needed for augmentation (can be adjusted via det_num_threads and depends on your CPU).
    - If both GPU and CPU utilization are low, it is an IO bottleneck; it is quite hard to do anything about this (a typical SSD with ~500 MB/s read speed ran fine for my experiments).
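A minimal monitoring sketch for these checks, assuming nvidia-smi is on the PATH and psutil is installed (neither is shipped with nnDetection); run it in a second terminal while training:

import subprocess
import time

import psutil  # third-party: pip install psutil

def gpu_util_percent() -> int:
    # Current utilization of the first GPU, as reported by nvidia-smi.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"]
    )
    return int(out.decode().strip().splitlines()[0])

while True:
    cpu = psutil.cpu_percent(interval=1.0)  # averaged over one second
    gpu = gpu_util_percent()
    print(f"CPU {cpu:5.1f} %  |  GPU {gpu:3d} %")
    time.sleep(4)

If the CPU turns out to be the bottleneck, det_num_threads is set as an environment variable for the training call, e.g. det_num_threads=12 nndet_train 1000 --sweep (the exact value depends on your CPU).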

Best,
Michael

@theodupuis
Author

theodupuis commented Aug 20, 2021 via email

@mibaumgartner
Collaborator

Hi @theodupuis,

during the training process only a single model is trained. The empirical parameters refer to several postprocessing parameters (e.g. the IoU threshold for NMS and the IoU threshold for Weighted Box Clustering) which do not require additional models (it is not a classical AutoML approach where models are trained several times). Those parameters are optimized by empirically trying them on the validation data.
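As a simplified illustration of that empirical sweep (a sketch only, not nnDetection's actual code; the prediction layout and the metric below are placeholders), each candidate threshold is applied to the validation predictions and the best-scoring value is kept:

import torch
from torchvision.ops import nms

def val_metric(boxes, scores):
    # Placeholder metric; nnDetection scores its sweep with proper detection metrics.
    return scores.mean().item() if len(scores) else 0.0

# One dummy validation image: boxes in [x1, y1, x2, y2] format plus confidences.
xy = torch.rand(20, 2) * 80
boxes = torch.cat([xy, xy + torch.rand(20, 2) * 20 + 1], dim=1)
scores = torch.rand(20)
val_predictions = [(boxes, scores)]

best_iou, best_score = None, float("-inf")
for iou_thresh in (0.3, 0.4, 0.5, 0.6, 0.7):  # candidate NMS IoU thresholds
    per_image = []
    for b, s in val_predictions:
        keep = nms(b, s, iou_threshold=iou_thresh)
        per_image.append(val_metric(b[keep], s[keep]))
    score = sum(per_image) / len(per_image)
    if score > best_score:
        best_iou, best_score = iou_thresh, score

print("selected NMS IoU threshold:", best_iou)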

Best,
Michael
