
Error after a few iterations of training #34

Open
yaju1234 opened this issue Feb 17, 2023 · 5 comments

@yaju1234

I want to train the network with --arch 7 on my custom 62k dataset, which is similar to DUTS. I am using a 48GB CUDA GPU and batch size 8. After a few iterations, I get the following error:
Traceback (most recent call last):
  File "main.py", line 55, in <module>
    main(args)
  File "main.py", line 35, in main
    Trainer(args, save_path)
  File "/root/TRACER/trainer.py", line 58, in __init__
    train_loss, train_mae = self.training(args)
  File "/root/TRACER/trainer.py", line 117, in training
    loss.backward()
  File "/root/TRACER/venv/lib/python3.6/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/root/TRACER/venv/lib/python3.6/site-packages/torch/autograd/__init__.py", line 156, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[Screenshot: 2023-02-17 06-17-54]
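For reference, the message suggests rerunning with CUDA_LAUNCH_BLOCKING=1 so the failing kernel shows up at the right point in the traceback. A minimal sketch, assuming the variable is set before torch initializes CUDA (e.g. at the very top of main.py; not part of the repository):

# Force synchronous kernel launches so the real failing op appears in the trace.
# Must run before any CUDA work, so place it before the torch import.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after the environment variable is set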

@Karel911
Owner

Please change the batch size or num_workers.
The error mainly originates from CUDA, not the code.

@yaju1234
Author

I have changed both the batch size and num_workers but still get the same error. I am using a Quadro RTX 8000 48GB.

@Karel911
Owner

Did you use a single GPU? If so, set multi_gpu=False in the config.
Then try num_workers=0. When we tried it on the same device, we could not reproduce the error.
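For illustration, a single-GPU run without worker processes looks roughly like the sketch below (placeholder model and dataset with assumed names and shapes, not the actual TRACER trainer code):

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and dataset, only to make the sketch self-contained.
model = nn.Conv2d(3, 1, kernel_size=3, padding=1)
train_dataset = TensorDataset(torch.rand(16, 3, 320, 320), torch.rand(16, 1, 320, 320))

# multi_gpu=False: move the model to one device and skip the nn.DataParallel wrapper.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# num_workers=0 keeps data loading in the main process, which removes
# worker-process issues from the picture while debugging.
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True, num_workers=0)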

@yaju1234
Author

Now I am getting the following error:
Traceback (most recent call last):
  File "main.py", line 55, in <module>
    main(args)
  File "main.py", line 35, in main
    Trainer(args, save_path)
  File "/home/user/PycharmProjects/TRACER/trainer.py", line 56, in __init__
    train_loss, train_mae = self.training(args)
  File "/home/user/PycharmProjects/TRACER/trainer.py", line 106, in training
    loss1 = self.criterion(outputs, masks)
  File "/home/user/PycharmProjects/TRACER/util/losses.py", line 41, in adaptive_pixel_intensity_loss
    bce = F.binary_cross_entropy(pred, mask, reduce=None)
  File "/home/user/PycharmProjects/TRACER/venv/lib/python3.6/site-packages/torch/nn/functional.py", line 2915, in binary_cross_entropy
    return torch._C._nn.binary_cross_entropy(input, target, weight, reduction_enum)
RuntimeError: all elements of input should be between 0 and 1
[Screenshot: 2023-02-21 22-49-32]

@Karel911
Owner

It seems there are negative or NaN values in your outputs or labels.
Please check the values.
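One quick way to check is to print the range of both tensors right before the loss call and, since binary_cross_entropy expects inputs in [0, 1], squash raw logits first. A minimal sketch with placeholder tensors (assumed shapes, not the repository's code):

import torch
import torch.nn.functional as F

# Placeholder tensors standing in for the network outputs and ground-truth masks.
outputs = torch.randn(8, 1, 320, 320)   # raw logits can fall outside [0, 1]
masks = torch.rand(8, 1, 320, 320)

# Check for NaNs and out-of-range values in both tensors.
for name, t in (("outputs", outputs), ("masks", masks)):
    print(name, t.min().item(), t.max().item(), torch.isnan(t).any().item())

# If the outputs are raw logits, a sigmoid keeps them in [0, 1] before the loss;
# torch.autograd.set_detect_anomaly(True) can also help locate the op that first
# produces NaN during training.
pred = torch.sigmoid(outputs)
loss = F.binary_cross_entropy(pred, masks)
print("loss:", loss.item())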
