
nan loss after 5 epochs on custom dataset #2

Closed · makaveli10 opened this issue Aug 8, 2022 · 18 comments
Labels: bug, custom dataset

@makaveli10 commented Aug 8, 2022

Hi,
Thanks for sharing your work.
I was training on a custom dataset, and the losses become nan after 6 epochs. I tried reducing the learning rate, but that didn't help either. @abhi1kumar, I am wondering whether you encountered this issue while training.

INFO  ------ TRAIN EPOCH 006 ------
INFO  Learning Rate: 0.001250
INFO  Weights:  depth_:nan, heading_:nan, offset2d_:1.0000, offset3d_:nan, seg_:1.0000, size2d_:1.0000, size3d_:nan,
INFO  BATCH[0020/3150] depth_loss:nan, heading_loss:nan, offset2d_loss:nan, offset3d_loss:nan, seg_loss:nan, size2d_loss:nan, size3d_loss:nan,
INFO  BATCH[0040/3150] depth_loss:nan, heading_loss:nan, offset2d_loss:nan, offset3d_loss:nan, seg_loss:nan, size2d_loss:nan, size3d_loss:nan,
INFO  BATCH[0060/3150] depth_loss:nan, heading_loss:nan, offset2d_loss:nan, offset3d_loss:nan, seg_loss:nan, size2d_loss:nan, size3d_loss:nan,
INFO  BATCH[0080/3150] depth_loss:nan, heading_loss:nan, offset2d_loss:nan, offset3d_loss:nan, seg_loss:nan, size2d_loss:nan, size3d_loss:nan,
INFO  BATCH[0100/3150] depth_loss:nan, heading_loss:nan, offset2d_loss:nan, offset3d_loss:nan, seg_loss:nan, size2d_loss:nan, size3d_loss:nan,
INFO  BATCH[0120/3150] depth_loss:nan, heading_loss:nan, offset2d_loss:nan, offset3d_loss:nan, seg_loss:nan, size2d_loss:nan, size3d_loss:nan,

Before epoch 6, the losses decrease as expected.
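
For anyone hitting the same failure, here is a minimal debugging sketch in plain PyTorch (not part of the DEVIANT codebase; compute_losses and the loader/optimizer arguments are hypothetical names) that stops at the first batch whose loss turns non-finite and lets autograd report the offending op:

import torch

def train_one_epoch_with_nan_guard(model, train_loader, optimizer):
    """Run one epoch and abort at the first batch whose loss is NaN or Inf."""
    torch.autograd.set_detect_anomaly(True)  # makes backward() name the op that produced NaN/Inf
    for step, (inputs, targets) in enumerate(train_loader):
        loss_terms = model.compute_losses(inputs, targets)  # hypothetical: dict of scalar tensors
        total_loss = sum(loss_terms.values())

        if not torch.isfinite(total_loss):
            bad = {name: float(value) for name, value in loss_terms.items()
                   if not torch.isfinite(value)}
            raise RuntimeError(f"Non-finite loss at step {step}: {bad}")

        optimizer.zero_grad()
        total_loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)  # optional safeguard
        optimizer.step()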

@abhi1kumar (Owner) commented Aug 8, 2022

Hi @makaveli10
Thank you for showing interest in our work. Here are a few things I would try:

  • Please try training on the KITTI dataset first and see if you re-encounter this error.
  • Next, check whether the input label files of your dataset are in the KITTI format. I suspect they are not, since the weights for the 3D terms seem to be wrong. It would also help to visualize the labels on top of the images (see the sketch after this list).
  • Finally, check that the data resolution fed to the model is correct.
  • Try changing the seed (maybe this is a one-in-a-million chance).
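
A minimal sketch of such a label sanity check, assuming standard KITTI training label files (one object per line, 15 whitespace-separated fields: type, truncation, occlusion, alpha, 2D box, 3D dimensions h/w/l, 3D location x/y/z, rotation_y); LABEL_DIR is a placeholder path:

import glob

LABEL_DIR = "data/custom/training/label_2"  # placeholder path to your label files

for path in sorted(glob.glob(f"{LABEL_DIR}/*.txt")):
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            fields = line.split()
            if not fields:
                continue
            if len(fields) not in (15, 16):  # 16 if an optional score column is present
                print(f"{path}:{line_no} has {len(fields)} fields, expected 15")
                continue
            h, w, l = map(float, fields[8:11])   # 3D dimensions in meters
            x, y, z = map(float, fields[11:14])  # 3D location in the camera frame
            if fields[0] != "DontCare" and min(h, w, l) <= 0:
                print(f"{path}:{line_no} has non-positive 3D dimensions {h}, {w}, {l}")
            if fields[0] != "DontCare" and z <= 0:
                print(f"{path}:{line_no} has non-positive depth z={z}")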

@abhi1kumar added the enhancement label Aug 8, 2022
@makaveli10 (Author) commented

Thanks for your quick response @abhi1kumar.

  • I have used this custom dataset with mmdetection3d, and it gives the expected results.
  • As for the resolution, I am using the resolution from the configuration files; my images are (1224, 370), which is the image size in the KITTI config.
  • I also tried visualizing the labels on top of the images, and they look exactly right. I can share those with you if you want.
  • Other than that, reducing the learning rate seems to have no effect on this issue.

Thanks

@abhi1kumar (Owner) commented Aug 8, 2022

This is pretty strange.

I have used this custom dataset with mmdetection3d, and it gives the expected results.

Our DEVIANT codebase is essentially a fork of the GUPNet codebase, which is not as mature as mmdetection3d.

Did you try with KITTI? KITTI is a small download and should be easy to run.

As for the resolution, I am using the resolution from the configuration files; my images are (1224, 370), which is the image size in the KITTI config.

That is fine.

I also tried visualizing the labels on top of the images, and they look exactly right. I can share those with you if you want.

I hope you plotted with our plot/plot_qualitative_output.py using the --dataset kitti --show_gt_in_image option.
You have to change the paths on these lines for your dataset.

Other than that, reducing the learning rate seems to have no effect on this issue.

Your 3D dimensions are exploding as well, which is alarming. Could you try switching off the depth and projected-center parts of the loss and see if you still encounter NaNs?

Also, do your labels contain only the three KITTI classes, or do they contain more? The DEVIANT dataloader for KITTI supports three classes with the following dimensions. See here.
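
A minimal sketch of what switching those terms off could look like, assuming the trainer sums weighted loss terms from a dict; the names below are placeholders, and the actual DEVIANT/GUPNet code organizes this differently:

# Zero the weights of the suspect terms to isolate the source of the NaNs.
loss_weights = {
    'seg': 1.0, 'offset2d': 1.0, 'size2d': 1.0,   # 2D terms left on
    'depth': 0.0,                                  # switched off for debugging
    'offset3d': 0.0,                               # projected 3D center, switched off
    'size3d': 1.0, 'heading': 1.0,
}

def total_loss(loss_terms, weights=loss_weights):
    # Skip zero-weight terms entirely so a NaN term cannot leak in via 0 * nan = nan.
    return sum(weights[name] * value
               for name, value in loss_terms.items() if weights[name] > 0)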

@abhi1kumar changed the title from "nan loss after 5 epochs" to "nan loss after 5 epochs on custom dataset" Aug 8, 2022
@xingshuohan commented

I met the same issue when training on KITTI...

@abhi1kumar (Owner) commented

I met the same issue when training on KITTI...

  • Since I am not able to reproduce your issue on our servers, could you paste the training log here?
  • Also, are you able to reproduce our Val 1 numbers by running inference on the KITTI Val 1 model?

@xingshuohan commented

20220830_022920.txt

BTW, I only modified the training and validation splits in the ImageSets folder.

I can successfully run the inference code.

@abhi1kumar (Owner) commented Aug 31, 2022

I can successfully run the inference code.

That is great.

BTW, I only modified the training and validation splits in the ImageSets folder.

I also see that you use a bigger batch size. I do not think switching to a different KITTI data split should be an issue. However, our DEVIANT codebase is essentially a fork of the GUPNet codebase, which is not very robust. My best guess is that there is a bug in the GUPNet code, or maybe the seed is the problem.

Please try re-running the experiment or switching to a different seed.

@xingshuohan commented

Thanks very much for your reply.

@abhi1kumar (Owner) commented Sep 3, 2022

Thanks very much for your reply.

Did your problem get solved? In other words, are you able to train your model on a different KITTI split?

@xingshuohan commented

Thanks very much for your reply.

Did your problem get solved? In other words, are you able to train your model on a different KITTI split?

I am so sorry, I am out of the lab these days. I will give you feedback ASAP.

@makaveli10 (Author) commented

@abhi1kumar Sorry for the late response. I still have to test your suggestions. I'll get back to you, thanks a lot.

@xingshuohan commented

Hi, yesterday I trained the model and it works now. The reason is that the number of training samples needs to be divisible by batch_size; otherwise the loss in the last batch is computed on only a handful of samples (possibly a single image).

@abhi1kumar (Owner) commented

Hi, yesterday I trained the model and it works now. The reason is that the number of training samples needs to be divisible by batch_size; otherwise the loss in the last batch is computed on only a handful of samples (possibly a single image).

Great to know. Could you let us know what values you used for the batch size and the number of training samples?

@xingshuohan commented

Hi, yesterday I trained the model and it works now. The reason is that the number of training samples needs to be divisible by batch_size; otherwise the loss in the last batch is computed on only a handful of samples (possibly a single image).

Great to know. Could you let us know what values you used for the batch size and the number of training samples?

My case has 5985 training images from KITTI, so the batch size is set to 15 (5985 / 15 = 399).
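
An alternative to hand-picking a batch size that divides the training set is to drop the incomplete final batch in the data loader. A minimal sketch in plain PyTorch, assuming you build the loader yourself (the dataset argument and batch size are placeholders, not the DEVIANT config):

from torch.utils.data import DataLoader

def make_train_loader(train_dataset, batch_size=16):
    # drop_last=True discards the final, smaller batch instead of training on it,
    # so batch-averaged statistics are never computed on just one or two images.
    return DataLoader(train_dataset,
                      batch_size=batch_size,
                      shuffle=True,
                      num_workers=4,
                      drop_last=True)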

@15171452351 commented

Hi, I have the same problem as you. How did you solve it?

@zhaowei0315 commented

Hi, I set batch size = 1, but the nan loss issue still appears after ~5 epochs. Any solution here?

@abhi1kumar (Owner) commented

Hi, I have the same problem as you. How did you solve it?

Hi, I set batch size = 1, but the nan loss issue still appears after ~5 epochs. Any solution here?

@15171452351 @zhaowei0315 The NaN issue happens because of the empty images in the training set. Please remove the empty images (images which do not have any objects inside) from the training set and then train the model.

Please see here for more details.
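
For reference, a minimal sketch of how one could list the empty frames of a KITTI-style split so they can be removed; the two paths are placeholders:

import os

LABEL_DIR = "data/KITTI/training/label_2"      # placeholder path to label files
SPLIT_FILE = "data/KITTI/ImageSets/train.txt"  # placeholder path to the split file

empty_frames = []
for frame in open(SPLIT_FILE).read().split():
    with open(os.path.join(LABEL_DIR, frame + ".txt")) as f:
        # Count only real objects; DontCare boxes are not foreground objects.
        objects = [line for line in f if line.split() and line.split()[0] != "DontCare"]
    if not objects:
        empty_frames.append(frame)

print(f"{len(empty_frames)} empty frames to remove from the split:", empty_frames)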

@abhi1kumar added the bug label Mar 1, 2023
@abhi1kumar (Owner) commented Mar 10, 2023

@15171452351 @zhaowei0315
The GUPNet codebase does not compute the 2D and 3D losses when there are empty images (no foreground objects) in a batch. We fixed this bug with this commit, so you no longer need to remove empty images from your training set.
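
As an illustration of the kind of guard such a fix needs (a hedged sketch, not the actual commit), a per-term loss can fall back to a zero that stays on the autograd graph when a batch has no foreground objects, instead of dividing by zero:

import torch

def masked_mean_loss(per_object_loss, foreground_mask):
    """Average a per-object loss over foreground objects only.

    per_object_loss: tensor of shape (N,); foreground_mask: float tensor of shape (N,).
    If the batch contains no foreground objects (all images empty), return a zero
    that is still connected to the graph instead of 0 / 0 = nan.
    """
    num_fg = foreground_mask.sum()
    if num_fg == 0:
        return per_object_loss.sum() * 0.0
    return (per_object_loss * foreground_mask).sum() / num_fg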

@abhi1kumar added the custom dataset label and removed the enhancement label Jan 16, 2024