nan loss after 5 epochs on custom dataset #2
Comments
Hi @makaveli10
Thanks for your quick response @abhi1kumar.
This is pretty strange.
Our DEVIANT codebase is essentially a fork of the GUPNet codebase, which is not as mature as mmdetection3d. Did you try with KITTI? KITTI is a small download and should be easy to run.
It is fine. I hope you checked the plots with our code.
Your 3D dimensions are exploding as well, which is alarming to me. Could you try switching off the depth and projected-center terms in the loss and see if you still encounter NaNs? Also, do your labels contain the three classes, or do they contain more? The DEVIANT dataloader for KITTI supports three classes with the following dimensions. See here
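If the custom labels contain extra classes, one quick check is to filter them down to the three supported ones before training. A minimal sketch, assuming KITTI-format label files (plain text, one object per line, class name first); the paths, helper name, and class set here are illustrative assumptions, not the DEVIANT dataloader's actual API:

```python
from pathlib import Path

SUPPORTED = {"Car", "Pedestrian", "Cyclist"}  # assumed supported class set

def filter_labels(label_dir: str, out_dir: str) -> None:
    # Copy each KITTI-style label file, keeping only lines whose class
    # name (the first whitespace-separated token) is in SUPPORTED.
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for label_file in sorted(Path(label_dir).glob("*.txt")):
        kept = []
        for line in label_file.read_text().splitlines():
            parts = line.split()
            if parts and parts[0] in SUPPORTED:
                kept.append(line)
        (out / label_file.name).write_text(("\n".join(kept) + "\n") if kept else "")
```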
I met the same issue when training on KITTI...
BTW, I only modified the training and validation splits in the ImageSets folder. I can successfully run the inference code.
That is great.
I also see that you use a bigger batch size. I do not think switching to a different KITTI data split should be an issue. However, our DEVIANT codebase is essentially a fork of the GUPNet codebase, which is not robust. My best guess is that there is a bug in the GUPNet code, or maybe the seed is the problem. Please try re-running the experiment or switching to a different seed.
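For reference, a minimal sketch of pinning the seeds in a PyTorch training script; these are standard PyTorch/NumPy calls, not DEVIANT-specific code, and the seed value is arbitrary:

```python
import random
import numpy as np
import torch

def set_seed(seed: int) -> None:
    # Seed every RNG the training loop may touch.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(444)  # try a different value if training still diverges
```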
Thanks much for your reply.
Did your problem get solved? In other words, are you able to train your model on a different KITTI split?
I am so sorry, I am out of the lab these days. I will give you feedback ASAP.
@abhi1kumar Sorry for the late response. I still have to test your suggestions. I'll get back to you, thanks a lot.
Hi, yesterday I trained the model and it works now. The reason is that the number of training samples needs to be divisible by the batch size; otherwise, the loss is calculated on a single sample in the last batch.
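A standard way to avoid a stray single-sample last batch, without touching the split itself, is PyTorch's drop_last flag. A minimal, self-contained sketch with a dummy dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset of 25 samples; with batch_size=8 the last batch would
# hold a single sample (25 = 3*8 + 1), which can break BatchNorm stats.
dataset = TensorDataset(torch.randn(25, 3), torch.randint(0, 3, (25,)))
loader = DataLoader(dataset, batch_size=8, shuffle=True, drop_last=True)

for images, labels in loader:
    print(images.shape)  # three batches of 8; the stray sample is dropped
```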
Great to know. What values did you use for the batch size and the number of training samples?
Hi, I have the same problem as you. How did you solve it?
Hi, I set batch size = 1, but the NaN loss still appears after ~5 epochs. Any solution here?
@15171452351 @zhaowei0315 The NaN issue happens because of empty images in the training set. Please remove the empty images (images which do not have any objects in them) from the training set and then train the model. Please see here for more details.
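A minimal sketch of pruning empty frames from a KITTI-style train split, keeping only frame IDs whose label file contains at least one object line; the directory layout here is an assumption, adjust to your setup:

```python
from pathlib import Path

label_dir = Path("data/KITTI/training/label_2")     # assumed layout
split_file = Path("data/KITTI/ImageSets/train.txt")

frame_ids = split_file.read_text().split()
# Keep frames whose label file is non-empty after stripping whitespace.
non_empty = [fid for fid in frame_ids
             if (label_dir / f"{fid}.txt").read_text().strip()]
split_file.write_text("\n".join(non_empty) + "\n")
print(f"kept {len(non_empty)} of {len(frame_ids)} frames")
```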
Hi,
Thanks for sharing your work. I was training on a custom dataset. The losses after 6 epochs are NaN. I tried reducing the learning rate, but that didn't help either. Wondering if @abhi1kumar you encountered this issue while training. Before epoch 6, the losses decrease as expected.
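As a general debugging aid for cases like this, a minimal sketch of catching the first non-finite loss in a PyTorch training loop; the tiny model and data below are stand-ins, not DEVIANT code:

```python
import torch
import torch.nn as nn

torch.autograd.set_detect_anomaly(True)  # slower, but reports the op producing NaN/Inf

model = nn.Linear(4, 1)                  # stand-in for the real detector
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    x, y = torch.randn(8, 4), torch.randn(8, 1)   # stand-in batch
    loss = nn.functional.mse_loss(model(x), y)
    if not torch.isfinite(loss):
        raise RuntimeError(f"non-finite loss {loss.item()} at step {step}")
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)  # optional safeguard
    optimizer.step()
```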