The loss does not decrease during training #10

Closed
developer-mayuan opened this issue Jan 11, 2018 · 15 comments
@developer-mayuan

Hi natanielruiz:

I have been trying to reproduce your paper's results over the past few days, but I cannot get the loss to decrease when training your model on the 300W_LP dataset. I used the same parameters you provided in your paper:

alpha = 1, lr = 1e-5, and the default parameters for the Adam optimizer.

I ran your network for 25 epochs and the yaw loss oscillates around 3000, which means the MSE loss is still far too large for the yaw angle.

Do you have any idea how to debug the network or solve this issue? Thank you very much for your help!
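For anyone comparing setups, here is a minimal, self-contained sketch of those settings in PyTorch. The model is a toy stand-in rather than the actual ResNet-50 based network, and the combined loss follows my reading of the paper: cross-entropy over binned angles plus an alpha-weighted MSE on the expected angle, assuming 66 bins of 3 degrees covering -99 to +99.

import torch
import torch.nn as nn

# Toy stand-in for a single angle head (yaw); the real network has three heads.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 66))

alpha = 1.0
criterion_cls = nn.CrossEntropyLoss()
criterion_reg = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # default betas=(0.9, 0.999), eps=1e-8
idx_tensor = torch.arange(66, dtype=torch.float32)

def combined_loss(logits, bin_labels, cont_labels):
    """Cross-entropy on the binned angle plus alpha * MSE on the expected angle."""
    loss_cls = criterion_cls(logits, bin_labels)
    probs = torch.softmax(logits, dim=1)
    pred_deg = torch.sum(probs * idx_tensor, dim=1) * 3 - 99  # bin index -> degrees
    return loss_cls + alpha * criterion_reg(pred_deg, cont_labels)

# One toy training step with random data, just to show the wiring.
images = torch.randn(8, 3, 64, 64)
bin_labels = torch.randint(0, 66, (8,))
cont_labels = bin_labels.float() * 3 - 99
loss = combined_loss(model(images), bin_labels, cont_labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()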

@natanielruiz
Owner

This is very strange.

If the loss doesn't go down in the first epoch it won't go down afterwards. Try lower learning rates; it looks like training is diverging. The learning rate you're using should be the correct one, so this is very strange. If anyone else runs into the same problem, please report it here.

@numitors

@developer-mayuan I just managed to run the training without crashing (using the fix you pointed out), but I seem to be running into a similar problem. This is the output I get:

python2 code/train_hopenet.py --output_string=mine --data_dir=300W_LP --filename_list=300W_LP_filename_filtered.txt --alpha=1 --lr=0.00001
Loading data.
Ready to train network.
Epoch [1/5], Iter [100/7650] Losses: Yaw 2259.5078, Pitch 240.8887, Roll 393.4849
Epoch [1/5], Iter [200/7650] Losses: Yaw 980.5406, Pitch 197.7521, Roll 250.1536
Epoch [1/5], Iter [300/7650] Losses: Yaw 464.2857, Pitch 79.0749, Roll 81.2777
Epoch [1/5], Iter [400/7650] Losses: Yaw 277.2566, Pitch 81.5020, Roll 43.5987
Epoch [1/5], Iter [500/7650] Losses: Yaw 340.9163, Pitch 74.6796, Roll 27.2036
Epoch [1/5], Iter [600/7650] Losses: Yaw 189.3556, Pitch 77.7477, Roll 33.9641
Epoch [1/5], Iter [700/7650] Losses: Yaw 198.5089, Pitch 77.3002, Roll 133.2880
Epoch [1/5], Iter [800/7650] Losses: Yaw 190.2300, Pitch 31.1991, Roll 77.3836
Epoch [1/5], Iter [900/7650] Losses: Yaw 159.5810, Pitch 79.7303, Roll 241.9788
Epoch [1/5], Iter [1000/7650] Losses: Yaw 208.6988, Pitch 46.0669, Roll 69.5470
Epoch [1/5], Iter [1100/7650] Losses: Yaw 95.6048, Pitch 49.7742, Roll 87.0339
Epoch [1/5], Iter [1200/7650] Losses: Yaw 92.9210, Pitch 42.5244, Roll 81.0443
Epoch [1/5], Iter [1300/7650] Losses: Yaw 120.5021, Pitch 44.2593, Roll 122.6854
Epoch [1/5], Iter [1400/7650] Losses: Yaw 479.7826, Pitch 24.0390, Roll 76.1425
Epoch [1/5], Iter [1500/7650] Losses: Yaw 44.8623, Pitch 44.0322, Roll 73.1824
Epoch [1/5], Iter [1600/7650] Losses: Yaw 45.7050, Pitch 26.3968, Roll 118.1652
Epoch [1/5], Iter [1700/7650] Losses: Yaw 101.2617, Pitch 48.9110, Roll 165.2226
Epoch [1/5], Iter [1800/7650] Losses: Yaw 101.9753, Pitch 159.2361, Roll 324.3020
Epoch [1/5], Iter [1900/7650] Losses: Yaw 45.1358, Pitch 44.4955, Roll 116.9922
Epoch [1/5], Iter [2000/7650] Losses: Yaw 108.4551, Pitch 35.4727, Roll 105.5859
Epoch [1/5], Iter [2100/7650] Losses: Yaw 39.3804, Pitch 28.0998, Roll 150.7015
Epoch [1/5], Iter [2200/7650] Losses: Yaw 16.8714, Pitch 23.9571, Roll 61.9345
Epoch [1/5], Iter [2300/7650] Losses: Yaw 79.7481, Pitch 18.3805, Roll 58.4298

It just keeps oscillating. I have also tried different values for alpha and lr, but none of those seem to yield any meaningful outcome.

@developer-mayuan
Author

@numitors My loss is much larger than yours. You could wait a few more epochs to see whether the yaw loss drops. In my case the yaw loss oscillates around 3000, which means the network has learned essentially nothing...

@developer-mayuan
Author

I found a bug in my modified code; the training works now. I will close this issue.

@numitors

@developer-mayuan What was your issue exactly? I also see the behavior you are describing, which probably just depends on the initialization.


Epoch [1/5], Iter [100/7650] Losses: Yaw 2680.6248, Pitch 170.9742, Roll 140.1705
Epoch [1/5], Iter [200/7650] Losses: Yaw 2461.8210, Pitch 255.6145, Roll 112.2997
Epoch [1/5], Iter [300/7650] Losses: Yaw 3699.8623, Pitch 236.5356, Roll 362.8195
Epoch [1/5], Iter [400/7650] Losses: Yaw 2488.0469, Pitch 184.1573, Roll 118.9639
Epoch [1/5], Iter [500/7650] Losses: Yaw 2107.6182, Pitch 151.6767, Roll 173.3512
Epoch [1/5], Iter [600/7650] Losses: Yaw 3123.3323, Pitch 216.2974, Roll 147.0356
Epoch [1/5], Iter [700/7650] Losses: Yaw 2840.7883, Pitch 179.2066, Roll 212.2754
Epoch [1/5], Iter [800/7650] Losses: Yaw 3289.1289, Pitch 175.0063, Roll 111.8380
Epoch [1/5], Iter [900/7650] Losses: Yaw 2591.9697, Pitch 121.3488, Roll 77.0318
Epoch [1/5], Iter [1000/7650] Losses: Yaw 3115.7188, Pitch 558.8232, Roll 383.3249
Epoch [1/5], Iter [1100/7650] Losses: Yaw 3913.2673, Pitch 405.8267, Roll 236.7603
Epoch [1/5], Iter [1200/7650] Losses: Yaw 3796.3154, Pitch 182.6230, Roll 272.9708
Epoch [1/5], Iter [1300/7650] Losses: Yaw 3453.5103, Pitch 193.3578, Roll 103.1721
Epoch [1/5], Iter [1400/7650] Losses: Yaw 3417.9277, Pitch 119.0170, Roll 179.9921
Epoch [1/5], Iter [1500/7650] Losses: Yaw 2122.2136, Pitch 278.0559, Roll 233.2100
Epoch [1/5], Iter [1600/7650] Losses: Yaw 2530.1829, Pitch 121.5356, Roll 66.9388
Epoch [1/5], Iter [1700/7650] Losses: Yaw 3053.0569, Pitch 215.7416, Roll 111.5422

And it goes like that on and on.

@developer-mayuan
Author

@numitors I think this issue is indeed related to the initialization. You may need to try several times.
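If training really is that sensitive to the initialization, it can also help to fix the random seeds so that a run which converges is reproducible instead of a matter of luck. A minimal sketch (full determinism additionally depends on cuDNN settings):

import random

import numpy as np
import torch

def set_seed(seed: int) -> None:
    """Seed the RNGs that affect weight initialization and data shuffling."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)  # re-run with a different seed if the first-epoch loss diverges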

@kalyo-zjl

Hi @developer-mayuan,

What results did you get when training on the 300W_LP dataset yourself? I managed to train the model, but I am facing the same problem you did, i.e. training is sensitive to the initialization.
After trying several times, the network finally starts to converge. However, even after 13 epochs the loss still oscillates badly (the yaw loss is relatively stable).
I also noticed that the learning rate does not decrease at all during training, is that intended? @natanielruiz

@developer-mayuan
Author

developer-mayuan commented Feb 13, 2018

@kalyo-zjl I can get my losses very low after 25 epochs (around 15 for each angle). You could try decreasing the learning rate, e.g. every 15 epochs, to see if that helps.
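A minimal sketch of that idea with PyTorch's built-in StepLR scheduler; the model here is only a placeholder, not the actual pose network:

import torch
import torch.nn as nn

model = nn.Linear(10, 3)  # placeholder for the pose network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.1)

for epoch in range(25):
    # ... run one training epoch with `optimizer` here ...
    optimizer.step()   # stands in for the per-batch updates of a real epoch
    scheduler.step()   # multiplies the learning rate by 0.1 every 15 epochs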

@developer-mayuan
Author

@kalyo-zjl By the way, I suggest using TensorBoard to visualize the loss instead of just printing it to the console.
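A minimal sketch of what that could look like with a SummaryWriter (this uses torch.utils.tensorboard; the older tensorboardX package has the same add_scalar API; the values below are just the first few losses from the log above):

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/hopenet")  # hypothetical log directory

# In the real training loop these would be the per-iteration yaw/pitch/roll losses.
example_losses = [(2259.5, 240.9, 393.5), (980.5, 197.8, 250.2), (464.3, 79.1, 81.3)]
for step, (loss_yaw, loss_pitch, loss_roll) in enumerate(example_losses):
    writer.add_scalar("loss/yaw", loss_yaw, step)
    writer.add_scalar("loss/pitch", loss_pitch, step)
    writer.add_scalar("loss/roll", loss_roll, step)
writer.close()

# Then inspect with: tensorboard --logdir runs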

@kalyo-zjl

Hi @developer-mayuan,
Thanks for your reply.
What do you mean by "around 50 for each degree"? My training loss is shown below. As you can see, the pitch and roll losses oscillate, sometimes even above 100.
[screenshot: training loss curves]
Yes, I will try TensorBoard later. Did you test the trained model on AFLW2000 yourself, and how was the performance?

@developer-mayuan
Author

@kalyo-zjl I just edited my previous response; my loss for each angle is around 15. The training procedure is not very consistent; sometimes you can already get a very low yaw loss after just 200 iterations.

@natanielruiz
Owner

natanielruiz commented Mar 1, 2018

@developer-mayuan @kalyo-zjl @numitors I fixed a bug in the training code that, combined with the new PyTorch update, was causing training to be very unstable. Please try it again now.

@abhigoku10

@developer-mayuan @kalyo-zjl Hi guys, I need your help with the following points, please:

  1. Is there any reference document on training with a custom dataset?
  2. Can we use this code to train on a custom dataset? Could you please provide some pointers? I am a bit confused about the required dataset format: I have the images, but what kind of annotation file should I have? (See the sketch after this list.)
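To make the question concrete, here is a hedged sketch of one possible setup. It assumes a hypothetical annotations.csv with one line per image in the form relative/path.jpg,yaw,pitch,roll (angles in degrees). That is not the format this repository actually reads (300W_LP ships its pose parameters in .mat files); it only illustrates the pieces a PyTorch Dataset for head pose needs to return:

import csv
import os

import torch
from PIL import Image
from torch.utils.data import Dataset

class CustomPoseDataset(Dataset):
    """Hypothetical loader: image plus binned and continuous yaw/pitch/roll labels."""

    def __init__(self, root, csv_file, transform=None):
        self.root = root
        self.transform = transform
        with open(csv_file) as f:
            self.items = [(row[0], float(row[1]), float(row[2]), float(row[3]))
                          for row in csv.reader(f)]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        path, yaw, pitch, roll = self.items[idx]
        img = Image.open(os.path.join(self.root, path)).convert("RGB")
        if self.transform is not None:
            img = self.transform(img)
        cont = torch.tensor([yaw, pitch, roll], dtype=torch.float32)
        # Bin the continuous angles into 66 bins of 3 degrees covering -99..+99.
        bins = torch.clamp(((cont + 99.0) / 3.0).floor().long(), 0, 65)
        return img, bins, cont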

@wanjinchang

@developer-mayuan I would like to know the performance you got when testing on AFLW2000. I reproduced this project in TensorFlow; the loss for each of the three angles is about 2, but when I test on the AFLW2000 dataset the MAE is about 17, so I cannot reach the performance described in the paper.
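For reference, the number usually reported on AFLW2000 is the mean absolute error per angle in degrees. A minimal sketch of that metric (the arrays are made-up values with shape [N, 3] for yaw/pitch/roll):

import numpy as np

def mae_per_angle(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Mean absolute error for yaw, pitch and roll, in degrees."""
    return np.mean(np.abs(pred - gt), axis=0)

pred = np.array([[10.0, -5.0, 2.0], [30.0, 12.0, -8.0]])
gt = np.array([[12.0, -4.0, 1.0], [27.0, 15.0, -6.0]])
print(mae_per_angle(pred, gt))  # -> [2.5 2.  1.5]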

@tx1994108

(Quoting @kalyo-zjl's earlier comment:) Hi @developer-mayuan, thanks for your reply. What do you mean by "around 50 for each degree"? My training loss is as below. As you can see, the pitch and roll losses sometimes even exceed 100. Yes, I will try TensorBoard later. Did you test the trained model on AFLW2000 yourself, and how was the performance?
[screenshot]

@kalyo-zjl Hello. I would like to know what training parameters you used. After training for 30 epochs, my losses are still around 10. Thank you!
