The loss does not decrease during training #10

Closed
developer-mayuan opened this issue Jan 11, 2018 · 15 comments
@developer-mayuan

Hi natanielruiz:

I have been trying to reproduce your paper's results over the past few days, but I cannot get the loss to decrease when training your model on the 300W_LP dataset. I used the same parameters you provided in your paper:

alpha = 1, lr = 1e-5, and the default parameters for the Adam optimizer.

I ran your network for 25 epochs and the yaw loss oscillates around 3000, which means the MSE loss is still far too large for the yaw angle.

Do you have any idea how to debug the network or solve this issue? Thank you very much for your help!
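For anyone comparing setups, here is a minimal, self-contained sketch of those settings in PyTorch. The model is a toy stand-in rather than the actual ResNet-50 based network, and the combined loss follows my reading of the paper: cross-entropy over binned angles plus an alpha-weighted MSE on the expected angle, assuming 66 bins of 3 degrees covering -99 to +99.

import torch
import torch.nn as nn

# Toy stand-in for a single angle head (yaw); the real network has three heads.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 66))

alpha = 1.0
criterion_cls = nn.CrossEntropyLoss()
criterion_reg = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # default betas=(0.9, 0.999), eps=1e-8
idx_tensor = torch.arange(66, dtype=torch.float32)

def combined_loss(logits, bin_labels, cont_labels):
    """Cross-entropy on the binned angle plus alpha * MSE on the expected angle."""
    loss_cls = criterion_cls(logits, bin_labels)
    probs = torch.softmax(logits, dim=1)
    pred_deg = torch.sum(probs * idx_tensor, dim=1) * 3 - 99  # bin index -> degrees
    return loss_cls + alpha * criterion_reg(pred_deg, cont_labels)

# One toy training step with random data, just to show the wiring.
images = torch.randn(8, 3, 64, 64)
bin_labels = torch.randint(0, 66, (8,))
cont_labels = bin_labels.float() * 3 - 99
loss = combined_loss(model(images), bin_labels, cont_labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()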

@natanielruiz
Owner

This is very strange.

If the loss doesn't go down in the first epoch it won't go down afterwards. Try lower learning rates; it looks like training is diverging. The learning rate you're using should be the correct one, so this is very strange. If anyone else runs into the same problem, please report it here.

@numitors

@developer-mayuan I just managed to run the training without crashing (using the fix you pointed out), but I seem to be running into a similar problem. This is the output I get:

python2 code/train_hopenet.py --output_string=mine --data_dir=300W_LP --filename_list=300W_LP_filename_filtered.txt --alpha=1 --lr=0.00001
Loading data.
Ready to train network.
Epoch [1/5], Iter [100/7650] Losses: Yaw 2259.5078, Pitch 240.8887, Roll 393.4849
Epoch [1/5], Iter [200/7650] Losses: Yaw 980.5406, Pitch 197.7521, Roll 250.1536
Epoch [1/5], Iter [300/7650] Losses: Yaw 464.2857, Pitch 79.0749, Roll 81.2777
Epoch [1/5], Iter [400/7650] Losses: Yaw 277.2566, Pitch 81.5020, Roll 43.5987
Epoch [1/5], Iter [500/7650] Losses: Yaw 340.9163, Pitch 74.6796, Roll 27.2036
Epoch [1/5], Iter [600/7650] Losses: Yaw 189.3556, Pitch 77.7477, Roll 33.9641
Epoch [1/5], Iter [700/7650] Losses: Yaw 198.5089, Pitch 77.3002, Roll 133.2880
Epoch [1/5], Iter [800/7650] Losses: Yaw 190.2300, Pitch 31.1991, Roll 77.3836
Epoch [1/5], Iter [900/7650] Losses: Yaw 159.5810, Pitch 79.7303, Roll 241.9788
Epoch [1/5], Iter [1000/7650] Losses: Yaw 208.6988, Pitch 46.0669, Roll 69.5470
Epoch [1/5], Iter [1100/7650] Losses: Yaw 95.6048, Pitch 49.7742, Roll 87.0339
Epoch [1/5], Iter [1200/7650] Losses: Yaw 92.9210, Pitch 42.5244, Roll 81.0443
Epoch [1/5], Iter [1300/7650] Losses: Yaw 120.5021, Pitch 44.2593, Roll 122.6854
Epoch [1/5], Iter [1400/7650] Losses: Yaw 479.7826, Pitch 24.0390, Roll 76.1425
Epoch [1/5], Iter [1500/7650] Losses: Yaw 44.8623, Pitch 44.0322, Roll 73.1824
Epoch [1/5], Iter [1600/7650] Losses: Yaw 45.7050, Pitch 26.3968, Roll 118.1652
Epoch [1/5], Iter [1700/7650] Losses: Yaw 101.2617, Pitch 48.9110, Roll 165.2226
Epoch [1/5], Iter [1800/7650] Losses: Yaw 101.9753, Pitch 159.2361, Roll 324.3020
Epoch [1/5], Iter [1900/7650] Losses: Yaw 45.1358, Pitch 44.4955, Roll 116.9922
Epoch [1/5], Iter [2000/7650] Losses: Yaw 108.4551, Pitch 35.4727, Roll 105.5859
Epoch [1/5], Iter [2100/7650] Losses: Yaw 39.3804, Pitch 28.0998, Roll 150.7015
Epoch [1/5], Iter [2200/7650] Losses: Yaw 16.8714, Pitch 23.9571, Roll 61.9345
Epoch [1/5], Iter [2300/7650] Losses: Yaw 79.7481, Pitch 18.3805, Roll 58.4298

It just keeps oscillating. I have also tried different values for alpha and lr, but none of those seem to yield any meaningful outcome.

@developer-mayuan
Author

@numitors My loss is much larger than yours. You could wait a few more epochs to see whether the yaw loss drops. In my case the yaw loss oscillates around 3000, which means the network has learned essentially nothing...

@developer-mayuan
Author

I found a bug in my modified code; the training works now. I will close this issue.

@numitors

@developer-mayuan What was your issue exactly? I also see the behavior you are describing, which probably just depends on the initialization.


Epoch [1/5], Iter [100/7650] Losses: Yaw 2680.6248, Pitch 170.9742, Roll 140.1705
Epoch [1/5], Iter [200/7650] Losses: Yaw 2461.8210, Pitch 255.6145, Roll 112.2997
Epoch [1/5], Iter [300/7650] Losses: Yaw 3699.8623, Pitch 236.5356, Roll 362.8195
Epoch [1/5], Iter [400/7650] Losses: Yaw 2488.0469, Pitch 184.1573, Roll 118.9639
Epoch [1/5], Iter [500/7650] Losses: Yaw 2107.6182, Pitch 151.6767, Roll 173.3512
Epoch [1/5], Iter [600/7650] Losses: Yaw 3123.3323, Pitch 216.2974, Roll 147.0356
Epoch [1/5], Iter [700/7650] Losses: Yaw 2840.7883, Pitch 179.2066, Roll 212.2754
Epoch [1/5], Iter [800/7650] Losses: Yaw 3289.1289, Pitch 175.0063, Roll 111.8380
Epoch [1/5], Iter [900/7650] Losses: Yaw 2591.9697, Pitch 121.3488, Roll 77.0318
Epoch [1/5], Iter [1000/7650] Losses: Yaw 3115.7188, Pitch 558.8232, Roll 383.3249
Epoch [1/5], Iter [1100/7650] Losses: Yaw 3913.2673, Pitch 405.8267, Roll 236.7603
Epoch [1/5], Iter [1200/7650] Losses: Yaw 3796.3154, Pitch 182.6230, Roll 272.9708
Epoch [1/5], Iter [1300/7650] Losses: Yaw 3453.5103, Pitch 193.3578, Roll 103.1721
Epoch [1/5], Iter [1400/7650] Losses: Yaw 3417.9277, Pitch 119.0170, Roll 179.9921
Epoch [1/5], Iter [1500/7650] Losses: Yaw 2122.2136, Pitch 278.0559, Roll 233.2100
Epoch [1/5], Iter [1600/7650] Losses: Yaw 2530.1829, Pitch 121.5356, Roll 66.9388
Epoch [1/5], Iter [1700/7650] Losses: Yaw 3053.0569, Pitch 215.7416, Roll 111.5422

And it goes like that on and on.

@developer-mayuan
Author

@numitors I think this issue is indeed related to the initialization. You may need to try several times.
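If training really is that sensitive to the initialization, it can also help to fix the random seeds so that a run which converges is reproducible instead of a matter of luck. A minimal sketch (full determinism additionally depends on cuDNN settings):

import random

import numpy as np
import torch

def set_seed(seed: int) -> None:
    """Seed the RNGs that affect weight initialization and data shuffling."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)  # re-run with a different seed if the first-epoch loss diverges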

@kalyo-zjl

Hi @developer-mayuan,

What results did you get when training on the 300W_LP dataset yourself? I managed to train the model, but I am facing the same problem you did, i.e. training is sensitive to the initialization.
After trying several times, the network finally starts to converge. However, even after 13 epochs the loss still oscillates badly (the yaw loss is relatively stable).
I also noticed that the learning rate does not decrease at all during training, is that intended? @natanielruiz

@developer-mayuan
Author

developer-mayuan commented Feb 13, 2018

@kalyo-zjl I can get my losses very low after 25 epochs (around 15 for each angle). You could try decreasing the learning rate, e.g. every 15 epochs, to see if that helps.
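A minimal sketch of that idea with PyTorch's built-in StepLR scheduler; the model here is only a placeholder, not the actual pose network:

import torch
import torch.nn as nn

model = nn.Linear(10, 3)  # placeholder for the pose network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.1)

for epoch in range(25):
    # ... run one training epoch with `optimizer` here ...
    optimizer.step()   # stands in for the per-batch updates of a real epoch
    scheduler.step()   # multiplies the learning rate by 0.1 every 15 epochs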

@developer-mayuan
Author

@kalyo-zjl By the way, I suggest using TensorBoard to visualize the loss instead of just printing it to the console.
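A minimal sketch of what that could look like with a SummaryWriter (this uses torch.utils.tensorboard; the older tensorboardX package has the same add_scalar API; the values below are just the first few losses from the log above):

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/hopenet")  # hypothetical log directory

# In the real training loop these would be the per-iteration yaw/pitch/roll losses.
example_losses = [(2259.5, 240.9, 393.5), (980.5, 197.8, 250.2), (464.3, 79.1, 81.3)]
for step, (loss_yaw, loss_pitch, loss_roll) in enumerate(example_losses):
    writer.add_scalar("loss/yaw", loss_yaw, step)
    writer.add_scalar("loss/pitch", loss_pitch, step)
    writer.add_scalar("loss/roll", loss_roll, step)
writer.close()

# Then inspect with: tensorboard --logdir runs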

@kalyo-zjl

Hi @developer-mayuan,
Thanks for your reply.
What do you mean by "around 50 for each degree"? My training loss is shown below. As you can see, the pitch and roll losses oscillate, sometimes even above 100.
[screenshot: training loss curves]
Yes, I will try TensorBoard later. Did you test the trained model on AFLW2000 yourself, and how was the performance?

@developer-mayuan
Author

@kalyo-zjl I just edited my previous response; my loss for each angle is around 15. The training procedure is not very consistent; sometimes you can already get a very low yaw loss after just 200 iterations.

@natanielruiz
Owner

natanielruiz commented Mar 1, 2018

@developer-mayuan @kalyo-zjl @numitors I fixed a bug in the training code that, combined with the new PyTorch update, was causing training to be very unstable. Please try it again now.

@abhigoku10

@developer-mayuan @kalyo-zjl Hi guys, I need your help with the following points, please:

  1. Is there any reference document on training with a custom dataset?
  2. Can we use this code to train on a custom dataset? Could you please provide some pointers? I am a bit confused about the required dataset format: I have the images, but what kind of annotation file should I have? (See the sketch after this list.)
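To make the question concrete, here is a hedged sketch of one possible setup. It assumes a hypothetical annotations.csv with one line per image in the form relative/path.jpg,yaw,pitch,roll (angles in degrees). That is not the format this repository actually reads (300W_LP ships its pose parameters in .mat files); it only illustrates the pieces a PyTorch Dataset for head pose needs to return:

import csv
import os

import torch
from PIL import Image
from torch.utils.data import Dataset

class CustomPoseDataset(Dataset):
    """Hypothetical loader: image plus binned and continuous yaw/pitch/roll labels."""

    def __init__(self, root, csv_file, transform=None):
        self.root = root
        self.transform = transform
        with open(csv_file) as f:
            self.items = [(row[0], float(row[1]), float(row[2]), float(row[3]))
                          for row in csv.reader(f)]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        path, yaw, pitch, roll = self.items[idx]
        img = Image.open(os.path.join(self.root, path)).convert("RGB")
        if self.transform is not None:
            img = self.transform(img)
        cont = torch.tensor([yaw, pitch, roll], dtype=torch.float32)
        # Bin the continuous angles into 66 bins of 3 degrees covering -99..+99.
        bins = torch.clamp(((cont + 99.0) / 3.0).floor().long(), 0, 65)
        return img, bins, cont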

@wanjinchang

@developer-mayuan I would like to know the performance you got when testing on AFLW2000. I reproduced this project in TensorFlow; the loss for each of the three angles is about 2, but when I test on the AFLW2000 dataset the MAE is about 17, so I cannot reach the performance described in the paper.
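For reference, the number usually reported on AFLW2000 is the mean absolute error per angle in degrees. A minimal sketch of that metric (the arrays are made-up values with shape [N, 3] for yaw/pitch/roll):

import numpy as np

def mae_per_angle(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Mean absolute error for yaw, pitch and roll, in degrees."""
    return np.mean(np.abs(pred - gt), axis=0)

pred = np.array([[10.0, -5.0, 2.0], [30.0, 12.0, -8.0]])
gt = np.array([[12.0, -4.0, 1.0], [27.0, 15.0, -6.0]])
print(mae_per_angle(pred, gt))  # -> [2.5 2.  1.5]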

@tx1994108

(Quoting @kalyo-zjl's earlier comment:) Hi @developer-mayuan, thanks for your reply. What do you mean by "around 50 for each degree"? My training loss is as below. As you can see, the pitch and roll losses sometimes even exceed 100. Yes, I will try TensorBoard later. Did you test the trained model on AFLW2000 yourself, and how was the performance?
[screenshot]

@kalyo-zjl Hello. I would like to know what training parameters you used. After training for 30 epochs, my losses are still around 10. Thank you!
