Runtime error when training your Hopenet #6

Closed
developer-mayuan opened this issue Dec 14, 2017 · 14 comments

@developer-mayuan

Hi natanielruiz:

First, I want to say thank you for your great work! I tested your pretrained model on my own dataset and it works great; the results are accurate and robust. Now I would like to fine-tune your network on my own dataset, but I found I cannot.

I prepared the 300W_LP dataset and generated the file list based on the expected input of your code. (By the way, maybe you could provide the file-list generation code in your repository, which would make it self-contained.)

Then, when I ran your train_hopenet.py code, it would sometimes run for 1 or 2 epochs, but it would always eventually give me the following error message:

Loading data.
Ready to train network.
Epoch [1/5], Iter [100/7653] Losses: Yaw 4.5354, Pitch 4.0671, Roll 4.2844
/opt/conda/conda-bld/pytorch_1503966894950/work/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [11,0,0] Assertion `t >= 0 && t < n_classes` failed.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1503966894950/work/torch/lib/THCUNN/generic/ClassNLLCriterion.cu line=87 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "/home/foo/Academy/deep-head-pose/code/train_hopenet.py", line 166, in <module>
    alpha = args.alpha
  File "/home/foo/Ordnance/anaconda2/envs/Hopenet/lib/python2.7/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/foo/Ordnance/anaconda2/envs/Hopenet/lib/python2.7/site-packages/torch/nn/modules/loss.py", line 482, in forward
    self.ignore_index)
  File "/home/foo/Ordnance/anaconda2/envs/Hopenet/lib/python2.7/site-packages/torch/nn/functional.py", line 746, in cross_entropy
    return nll_loss(log_softmax(input), target, weight, size_average, ignore_index)
  File "/home/foo/Ordnance/anaconda2/envs/Hopenet/lib/python2.7/site-packages/torch/nn/functional.py", line 672, in nll_loss
    return _functions.thnn.NLLLoss.apply(input, target, weight, size_average, ignore_index)
  File "/home/foo/Ordnance/anaconda2/envs/Hopenet/lib/python2.7/site-packages/torch/nn/_functions/thnn/auto.py", line 47, in forward
    output, *ctx.additional_args)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1503966894950/work/torch/lib/THCUNN/generic/ClassNLLCriterion.cu:87

I did some searching, and the most promising answer is at the following link:
https://discuss.pytorch.org/t/runtimeerror-cuda-runtime-error-59-device-side-assert-triggered-at-opt-conda-conda-bld-pytorch-1503970438496-work-torch-lib-thc-generic-thcstorage-c-32/9669/5

It seems that in some cases the target is out of bounds for the output. The following is my running environment:

Python 2.7.14 (with Anaconda)
Using conda virtual environment
pytorch 0.2.0 py27hc03bea1_4cu80 [cuda80] soumith
torchvision 0.1.9 py27hdb88a65_1 soumith

I would like to know if you have met this kind of problem before, and whether you can give me some ideas about how to solve it. Thank you very much for your help!

@MichaelYSC

@developer-mayuan You should filter the 300W_LP dataset to make sure the three angles in the train list are between -99° and 99°. (A minimal sketch of such a filter is below.)
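
A minimal sketch of that filtering, assuming the 300W_LP .mat annotations store pitch/yaw/roll in radians in the Pose_Para field (as the dataset code reads them); the file names and folder layout here are placeholders:

import numpy as np
import scipy.io as sio

def pose_in_range(mat_path, limit=99.0):
    # Pose_Para holds [pitch, yaw, roll, ...] in radians; convert to degrees.
    pose = sio.loadmat(mat_path)['Pose_Para'][0][:3] * 180.0 / np.pi
    return np.all(np.abs(pose) < limit)

with open('300W_LP_filename.txt') as f:
    names = [line.strip() for line in f if line.strip()]
with open('300W_LP_filename_filtered.txt', 'w') as out:
    for name in names:
        if pose_in_range('300W_LP/%s.mat' % name):
            out.write(name + '\n')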

@developer-mayuan
Author

@MichaelYSC Thank you very much for your help! I don't think that is the cause of the problem, though. In dataset.py, the following code handles the binning:

bins = np.array(range(-99, 102, 3))
binned_pose = np.digitize([yaw, pitch, roll], bins) - 1

According to the API manual for np.digitize(), any value beyond the boundaries is handled automatically:

numpy.digitize(x, bins, right=False)
Return the indices of the bins to which each value in input array belongs.

Each index i returned is such that bins[i-1] <= x < bins[i] if bins is monotonically increasing, or bins[i-1] > x >= bins[i] if bins is monotonically decreasing. If values in x are beyond the bounds of bins, 0 or len(bins) is returned as appropriate. If right is True, then the right bin is closed so that the index i is such that bins[i-1] < x <= bins[i] or bins[i-1] >= x > bins[i] if bins is monotonically increasing or decreasing, respectively.

But I will check it anyway. Thank you very much for the suggestion!

@MichaelYSC

@developer-mayuan As your quote says, np.digitize() returns 0 if x < bins[0], so binned_pose = np.digitize(...) - 1 yields -1, which is then passed to CrossEntropyLoss. But CrossEntropyLoss expects a class index in [0, C-1], so that is the problem.
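
A quick repro of that off-by-one (a minimal sketch; the 66-class count follows from the 67 bin edges above):

import numpy as np
import torch
import torch.nn as nn

bins = np.array(range(-99, 102, 3))    # 67 edges -> 66 classes (0..65)
yaw = -105.0                           # angle below bins[0]
target = np.digitize([yaw], bins) - 1  # np.digitize returns 0, so this is -1
print(target)                          # [-1]

criterion = nn.CrossEntropyLoss()
logits = torch.randn(1, 66)
# criterion(logits, torch.from_numpy(target))  # raises: target out of bounds
# (on the GPU this surfaces as the device-side assert in the traceback above)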

@ytgcljj

ytgcljj commented Dec 15, 2017

@developer-mayuan
I want to ask how you run it. Could you write out the specific command for me? I'm not sure how to fill in the parameters. Thanks.
python code/test_on_video_dlib.py --snapshot PATH_OF_SNAPSHOT --face_model PATH_OF_DLIB_MODEL --video PATH_OF_VIDEO --output_string STRING_TO_APPEND_TO_OUTPUT --n_frames N_OF_FRAMES_TO_PROCESS --fps FPS_OF_SOURCE_VIDEO

@developer-mayuan
Copy link
Author

developer-mayuan commented Dec 15, 2017

@ytgcljj The dlib version had some problems when I ran it, so I wrote my own version with the same input format. Anyway, you can give the original code a try. Here is the command line:
python code/test_on_video_dlib.py --snapshot snapshots/hopenet_alpha1.pkl --face_model models/mmod_human_face_detector.dat --video PATH_OF_VIDEO --output_string "something to append to the output video and result file names" --n_frames 10 --fps 30

(--n_frames and --fps can take other values, but you cannot leave them empty.)

You can follow my folder structure, as in the image below. You need to download the dlib face detector model and the hopenet model from the links given in the readme file.

[screenshot: example folder structure]

You can also refer to my code, which modifies the dlib+hopenet code.

@developer-mayuan
Author

@MichaelYSC Yes, you are right. I will think about how to fix it. Thank you very much for your help!

@developer-mayuan
Author

@MichaelYSC After filtering out the data with poses outside [-99, 99], the training program runs smoothly. But I still have a problem: the loss at the beginning is extremely large. For example:

Epoch [1/5], Iter [2300/7650] Losses: Yaw 1590537547035833270272.0000, Pitch 1824143341094703202304.0000, Roll 8263841855646658985984.0000

I would like to know if the loss at the beginning looks normal or not in your case. Thanks.

@MichaelYSC

@developer-mayuan Did you train on the 300W_LP dataset? It looks normal in my case.

@developer-mayuan
Author

@MichaelYSC Yes, I did train on 300W_LP, but I trained from scratch. Could this be the reason?

@MichaelYSC

@developer-mayuan I am not sure. Maybe we should wait for natanielruiz to release the train list file.

@developer-mayuan
Author

@MichaelYSC Yes, let's do that. :)

@natanielruiz
Owner

Hi everyone, I don't have much time until the end of this week, but for now I can say two things:
- The large loss at the beginning of training means the loss is diverging, so lower the learning rate. In the paper I use 1e-5 for all layers and 5e-5 for the fc layers, I believe (see the sketch after this list).
- The cuda problem OP is referencing is in fact caused by angles of more than 99 absolute degrees, and filtering the list should fix it.
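
A minimal sketch of that learning-rate split (the fc_yaw/fc_pitch/fc_roll head names follow the Hopenet model, but the stand-in module and the optimizer choice here are illustrative, not the exact training script):

import torch
import torch.nn as nn

# Stand-in for Hopenet: a backbone plus three classification heads.
class TinyHopenet(nn.Module):
    def __init__(self, num_bins=66):
        super(TinyHopenet, self).__init__()
        self.backbone = nn.Linear(128, 128)
        self.fc_yaw = nn.Linear(128, num_bins)
        self.fc_pitch = nn.Linear(128, num_bins)
        self.fc_roll = nn.Linear(128, num_bins)

model = TinyHopenet()
head_params = (list(model.fc_yaw.parameters())
               + list(model.fc_pitch.parameters())
               + list(model.fc_roll.parameters()))
head_ids = set(id(p) for p in head_params)
base_params = [p for p in model.parameters() if id(p) not in head_ids]

optimizer = torch.optim.Adam([
    {'params': base_params, 'lr': 1e-5},  # all other layers
    {'params': head_params, 'lr': 5e-5},  # fc heads get the higher rate
])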

I will release the 300W-LP list at the end of the week when I get a second. Thank you for your patience!

@developer-mayuan
Author

@natanielruiz Thank you very much for your response. I really appreciate your help!

@natanielruiz
Owner

300W_LP_filename_filtered.txt
AFLW2000_filename_filtered.txt

Here are the filtered filename lists for 300W-LP and AFLW2000. Have fun!
