Runtime error when training your Hopenet #6

Closed
developer-mayuan opened this issue Dec 14, 2017 · 14 comments

@developer-mayuan

Hi natanielruiz:

First, I want to say thank you for your great work! I tested your pretrained model on my own dataset and it works great; the results are accurate and robust. Now I would like to fine-tune your network on my own dataset, but I found I cannot.

I prepared the 300W_LP dataset and generated the file list based on the expected input of your code. (By the way, maybe you could provide the file-list generation code in your repository, which would make it self-contained.)

Then, when I ran your train_hopenet.py code, it would sometimes run for 1 or 2 epochs, but it would always eventually give me the following error message:

Loading data.
Ready to train network.
Epoch [1/5], Iter [100/7653] Losses: Yaw 4.5354, Pitch 4.0671, Roll 4.2844
/opt/conda/conda-bld/pytorch_1503966894950/work/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [11,0,0] Assertion `t >= 0 && t < n_classes` failed.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1503966894950/work/torch/lib/THCUNN/generic/ClassNLLCriterion.cu line=87 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "/home/foo/Academy/deep-head-pose/code/train_hopenet.py", line 166, in <module>
    alpha = args.alpha
  File "/home/foo/Ordnance/anaconda2/envs/Hopenet/lib/python2.7/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/foo/Ordnance/anaconda2/envs/Hopenet/lib/python2.7/site-packages/torch/nn/modules/loss.py", line 482, in forward
    self.ignore_index)
  File "/home/foo/Ordnance/anaconda2/envs/Hopenet/lib/python2.7/site-packages/torch/nn/functional.py", line 746, in cross_entropy
    return nll_loss(log_softmax(input), target, weight, size_average, ignore_index)
  File "/home/foo/Ordnance/anaconda2/envs/Hopenet/lib/python2.7/site-packages/torch/nn/functional.py", line 672, in nll_loss
    return _functions.thnn.NLLLoss.apply(input, target, weight, size_average, ignore_index)
  File "/home/foo/Ordnance/anaconda2/envs/Hopenet/lib/python2.7/site-packages/torch/nn/_functions/thnn/auto.py", line 47, in forward
    output, *ctx.additional_args)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1503966894950/work/torch/lib/THCUNN/generic/ClassNLLCriterion.cu:87

I did some searching, and the most promising answer is at the following link:
https://discuss.pytorch.org/t/runtimeerror-cuda-runtime-error-59-device-side-assert-triggered-at-opt-conda-conda-bld-pytorch-1503970438496-work-torch-lib-thc-generic-thcstorage-c-32/9669/5

It seems that in some cases the target is out of bounds for the output. The following is my running environment:

Python 2.7.14 (with Anaconda)
Using conda virtual environment
pytorch 0.2.0 py27hc03bea1_4cu80 [cuda80] soumith
torchvision 0.1.9 py27hdb88a65_1 soumith

I would like to know if you have met this kind of problem before, and whether you can give me some ideas about how to solve it. Thank you very much for your help!

@MichaelYSC

@developer-mayuan You should filter the 300W_LP dataset to make sure the three angles in the train list are between -99° and 99°. (A minimal sketch of such a filter is below.)
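
A minimal sketch of that filtering, assuming the 300W_LP .mat annotations store pitch/yaw/roll in radians in the Pose_Para field (as the dataset code reads them); the file names and folder layout here are placeholders:

import numpy as np
import scipy.io as sio

def pose_in_range(mat_path, limit=99.0):
    # Pose_Para holds [pitch, yaw, roll, ...] in radians; convert to degrees.
    pose = sio.loadmat(mat_path)['Pose_Para'][0][:3] * 180.0 / np.pi
    return np.all(np.abs(pose) < limit)

with open('300W_LP_filename.txt') as f:
    names = [line.strip() for line in f if line.strip()]
with open('300W_LP_filename_filtered.txt', 'w') as out:
    for name in names:
        if pose_in_range('300W_LP/%s.mat' % name):
            out.write(name + '\n')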

@developer-mayuan
Author

@MichaelYSC Thank you very much for your help! I don't think that is the cause of the problem, though. In dataset.py, the following code handles the binning:

bins = np.array(range(-99, 102, 3))
binned_pose = np.digitize([yaw, pitch, roll], bins) - 1

According to the API manual for np.digitize(), any value beyond the boundaries is handled automatically:

numpy.digitize(x, bins, right=False)
Return the indices of the bins to which each value in input array belongs.

Each index i returned is such that bins[i-1] <= x < bins[i] if bins is monotonically increasing, or bins[i-1] > x >= bins[i] if bins is monotonically decreasing. If values in x are beyond the bounds of bins, 0 or len(bins) is returned as appropriate. If right is True, then the right bin is closed so that the index i is such that bins[i-1] < x <= bins[i] or bins[i-1] >= x > bins[i] if bins is monotonically increasing or decreasing, respectively.

But I will check it anyway. Thank you very much for the suggestion!

@MichaelYSC

@developer-mayuan As your quote says, np.digitize() returns 0 if x < bins[0], so binned_pose = np.digitize(...) - 1 yields -1, which is then passed to CrossEntropyLoss. But CrossEntropyLoss expects a class index in [0, C-1], so that is the problem.
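
A quick repro of that off-by-one (a minimal sketch; the 66-class count follows from the 67 bin edges above):

import numpy as np
import torch
import torch.nn as nn

bins = np.array(range(-99, 102, 3))    # 67 edges -> 66 classes (0..65)
yaw = -105.0                           # angle below bins[0]
target = np.digitize([yaw], bins) - 1  # np.digitize returns 0, so this is -1
print(target)                          # [-1]

criterion = nn.CrossEntropyLoss()
logits = torch.randn(1, 66)
# criterion(logits, torch.from_numpy(target))  # raises: target out of bounds
# (on the GPU this surfaces as the device-side assert in the traceback above)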

@ytgcljj

ytgcljj commented Dec 15, 2017

@developer-mayuan
I want to ask how you run it. Could you write out the specific command for me? I'm not sure how to fill in the parameters. Thanks.
python code/test_on_video_dlib.py --snapshot PATH_OF_SNAPSHOT --face_model PATH_OF_DLIB_MODEL --video PATH_OF_VIDEO --output_string STRING_TO_APPEND_TO_OUTPUT --n_frames N_OF_FRAMES_TO_PROCESS --fps FPS_OF_SOURCE_VIDEO

@developer-mayuan
Copy link
Author

developer-mayuan commented Dec 15, 2017

@ytgcljj The dlib version had some problems when I ran it, so I wrote my own version with the same input format. Anyway, you can give the original code a try. Here is the command line:
python code/test_on_video_dlib.py --snapshot snapshots/hopenet_alpha1.pkl --face_model models/mmod_human_face_detector.dat --video PATH_OF_VIDEO --output_string "something to append to the output video and result file names" --n_frames 10 --fps 30

(--n_frames and --fps can take other values, but you cannot leave them empty.)

You can follow my folder structure, as in the image below. You need to download the dlib face detector model and the hopenet model from the links given in the readme file.

[screenshot: example folder structure]

You can also refer to my code, which modifies the dlib+hopenet code.

@developer-mayuan
Author

@MichaelYSC Yes, you are right. I will think about how to fix it. Thank you very much for your help!

@developer-mayuan
Author

@MichaelYSC After filtering out the data with poses outside [-99, 99], the training program runs smoothly. But I still have a problem: the loss at the beginning is extremely large. For example:

Epoch [1/5], Iter [2300/7650] Losses: Yaw 1590537547035833270272.0000, Pitch 1824143341094703202304.0000, Roll 8263841855646658985984.0000

I would like to know if the loss at the beginning looks normal or not in your case. Thanks.

@MichaelYSC

@developer-mayuan Did you train on the 300W_LP dataset? It looks normal in my case.

@developer-mayuan
Author

@MichaelYSC Yes, I did train on 300W_LP, but I trained from scratch. Could this be the reason?

@MichaelYSC

@developer-mayuan I am not sure. Maybe we should wait for natanielruiz to release the train list file.

@developer-mayuan
Author

@MichaelYSC Yes, let's do that. :)

@natanielruiz
Owner

Hi everyone, I don't have much time until the end of this week, but for now I can say two things:
- The large loss at the beginning of training means the loss is diverging, so lower the learning rate. In the paper I use 1e-5 for all layers and 5e-5 for the fc layers, I believe (see the sketch after this list).
- The cuda problem OP is referencing is in fact caused by angles of more than 99 absolute degrees, and filtering the list should fix it.
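
A minimal sketch of that learning-rate split (the fc_yaw/fc_pitch/fc_roll head names follow the Hopenet model, but the stand-in module and the optimizer choice here are illustrative, not the exact training script):

import torch
import torch.nn as nn

# Stand-in for Hopenet: a backbone plus three classification heads.
class TinyHopenet(nn.Module):
    def __init__(self, num_bins=66):
        super(TinyHopenet, self).__init__()
        self.backbone = nn.Linear(128, 128)
        self.fc_yaw = nn.Linear(128, num_bins)
        self.fc_pitch = nn.Linear(128, num_bins)
        self.fc_roll = nn.Linear(128, num_bins)

model = TinyHopenet()
head_params = (list(model.fc_yaw.parameters())
               + list(model.fc_pitch.parameters())
               + list(model.fc_roll.parameters()))
head_ids = set(id(p) for p in head_params)
base_params = [p for p in model.parameters() if id(p) not in head_ids]

optimizer = torch.optim.Adam([
    {'params': base_params, 'lr': 1e-5},  # all other layers
    {'params': head_params, 'lr': 5e-5},  # fc heads get the higher rate
])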

I will release the 300W-LP list at the end of the week when I get a second. Thank you for your patience!

@developer-mayuan
Author

@natanielruiz Thank you very much for your response. I really appreciate your help!

@natanielruiz
Owner

300W_LP_filename_filtered.txt
AFLW2000_filename_filtered.txt

Here are the filtered filename lists for 300W-LP and AFLW2000. Have fun!
