Question regarding age training #211

Open
zevele opened this issue Jul 31, 2022 · 18 comments

@zevele

zevele commented Jul 31, 2022

I'm trying to do the age training now that bug #206 has been resolved. But it's very slow: after 24 hours I'm on step 83 and epoch 1. At this rate the training will take about two years to complete (to epoch 600)... Is it going to get faster? How long is the training supposed to take?

@takuya-takeuchi
Owner

@zevele
Please tell me

  • OS
  • Whether a GPU is available
    • If available, which FRDN package do you use? (e.g. CUDA102)
  • GPU model (e.g. 1080 Ti)

I may be able to reproduce your issue in a Docker container.

@zevele
Author

zevele commented Aug 2, 2022

@zevele Please tell me

Thanks @takuya-takeuchi !

OS - Windows 10
"GPU is available or not" - How do I check?
GPU - NVIDIA GeForce GTX 1050

Is there anything I can do to help debug this on my system?

@takuya-takeuchi
Owner

You can check whether the GPU is being used by

  • Checking the GPU tab in Task Manager
  • Using nvidia-smi.exe

Just in case, are you using FaceRecognitionDotNet.CUDAXXX?
The plain FaceRecognitionDotNet package, unlike FaceRecognitionDotNet.CUDAXXX, does not use the GPU.
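
The nvidia-smi check can also be scripted so that it runs alongside the trainer. The following is a minimal sketch, not part of FaceRecognitionDotNet: the class name `GpuCheck` and the helper `ParseLine` are hypothetical, and it assumes `nvidia-smi` is on the PATH and reports numeric values (some GPUs report `[N/A]`, which this sketch does not handle).

```csharp
using System;
using System.Diagnostics;
using System.Globalization;

public static class GpuCheck
{
    // Parses one CSV line such as "37, 1024" (utilization in %, memory in MiB).
    // Hypothetical helper; throws FormatException on anything else.
    public static (int UtilizationPercent, int MemoryUsedMiB) ParseLine(string line)
    {
        var parts = line.Split(',');
        if (parts.Length != 2)
            throw new FormatException($"Unexpected nvidia-smi output: '{line}'");
        return (int.Parse(parts[0].Trim(), CultureInfo.InvariantCulture),
                int.Parse(parts[1].Trim(), CultureInfo.InvariantCulture));
    }

    public static void Main()
    {
        var psi = new ProcessStartInfo
        {
            // Assumes nvidia-smi can be found via PATH.
            FileName = "nvidia-smi",
            Arguments = "--query-gpu=utilization.gpu,memory.used --format=csv,noheader,nounits",
            RedirectStandardOutput = true,
            UseShellExecute = false
        };
        using (var process = Process.Start(psi))
        {
            string output = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            // One line per GPU.
            foreach (var line in output.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries))
            {
                var (util, mem) = ParseLine(line);
                Console.WriteLine($"GPU utilization: {util}%, memory used: {mem} MiB");
            }
        }
    }
}
```

If utilization stays at 0% while the trainer is running, the process is most likely running on the CPU only.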

@zevele
Author

zevele commented Aug 2, 2022

You can check whether gpu is running or not by

I ran the training again and checked the GPU utilization, and it's not using the GPU. I'm following the wiki on training the model and running the following command (as per the instructions):
dotnet add package FaceRecognitionDotNet
I also tried
dotnet add package FaceRecognitionDotNet.CUDA92
But GPU utilization is still zero.

I also see you wrote there:
You should build DlibDotNetNative.dll, DlibDotNetNativeDnn.dll and DlibDotNetNativeDnnAgeClassification.dll with CUDA.
How do I compile them with CUDA? How do I use FaceRecognitionDotNet.CUDAXXX?

@takuya-takeuchi
Owner

Which CUDA version is installed on your machine?
If you installed CUDA 11.2, you must install FaceRecognitionDotNet.CUDA112.
You must use the FaceRecognitionDotNet package that corresponds to the installed CUDA version.
You must also install cuDNN.
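
The version-to-package rule can be sketched in code. This is only an illustration under two assumptions: that the NVIDIA installer has set the `CUDA_PATH` environment variable to a directory ending in `v<major>.<minor>`, and that the package names follow the `CUDA92` / `CUDA102` / `CUDA112` pattern seen in this thread. The class `CudaPackageHint` is hypothetical, and whether a package actually exists for a given version still has to be checked on NuGet.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CudaPackageHint
{
    // Maps a toolkit directory such as "...\CUDA\v11.2" to "FaceRecognitionDotNet.CUDA112".
    // Returns null when the last path segment does not look like "v<major>.<minor>".
    public static string PackageNameFor(string cudaPath)
    {
        if (string.IsNullOrEmpty(cudaPath)) return null;

        // Take the last path segment, accepting either separator.
        string trimmed = cudaPath.TrimEnd('\\', '/');
        int index = trimmed.LastIndexOfAny(new[] { '\\', '/' });
        string leaf = trimmed.Substring(index + 1);

        var match = Regex.Match(leaf, @"^v(\d+)\.(\d+)$");
        if (!match.Success) return null;
        return "FaceRecognitionDotNet.CUDA" + match.Groups[1].Value + match.Groups[2].Value;
    }

    public static void Main()
    {
        // Assumption: CUDA_PATH points at the installed toolkit.
        string cudaPath = Environment.GetEnvironmentVariable("CUDA_PATH");
        string package = PackageNameFor(cudaPath);
        Console.WriteLine(package == null
            ? "CUDA_PATH is not set or has an unexpected form."
            : $"CUDA_PATH={cudaPath} suggests: dotnet add package {package}");
    }
}
```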

@zevele
Author

zevele commented Aug 3, 2022

Which cuda version do you install in your machine? If install CUDA 11.2, you must install...

Thanks! I installed CUDA 11.2 and downloaded cuDNN for CUDA 11.2. Then I added the path to cuDNN to the environment variables.
I added both FaceRecognitionDotNet and FaceRecognitionDotNet.CUDA112:
dotnet add package FaceRecognitionDotNet
dotnet add package FaceRecognitionDotNet.CUDA112

Rebuilt the project using:
dotnet build -c Release
I also tried copying the DLLs from the cuDNN folder to the output folder.

But still zero GPU utilization. What am I missing?

@takuya-takeuchi
Owner

You do not need FaceRecognitionDotNet.
You should install only FaceRecognitionDotNet.CUDA112.
Please uninstall FaceRecognitionDotNet.
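
To verify that a project references only the CUDA package, the `.csproj` can be inspected. The sketch below is a hypothetical helper (`PackageReferenceCheck` is not part of the library). It assumes SDK-style `PackageReference` items and uses `AgeTraining.csproj` as the default file name.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml.Linq;

public static class PackageReferenceCheck
{
    // Returns the Include value of every PackageReference whose id starts with "FaceRecognitionDotNet".
    public static List<string> FaceRecognitionPackages(string csprojXml)
    {
        return XDocument.Parse(csprojXml)
            .Descendants()
            .Where(e => e.Name.LocalName == "PackageReference")
            .Select(e => (string)e.Attribute("Include"))
            .Where(id => id != null &&
                         id.StartsWith("FaceRecognitionDotNet", StringComparison.OrdinalIgnoreCase))
            .ToList();
    }

    // True when both the CPU package and a CUDA package are referenced.
    public static bool MixesCpuAndCuda(IEnumerable<string> packages)
    {
        var list = packages.ToList();
        bool hasCpu = list.Any(p => p.Equals("FaceRecognitionDotNet", StringComparison.OrdinalIgnoreCase));
        bool hasCuda = list.Any(p => p.StartsWith("FaceRecognitionDotNet.CUDA", StringComparison.OrdinalIgnoreCase));
        return hasCpu && hasCuda;
    }

    public static void Main(string[] args)
    {
        string path = args.Length > 0 ? args[0] : "AgeTraining.csproj";
        var packages = FaceRecognitionPackages(File.ReadAllText(path));
        Console.WriteLine("Referenced: " + string.Join(", ", packages));
        if (MixesCpuAndCuda(packages))
            Console.WriteLine("Both the CPU and a CUDA package are referenced; remove FaceRecognitionDotNet.");
    }
}
```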

@zevele
Author

zevele commented Aug 4, 2022

You need not to use FaceRecognitionDotNet...

Actually I tried that and got an error, so I thought I needed both of them. With both of them the training runs, just without CUDA.

If I do:
dotnet remove package FaceRecognitionDotNet
dotnet add package FaceRecognitionDotNet.CUDA112 (just in case...)
and then dotnet build -c Release
When I run the training I get:

              Epoch: 600
      Learning Rate: 0.001
  Min Learning Rate: 1E-05
     Min Batch Size: 384
Validation Interval: 20
           Use Mean: False

Start load train images
System.TypeInitializationException: The type initializer for 'DlibDotNet.NativeMethods' threw an exception. ---> System.DllNotFoundException: Unable to load DLL 'DlibDotNetNativeDnn': The specified module could not be found. (Exception from HRESULT: 0x8007007E)
   at DlibDotNet.NativeMethods.LossMetric_anet_type_create()
   at DlibDotNet.NativeMethods..cctor()
   --- End of inner exception stack trace ---
   at DlibDotNet.NativeMethods.load_image_matrix(MatrixElementType type, Byte[] path, Int32 pathLength, IntPtr& matrix, IntPtr& error_message)
   at DlibDotNet.Dlib.LoadImageAsMatrix[T](String path)
   at AgeTraining.Program.Load(String type, String directory, String meanImage, IList`1& images, IList`1& labels) in D:\Projects\VS2019\FaceRecognitionDotNet-1.3.0.7\tools\AgeTraining\Program.cs:line 268
   at AgeTraining.Program.Train(String baseName, String dataset, UInt32 epoch, Double learningRate, Double minLearningRate, UInt32 miniBatchSize, UInt32 validation, Boolean useMean) in D:\Projects\VS2019\FaceRecognitionDotNet-1.3.0.7\tools\AgeTraining\Program.cs:line 427

@takuya-takeuchi
Owner

@zevele

I have reproduced your issue, but I'm not sure why it occurs.
When I released the package, the link check reported no errors.
Of course, CUDA112 itself works fine.

But AgeTraining does not work even though it links DlibDotNet.CUDA112.

@takuya-takeuchi
Owner

I deployed these libraries to the app directory:

  • cublas64_11.dll
  • cublasLt64_11.dll
  • cudnn_adv_infer64_8.dll
  • cudnn_adv_train64_8.dll
  • cudnn_cnn_infer64_8.dll
  • cudnn_cnn_train64_8.dll
  • cudnn_ops_infer64_8.dll
  • cudnn_ops_train64_8.dll
  • cudnn64_8.dll
  • curand64_10.dll
  • cusolver64_11.dll

But the error persists.
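
A quick way to confirm which of the DLLs listed above are present in a given directory is sketched below. `CudaDllCheck` is a hypothetical helper and the file names are copied from the list above (CUDA 11.2 with cuDNN 8), so other CUDA versions would need different names. By default it checks both the current directory and the application base directory, since the two can differ.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class CudaDllCheck
{
    // DLL names copied from the list in this thread (CUDA 11.2, cuDNN 8).
    public static readonly string[] Required =
    {
        "cublas64_11.dll", "cublasLt64_11.dll",
        "cudnn_adv_infer64_8.dll", "cudnn_adv_train64_8.dll",
        "cudnn_cnn_infer64_8.dll", "cudnn_cnn_train64_8.dll",
        "cudnn_ops_infer64_8.dll", "cudnn_ops_train64_8.dll",
        "cudnn64_8.dll", "curand64_10.dll", "cusolver64_11.dll"
    };

    // Returns the required DLLs that are not found in the given directory.
    public static List<string> Missing(string directory)
    {
        return Required.Where(name => !File.Exists(Path.Combine(directory, name))).ToList();
    }

    public static void Main(string[] args)
    {
        var directories = args.Length > 0
            ? args
            : new[] { Environment.CurrentDirectory, AppContext.BaseDirectory };

        foreach (var directory in directories)
        {
            var missing = Missing(directory);
            Console.WriteLine(missing.Count == 0
                ? $"{directory}: all {Required.Length} DLLs found"
                : $"{directory}: missing {string.Join(", ", missing)}");
        }
    }
}
```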

@zevele
Author

zevele commented Aug 11, 2022

These libraries are deployed to app dir....

Thanks @takuya-takeuchi

I tried to copy the files manually - but I get the same error. Is there anything else that can be done?

@takuya-takeuchi
Owner

takuya-takeuchi commented Aug 14, 2022

Note

A simple dotnet test program that links FRDN.CUDA112 works fine with the CUDA libraries.

using System;
using System.Collections.Generic;
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;
using System.Linq;
using System.Net.Http;
using System.Reflection;
using System.Runtime.Serialization.Formatters.Binary;
using DlibDotNet;
using FaceRecognitionDotNet.Extensions;
using Xunit;
using Xunit.Abstractions;

namespace FaceRecognitionDotNet.Tests
{

    public class FaceRecognitionTest
    {

        private readonly string ModelDirectory = "Models";

        public FaceRecognitionTest(ITestOutputHelper testOutputHelper)
        {
            var dir = Environment.GetEnvironmentVariable("FaceRecognitionDotNetModelDir");
            if (Directory.Exists(dir))
            {
                ModelDirectory = dir;
            }
        }

        [Fact]
        public void Test()
        {
            using(var fr = FaceRecognitionDotNet.FaceRecognition.Create(ModelDirectory))
            {
            }
        }

    }

}

But a simple console program that links FRDN.CUDA112 does not work, even when the CUDA libraries are deployed.

using System;
using System.Collections.Generic;
using System.Drawing;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

using FaceRecognitionDotNet;

namespace Issue
{

    class Program
    {

        private static void Main(string[] args)
        {
            using(var fr = FaceRecognitionDotNet.FaceRecognition.Create("models"))
            {
                Console.WriteLine("test");
            }
        }

    }

}

@takuya-takeuchi
Owner

takuya-takeuchi commented Aug 14, 2022

And the link issue occurs only with CUDA 11.X.
It is very weird.

@takuya-takeuchi
Owner

Note

D:\Works\OpenSource\Temp\FaceRecognitionDotNet\17.#211>dotnet run -c Release
Unhandled Exception: System.TypeInitializationException: The type initializer for 'DlibDotNet.NativeMethods' threw an exception. ---> System.DllNotFoundException: Unable to load DLL 'DlibDotNetNativeDnn': The specified module could not be found. (Exception from HRESULT: 0x8007007E)
   at DlibDotNet.NativeMethods.LossMetric_anet_type_create()
   at DlibDotNet.NativeMethods..cctor()
   --- End of inner exception stack trace ---
   at DlibDotNet.NativeMethods.get_frontal_face_detector()
   at FaceRecognitionDotNet.FaceRecognition..ctor(String directory)
   at Issue.Program.Main(String[] args) in D:\Works\OpenSource\Temp\FaceRecognitionDotNet\17.#211\Program.cs:line 18

But running the built assembly directly from the output directory works:

D:\Works\OpenSource\Temp\FaceRecognitionDotNet\17.#211\bin\Release\netcoreapp2.0>dotnet Issue.dll
test

It looks like the program runs with the wrong current directory.
So after deploying the CUDA libraries to the directory that contains *.csproj, it works fine.

D:\Works\OpenSource\Temp\FaceRecognitionDotNet\17.#211>dir
 Volume in drive D is Data
 Volume Serial Number is ACE6-77C8

 Directory of D:\Works\OpenSource\Temp\FaceRecognitionDotNet\17.#211

2022/08/14  16:08    <DIR>          .
2022/08/14  16:08    <DIR>          ..
2022/08/14  15:21    <DIR>          bin
2022/08/14  16:08    <SYMLINK>      cublas64_11.dll [C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin\cublas64_11.dll]
2022/08/14  16:08    <SYMLINK>      cublasLt64_11.dll [C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin\cublasLt64_11.dll]
2022/08/14  16:08    <SYMLINK>      cudnn64_8.dll [C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin\cudnn64_8.dll]
2022/08/14  16:08    <SYMLINK>      cudnn_adv_infer64_8.dll [C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin\cudnn_adv_infer64_8.dll]
2022/08/14  16:08    <SYMLINK>      cudnn_adv_train64_8.dll [C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin\cudnn_adv_train64_8.dll]
2022/08/14  16:08    <SYMLINK>      cudnn_cnn_infer64_8.dll [C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin\cudnn_cnn_infer64_8.dll]
2022/08/14  16:08    <SYMLINK>      cudnn_cnn_train64_8.dll [C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin\cudnn_cnn_train64_8.dll]
2022/08/14  16:08    <SYMLINK>      cudnn_ops_infer64_8.dll [C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin\cudnn_ops_infer64_8.dll]
2022/08/14  16:08    <SYMLINK>      cudnn_ops_train64_8.dll [C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin\cudnn_ops_train64_8.dll]
2022/08/14  16:08    <SYMLINK>      curand64_10.dll [C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin\curand64_10.dll]
2022/08/14  16:08    <SYMLINK>      cusolver64_11.dll [C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin\cusolver64_11.dll]
2022/03/18  21:34            29,546 face1.jpg
2022/03/18  21:34            12,429 face2.jpg
2022/08/14  15:21               376 Issue.csproj
2021/03/27  20:28             1,114 Issue.sln
2021/02/16  01:33           729,940 mmod_human_face_detector.dat
2022/08/14  14:19    <SYMLINKD>     models [D:\Works\OpenSource\FaceRecognitionDotNet.Models]
2022/08/14  15:21    <DIR>          obj
2022/08/14  14:20               428 Program.BAK
2022/08/14  15:12               460 Program.cs
              18 File(s)        774,293 bytes
               5 Dir(s)  836,606,410,752 bytes free

D:\Works\OpenSource\Temp\FaceRecognitionDotNet\17.#211>dotnet run -c Release
test
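
As an alternative to copying or symlinking the CUDA libraries into the project directory, the process can add the CUDA `bin` directory to its own `PATH` before the first native call. The sketch below is an untested workaround, not something confirmed in this thread. It assumes `CUDA_PATH` is set and that the cuDNN DLLs also live under its `bin` directory. `NativeSearchPath` and `EnsureOnPath` are hypothetical names, and the approach may not help if the native library is loaded with a search mode that ignores `PATH`.

```csharp
using System;
using System.IO;
using System.Linq;

public static class NativeSearchPath
{
    // Returns a PATH value that contains the directory, prepending it when it is absent.
    public static string EnsureOnPath(string currentPath, string directory)
    {
        if (string.IsNullOrEmpty(currentPath)) return directory;

        string wanted = directory.TrimEnd('\\', '/');
        bool present = currentPath
            .Split(Path.PathSeparator)
            .Any(entry => string.Equals(entry.TrimEnd('\\', '/'), wanted, StringComparison.OrdinalIgnoreCase));

        return present ? currentPath : directory + Path.PathSeparator + currentPath;
    }

    public static void Main()
    {
        // These two differ under "dotnet run": the current directory is the project directory.
        Console.WriteLine($"Current directory: {Environment.CurrentDirectory}");
        Console.WriteLine($"Base directory:    {AppContext.BaseDirectory}");

        // Assumption: CUDA_PATH points at the toolkit and its bin folder holds the CUDA and cuDNN DLLs.
        string cudaPath = Environment.GetEnvironmentVariable("CUDA_PATH");
        if (string.IsNullOrEmpty(cudaPath))
        {
            Console.WriteLine("CUDA_PATH is not set; nothing changed.");
            return;
        }

        string cudaBin = Path.Combine(cudaPath, "bin");
        string updated = EnsureOnPath(Environment.GetEnvironmentVariable("PATH"), cudaBin);
        Environment.SetEnvironmentVariable("PATH", updated);
        Console.WriteLine($"PATH now includes {cudaBin}");

        // The first use of FaceRecognitionDotNet would go after this point.
    }
}
```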

@zevele
Author

zevele commented Aug 14, 2022

Note...

I tried copying these DLLs from the release folder to the solution folder (keeping them in both folders). Now the trainer runs again, but there is still no GPU utilization:
In Task Manager, the command prompt where I run the trainer (Windows Command Processor) shows about 30% CPU utilization but no GPU utilization.

Should I try using older versions of CUDA?

I didn't create symlinks and do not have the models link (is that a problem?)


14/08/2022  17:21    <DIR>          .
14/08/2022  17:21    <DIR>          ..
29/07/2022  18:57                69 .gitignore
16/07/2022  23:09    <DIR>          Adience
01/08/2022  09:18        70,887,810 adience-age-network_600_0.001_1E-05_384_False
14/08/2022  17:09                 0 adience-age-network_600_0.001_1E-05_384_False.log
01/08/2022  09:35        70,887,820 adience-age-network_600_0.001_1E-05_384_False_
16/07/2022  23:11    <DIR>          AdienceDataset
16/07/2022  23:47    <DIR>          AdienceDataset_preprocessed
04/08/2022  18:13               840 AgeTraining.csproj
06/08/2022  09:56    <DIR>          bin
15/02/2021  10:07       107,330,560 cublas64_11.dll
15/02/2021  10:07       175,706,112 cublasLt64_11.dll
25/02/2021  10:52           222,720 cudnn64_8.dll
25/02/2021  11:36       128,429,056 cudnn_adv_infer64_8.dll
25/02/2021  11:50        82,672,640 cudnn_adv_train64_8.dll
25/02/2021  11:58       545,695,232 cudnn_cnn_infer64_8.dll
25/02/2021  12:16        87,374,336 cudnn_cnn_train64_8.dll
25/02/2021  11:04       273,139,712 cudnn_ops_infer64_8.dll
25/02/2021  11:18        46,076,416 cudnn_ops_train64_8.dll
15/02/2021  15:38        60,627,968 curand64_10.dll
15/02/2021  15:38       396,296,704 cusolver64_11.dll
30/07/2022  21:43    <DIR>          images
14/08/2022  17:25    <DIR>          obj
29/07/2022  18:57            27,182 Program.cs
29/07/2022  18:57             5,261 README.md
30/07/2022  21:43    <DIR>          tools

@csetuomas

I have the same problem. Did you ever solve it? It takes a lifetime to train the model on the CPU.

@zevele
Author

zevele commented Sep 22, 2023

I have the same problem. Did you ever solve this problem? It takes life time to train the model with CPU.

No... I gave up... But if you manage to do it, please share your insights here.

@csetuomas

csetuomas commented Oct 4, 2023

I just can't get the GPU to work. I have tried almost every CUDA version. I have now been training the model on the CPU for over 1.5 weeks and it is about 7% done :| This is just a waste of resources. Could someone share the trained models for age and gender?
