
Better output when convergence check succeeds #150

Open · rcurtin opened this issue Dec 24, 2019 · 20 comments

@rcurtin (Member) commented Dec 24, 2019

Issue description

When using callbacks like ens::ProgressBar or similar, and the optimization terminates partway through an epoch due to a convergence check, the output can be confusing:

Epoch 1/29[=========================================================================================>..........] 89% - ETA: 0s - loss: 1115.75

but then output stops.

See the discussion at mlpack/mlpack#2073 for many more details and how to reproduce.

Expected behavior

We should see if we can modify the ens::ProgressBar callback (and perhaps others) to give better output in these situations. Perhaps something like this would be an improvement:

Epoch 1/29[=========================================================================================>..........] 89% - ETA: 0s - loss: 1115.75
Optimization converged! (value 1.33e-6, tolerance 1e-5)

Actual behavior

Instead, the output just stops:

Epoch 1/29[=========================================================================================>..........] 89% - ETA: 0s - loss: 1115.75

This would be a good issue if you are not familiar with how ensmallen's callbacks work and would like to be. I don't have a direct route to a solution here, so my suggestion would be to investigate the current callback code, reproduce the issue, then think about the cleanest way to print convergence information in the ens::ProgressBar callback. Once that's done, see if similar changes might be useful for some of the other callbacks that print information (see http://ensmallen.org/docs.html#callback-documentation for more information on the callbacks that are available).
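For context on where a fix could live: ensmallen callbacks are plain objects with optional hook methods. The sketch below is hypothetical (the class name is made up; the EndOptimization() hook and its parameter list follow the ensmallen callback documentation) and shows the shape of a callback that reacts when optimization ends. Note that such a hook fires on any termination, so by itself it cannot say why optimization stopped; surfacing that reason is exactly what this issue asks for.

    #include <iostream>

    // Hypothetical callback sketch: prints a closing line when optimization
    // ends.  It cannot distinguish "converged" from "hit MaxIterations()",
    // which is the missing information this issue wants printed.
    class PrintOnEnd
    {
     public:
      template<typename OptimizerType, typename FunctionType, typename MatType>
      void EndOptimization(OptimizerType& /* optimizer */,
                           FunctionType& /* function */,
                           MatType& /* coordinates */)
      {
        std::cout << std::endl << "Optimization terminated." << std::endl;
      }
    };

    // Usage sketch: optimizer.Optimize(f, coordinates,
    //                                  ens::ProgressBar(), PrintOnEnd());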

@vikram2000b

I would like to take up this issue.

@rcurtin (Member, Author) commented Jan 10, 2020

Hi @vikram2000b, you're more than welcome to. When you have a PR ready, someone will review it when they're able to. 👍

@dB2510 commented Jan 26, 2020

Can I take up this issue, since there is no PR yet?
I am a beginner and I want to explore how this works.

@zoq (Member) commented Jan 26, 2020

@dB2510 I think that is reasonable, @vikram2000b are you fine with that?

@dB2510 commented Jan 28, 2020

@zoq Any idea how I can reproduce the issue mentioned here: mlpack/mlpack#2073?

@rcurtin (Member, Author) commented Jan 30, 2020

@dB2510 there is some code in that issue that you can directly compile and use in a program to reproduce the problem. You will probably have to find some datasets to use, but that shouldn't be hard---you can probably use some of the datasets in src/mlpack/tests/data/ or something like that.

@gaurav-singh1998 (Contributor)

Hey @rcurtin, I wanted to know whether this has been solved; if not, can I move forward with it?

@rcurtin (Member, Author) commented Feb 23, 2020

@gaurav-singh1998 I have yet to see any PR related to the issue, so yes, if you can solve the issue, please do. 👍

@gaurav-singh1998 (Contributor)

Hi @rcurtin, @zoq, while going through the callback codebase to understand it for this issue, I came across the piece of code below, which iterates over the callbacks and invokes each one's BeginOptimization() callback if it exists:

    bool result = false;
    (void) std::initializer_list<bool>{ result =
        result || Callback::BeginOptimizationFunction(callbacks, optimizer,
            function, coordinates)... };
    return result;

I am having a hard time understanding how this works. Could you please help me understand the underlying concepts behind it? Thanks.

@gaurav-singh1998 (Contributor)

@rcurtin @zoq, one more thing; sorry if I am bothering you, but your help would be very useful for moving forward with this issue. According to the discussion that happened here, the optimization terminates partway either because the overallObjective value reaches nan or inf, because std::abs(lastObjective - overallObjective) < tolerance, or because of EarlyStopAtMinLoss. This information has to be incorporated into ens::ProgressBar. Am I right? Once again, sorry to disturb you with these doubts. Thanks.

@rcurtin (Member, Author) commented Mar 9, 2020

Hey @gaurav-singh1998, the bit of code in BeginOptimization() terminates the optimization after a callback returns false. We actually just submitted a paper to arXiv today that describes the details; you can find the sources in https://www.github.com/mlpack/ensmallen_tr/, and it will be up on arXiv soon.
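To illustrate the idiom in isolation, here is a standalone sketch (the names are made up for illustration and are not taken from ensmallen): the braced std::initializer_list forces the parameter pack to expand once per callback, evaluating left to right and OR-ing each return value into result.

    #include <initializer_list>
    #include <iostream>

    // Calls every callback in the pack and ORs the bool results together.
    // Note that `result || callbacks()` short-circuits: once result is
    // true, the remaining callbacks in the pack are not invoked.
    template<typename... CallbackTypes>
    bool AnyReturnedTrue(CallbackTypes&&... callbacks)
    {
      bool result = false;
      // Pre-C++17 pack-expansion trick: the braced list is evaluated left
      // to right, and each element invokes one callback.  The (void) cast
      // silences the unused-value warning.
      (void) std::initializer_list<bool>{
          (result = result || callbacks())... };
      return result;
    }

    int main()
    {
      auto quiet = []() { std::cout << "quiet\n"; return false; };
      auto loud  = []() { std::cout << "loud\n";  return true;  };
      std::cout << AnyReturnedTrue(quiet, loud) << std::endl;  // quiet, loud, 1
    }

Since C++17 the same thing can be written as a fold expression, but the initializer_list form also works on older language standards.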

I would suggest finding a way to make it so that ProgressBar can realize when optimization has terminated without actually adding all of the convergence checks to the internals of ProgressBar.

@gaurav-singh1998 (Contributor)

Thanks @rcurtin for the clarification; I will try to solve this issue in the next few days.

@gaurav-singh1998 (Contributor) commented Mar 10, 2020

Hi @rcurtin, while going through the ProgressBar code I found that the number of epochs is defined as epochs = optimizer.MaxIterations() / epochSize, but shouldn't it be epochs = optimizer.MaxIterations() / function.NumFunctions()? As an example, suppose there are 120000 data points in a dataset and the optimizer's MaxIterations() parameter is set to 240000, i.e. double the number of data points. Then, by the equation I stated, epochs turns out to be 2, which seems correct to me, because an epoch is a single pass through the full training set, and here we are supposed to take two passes through the dataset. Correct me if I am wrong. Thanks.
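(As a quick check of the arithmetic in that example, using the numbers above:)

    #include <cstddef>
    #include <iostream>

    int main()
    {
      const std::size_t numFunctions = 120000;   // data points in the dataset
      const std::size_t maxIterations = 240000;  // optimizer.MaxIterations()

      // epochs = MaxIterations() / NumFunctions(): two full passes of the data.
      std::cout << maxIterations / numFunctions << std::endl;  // prints 2
    }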

@rcurtin (Member, Author) commented Mar 11, 2020

epochSize is defined as function.NumFunctions() * optimizer.BatchSize(). Thus effectively epochSize is equivalent to the number of times the StepTaken() callback is called.

It does seem to me like epochs should then be defined to account for the value of optimizer.BatchSize(), so perhaps there is an issue there.

However, I'm not familiar enough with the code to say for sure. Before making any changes, I would suggest first confirming that this is an issue by creating an example where ProgressBar does not behave as expected, and then open a separate issue in this repository to handle it.

@gaurav-singh1998 (Contributor)

Hi @rcurtin, @zoq, @shrit, following the discussion that happened here, I reproduced the issue; the code for it can be found here. The dimensions of the dataset I used were [120000, 23] (considering the row-major matrix). After compiling and executing it, the output I got was:

Epoch 1/54
1563/1875 [===================================================================================>................] 83% - ETA: 0s - loss: 249.826

As can be seen, not a single epoch was completed by the training process, and in my view this is happening only because the entire dataset is not being read by the optimizer. I am saying this because:

  • After inspecting the source code of ens::EarlyStopAtMinLoss(), I can say that it can only abort the training procedure by checking whether the bestObjective has been achieved at the end of each epoch, and here not a single epoch is completed.
  • Similarly, the std::isnan(overallObjective) || std::isinf(overallObjective) and std::abs(lastObjective - overallObjective) < tolerance checks (i.e. the nan/inf check on overallObjective and the tolerance check) also cannot abort the training procedure: if we inspect the source code of sgd_impl.hpp, these conditions are only checked when (currentFunction % numFunctions) == 0 is true, which can only happen when an epoch completes, and that is not the case here (see the sketch at the end of this comment).

So I don't think the optimizer is reaching convergence here; this issue is only happening because the optimizer is not processing the entire dataset. I have tried to solve this here (kindly review it).

That said, I think there are some issues in progress_bar.hpp that need attention. One is the number of epochs not being set correctly, as I mentioned above; another is the ETA value being stuck at zero during the training process. I will open a separate issue regarding these. Kindly correct me if I am wrong somewhere. Thanks.
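To make the second bullet above concrete, here is a toy reduction of the loop shape described there (illustrative only; the real control flow lives in ensmallen's sgd_impl.hpp, and the variable names follow the snippets quoted above):

    #include <cmath>
    #include <cstddef>
    #include <iostream>

    int main()
    {
      const std::size_t numFunctions = 120000;   // data points
      const std::size_t maxIterations = 100000;  // fewer than numFunctions
      const double tolerance = 1e-5;
      double overallObjective = 0.0, lastObjective = 1.0;

      for (std::size_t currentFunction = 1;
           currentFunction <= maxIterations; ++currentFunction)
      {
        // ... an SGD step would update overallObjective here ...

        // The nan/inf and tolerance checks run only at an epoch boundary;
        // with maxIterations < numFunctions this branch is never entered,
        // so the loop can end mid-epoch without any convergence message.
        if ((currentFunction % numFunctions) == 0)
        {
          if (std::isnan(overallObjective) || std::isinf(overallObjective))
            break;
          if (std::abs(lastObjective - overallObjective) < tolerance)
            break;
          lastObjective = overallObjective;
        }
      }
      std::cout << "Stopped before completing a single epoch." << std::endl;
    }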

@shrit (Member) commented Mar 11, 2020

@gaurav-singh1998 I do not think the optimizer has converged in your case, but that is only because you have set optimizer.Tolerance() = -1, so the optimizer will never converge.
Let me try to clarify things as I understand them: the objective of this issue is to give better output specifically in the case when the optimizer converges. Therefore, it might be a good idea to increase optimizer.Tolerance() and wait until the optimizer converges.
That output would allow us to understand why the training is terminating partway; since we do not have any output here, we cannot understand the reason behind the stop.
Since the number of iterations in your code is less than the number of data points, it is normal for the optimization to stop before the end of the epoch. (I am not sure if you have used your pull request mlpack/mlpack#2238 in this case, but it should surely print the warning message.)

I am not familiar with progress_bar.hpp, so I do not know whether the epoch count is calculated correctly. I have also noticed ETA: 0 several times while the bar is progressing; that might be an issue as well. Thank you for bringing it up here! 👍

@gaurav-singh1998 (Contributor) commented Mar 11, 2020

Hi @shrit, thanks for commenting. As you suggested, I changed the optimizer.Tolerance() value from -1 to 120596 and let the optimizer run for three epochs until it converged; this was the output:

Epoch 1/0
1875/1875 [====================================================================================================] 100% - ETA: 0s - loss: 208.871
208.982
1875/1875 [====================================================================================================] 100% - 5s 34ms/step - loss: 208.982
Epoch 2/0
1875/1875 [====================================================================================================] 100% - ETA: 0s - loss: 1.4697
1.47048
1875/1875 [====================================================================================================] 100% - 7s 36ms/step - loss: 1.47048
Epoch 3/0
1875/1875 [====================================================================================================] 100% - ETA: 0s - loss: 0.52357
0.523849
1875/1875 [====================================================================================================] 100% - 4s 34ms/step - loss: 0.523849

As can be seen, there is no message stating that the optimizer has converged, which should have been printed according to this; but when I changed Info to std::cout, I got the expected result, i.e.:

Epoch 1/0
1875/1875 [====================================================================================================] 100% - ETA: 0s - loss: 208.871
208.982
1875/1875 [====================================================================================================] 100% - 6s 35ms/step - loss: 208.982
Epoch 2/0
1875/1875 [====================================================================================================] 100% - ETA: 0s - loss: 1.4697
1.47048
1875/1875 [====================================================================================================] 100% - 5s 34ms/step - loss: 1.47048
Epoch 3/0
1875/1875 [====================================================================================================] 100% - ETA: 0s - loss: 0.52357
0.523849
1875/1875 [====================================================================================================] 100% - 6s 35ms/step - loss: 0.523849
SGD: minimized within tolerance 120596; terminating optimization.

This is weird; I think the problem lies in ens::Info. @rcurtin @zoq, what do you think about this?

> (I am not sure if you have used your pull request mlpack/mlpack#2238 in this case, but it should surely print the warning message.)

Yes, I used it, and the warning message gets printed as expected. Kindly let me know if I am wrong somewhere. Thanks.

@shrit (Member) commented Mar 11, 2020

@gaurav-singh1998 I do not remember whether the ens::Info macros are ON by default in CMake. If not, you may need to add option(ENS_PRINT_INFO ON) in your CMakeLists file.
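(For anyone trying this without editing CMake files: the macros can also be defined directly before the include, or passed on the compiler command line. A minimal sketch, assuming the ENS_PRINT_INFO / ENS_PRINT_WARN macro names from the ensmallen documentation:)

    // Switch ensmallen's Info/Warn output on; these must be defined before
    // ensmallen.hpp is included (equivalently, pass -DENS_PRINT_INFO
    // -DENS_PRINT_WARN on the compiler command line).
    #define ENS_PRINT_INFO
    #define ENS_PRINT_WARN
    #include <ensmallen.hpp>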

@gaurav-singh1998 (Contributor) commented Mar 13, 2020

Hi @shrit, sorry for the late response; when I made the change in this line to define ENS_PRINT_INFO, as suggested by @zoq on Gitter, I got a richer result, i.e.:

Maximum number of iterations not defined (no limit), no progress bar shown.
Epoch 1/0
208.98275 [====================================================================================================] 100% - ETA: 0s - loss: 208.871
1875/1875 [====================================================================================================] 100% - 1s 32ms/step - loss: 208.982
SGD: iteration 120000, objective 2.50779e+07.
Epoch 2/0
1.4704875 [====================================================================================================] 100% - ETA: 0s - loss: 1.4697
1875/1875 [====================================================================================================] 100% - 1s 32ms/step - loss: 1.47048
SGD: iteration 240000, objective 176458.
Epoch 3/0
0.5238495 [====================================================================================================] 100% - ETA: 0s - loss: 0.52357
1875/1875 [====================================================================================================] 100% - 1s 32ms/step - loss: 0.523849
SGD: iteration 360000, objective 62861.9.
SGD: minimized within tolerance 120596; terminating optimization.

In my view the first line, "Maximum number of iterations not defined (no limit), no progress bar shown.", shouldn't be printed, but other than that everything looks fine to me. What do you think about this?

@gaurav-singh1998 (Contributor)

Hi @shrit, kindly have a look at #180. Thanks.
