
Better output when convergence check succeeds #150

Open · rcurtin opened this issue Dec 24, 2019 · 20 comments

@rcurtin (Member) commented Dec 24, 2019

Issue description

When using callbacks like ens::ProgressBar or similar, and the optimization terminates partway through an epoch due to a convergence check, the output can be confusing:

Epoch 1/29[=========================================================================================>..........] 89% - ETA: 0s - loss: 1115.75

but then output stops.

See the discussion at mlpack/mlpack#2073 for many more details and how to reproduce.

Expected behavior

We should see if we can modify the ens::ProgressBar callback (and perhaps others) to give better output in these situations. Perhaps something like this would be an improvement:

Epoch 1/29[=========================================================================================>..........] 89% - ETA: 0s - loss: 1115.75
Optimization converged! (value 1.33e-6, tolerance 1e-5)

Actual behavior

Instead, the output just stops:

Epoch 1/29[=========================================================================================>..........] 89% - ETA: 0s - loss: 1115.75

This would be a good issue if you are not familiar with how ensmallen's callbacks work and would like to be. I don't have a direct route to a solution here, so my suggestion would be to investigate the current callback code, reproduce the issue, then think about the cleanest way to print convergence information in the ens::ProgressBar callback. Once that's done, see if similar changes might be useful for some of the other callbacks that print information (see http://ensmallen.org/docs.html#callback-documentation for more information on the callbacks that are available).
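For context on where a fix could live: ensmallen callbacks are plain objects with optional hook methods. The sketch below is hypothetical (the class name is made up; the EndOptimization() hook and its parameter list follow the ensmallen callback documentation) and shows the shape of a callback that reacts when optimization ends. Note that such a hook fires on any termination, so by itself it cannot say why optimization stopped; surfacing that reason is exactly what this issue asks for.

    #include <iostream>

    // Hypothetical callback sketch: prints a closing line when optimization
    // ends.  It cannot distinguish "converged" from "hit MaxIterations()",
    // which is the missing information this issue wants printed.
    class PrintOnEnd
    {
     public:
      template<typename OptimizerType, typename FunctionType, typename MatType>
      void EndOptimization(OptimizerType& /* optimizer */,
                           FunctionType& /* function */,
                           MatType& /* coordinates */)
      {
        std::cout << std::endl << "Optimization terminated." << std::endl;
      }
    };

    // Usage sketch: optimizer.Optimize(f, coordinates,
    //                                  ens::ProgressBar(), PrintOnEnd());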

@vikram2000b

I would like to take up this issue.

@rcurtin (Member, Author) commented Jan 10, 2020

Hi @vikram2000b, you're more than welcome to. When you have a PR ready, someone will review it when they're able to. 👍

@dB2510 commented Jan 26, 2020

Can I take up this issue, since there is no PR yet?
I am a beginner and I want to explore how this works.

@zoq (Member) commented Jan 26, 2020

@dB2510 I think that is reasonable, @vikram2000b are you fine with that?

@dB2510 commented Jan 28, 2020

@zoq Any idea how I can reproduce the issue mentioned here: mlpack/mlpack#2073?

@rcurtin (Member, Author) commented Jan 30, 2020

@dB2510 there is some code in that issue that you can directly compile and use in a program to reproduce the problem. You will probably have to find some datasets to use, but that shouldn't be hard---you can probably use some of the datasets in src/mlpack/tests/data/ or something like that.

@gaurav-singh1998 (Contributor)

Hey @rcurtin, I wanted to know whether this has been solved; if not, can I move forward with it?

@rcurtin (Member, Author) commented Feb 23, 2020

@gaurav-singh1998 I have yet to see any PR related to the issue, so yes, if you can solve the issue, please do. 👍

@gaurav-singh1998 (Contributor)

Hi @rcurtin, @zoq, while going through the callback codebase to understand it for this issue, I came across the piece of code below, which iterates over the callbacks and invokes each one's BeginOptimization() callback if it exists:

    bool result = false;
    (void) std::initializer_list<bool>{ result =
        result || Callback::BeginOptimizationFunction(callbacks, optimizer,
            function, coordinates)... };
    return result;

I am having a hard time understanding how this works. Could you please help me understand the underlying concepts behind it? Thanks.

@gaurav-singh1998 (Contributor)

@rcurtin @zoq, one more thing; sorry if I am bothering you, but your help would be very useful for moving forward with this issue. According to the discussion that happened here, the optimization terminates partway either because the overallObjective value reaches nan or inf, because std::abs(lastObjective - overallObjective) < tolerance, or because of EarlyStopAtMinLoss. This information has to be incorporated into ens::ProgressBar. Am I right? Once again, sorry to disturb you with these doubts. Thanks.

@rcurtin (Member, Author) commented Mar 9, 2020

Hey @gaurav-singh1998, the bit of code in BeginOptimization() terminates the optimization after a callback returns false. We actually just submitted a paper to arXiv today that describes the details; you can find the sources in https://www.github.com/mlpack/ensmallen_tr/, and it will be up on arXiv soon.
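To illustrate the idiom in isolation, here is a standalone sketch (the names are made up for illustration and are not taken from ensmallen): the braced std::initializer_list forces the parameter pack to expand once per callback, evaluating left to right and OR-ing each return value into result.

    #include <initializer_list>
    #include <iostream>

    // Calls every callback in the pack and ORs the bool results together.
    // Note that `result || callbacks()` short-circuits: once result is
    // true, the remaining callbacks in the pack are not invoked.
    template<typename... CallbackTypes>
    bool AnyReturnedTrue(CallbackTypes&&... callbacks)
    {
      bool result = false;
      // Pre-C++17 pack-expansion trick: the braced list is evaluated left
      // to right, and each element invokes one callback.  The (void) cast
      // silences the unused-value warning.
      (void) std::initializer_list<bool>{
          (result = result || callbacks())... };
      return result;
    }

    int main()
    {
      auto quiet = []() { std::cout << "quiet\n"; return false; };
      auto loud  = []() { std::cout << "loud\n";  return true;  };
      std::cout << AnyReturnedTrue(quiet, loud) << std::endl;  // quiet, loud, 1
    }

Since C++17 the same thing can be written as a fold expression, but the initializer_list form also works on older language standards.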

I would suggest finding a way to make it so that ProgressBar can realize when optimization has terminated without actually adding all of the convergence checks to the internals of ProgressBar.

@gaurav-singh1998 (Contributor)

Thanks @rcurtin for the clarification; I will try to solve this issue in the next few days.

@gaurav-singh1998 (Contributor) commented Mar 10, 2020

Hi @rcurtin, while going through the ProgressBar code I found that the number of epochs is defined as epochs = optimizer.MaxIterations() / epochSize, but shouldn't it be epochs = optimizer.MaxIterations() / function.NumFunctions()? As an example, suppose there are 120000 data points in a dataset and the optimizer's MaxIterations() parameter is set to 240000, i.e. double the number of data points. Then, by the equation I stated, epochs turns out to be 2, which seems correct to me, because an epoch is a single pass through the full training set, and here we are supposed to take two passes through the dataset. Correct me if I am wrong. Thanks.
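(As a quick check of the arithmetic in that example, using the numbers above:)

    #include <cstddef>
    #include <iostream>

    int main()
    {
      const std::size_t numFunctions = 120000;   // data points in the dataset
      const std::size_t maxIterations = 240000;  // optimizer.MaxIterations()

      // epochs = MaxIterations() / NumFunctions(): two full passes of the data.
      std::cout << maxIterations / numFunctions << std::endl;  // prints 2
    }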

@rcurtin (Member, Author) commented Mar 11, 2020

epochSize is defined as function.NumFunctions() * optimizer.BatchSize(). Thus effectively epochSize is equivalent to the number of times the StepTaken() callback is called.

It does seem to me like epochs should then be defined to account for the value of optimizer.BatchSize(), so perhaps there is an issue there.

However, I'm not familiar enough with the code to say for sure. Before making any changes, I would suggest first confirming that this is an issue by creating an example where ProgressBar does not behave as expected, and then open a separate issue in this repository to handle it.

@gaurav-singh1998 (Contributor)

Hi @rcurtin, @zoq, @shrit, following the discussion that happened here, I reproduced the issue; the code for it can be found here. The dimensions of the dataset I used were [120000, 23] (considering the row-major matrix). After compiling and executing it, the output I got was:

Epoch 1/54
1563/1875 [===================================================================================>................] 83% - ETA: 0s - loss: 249.826

As can be seen, not a single epoch was completed by the training process, and in my view this is happening only because the entire dataset is not being read by the optimizer. I am saying this because:

  • After inspecting the source code of ens::EarlyStopAtMinLoss(), I can say that it can only abort the training procedure by checking whether the bestObjective has been achieved at the end of each epoch, and here not a single epoch is completed.
  • Similarly, the std::isnan(overallObjective) || std::isinf(overallObjective) and std::abs(lastObjective - overallObjective) < tolerance checks (i.e. the nan/inf check on overallObjective and the tolerance check) also cannot abort the training procedure: if we inspect the source code of sgd_impl.hpp, these conditions are only checked when (currentFunction % numFunctions) == 0 is true, which can only happen when an epoch completes, and that is not the case here (see the sketch at the end of this comment).

So I don't think the optimizer is reaching convergence here; this issue is only happening because the optimizer is not processing the entire dataset. I have tried to solve this here (kindly review it).

That said, I think there are some issues in progress_bar.hpp that need attention. One is the number of epochs not being set correctly, as I mentioned above; another is the ETA value being stuck at zero during the training process. I will open a separate issue regarding these. Kindly correct me if I am wrong somewhere. Thanks.
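To make the second bullet above concrete, here is a toy reduction of the loop shape described there (illustrative only; the real control flow lives in ensmallen's sgd_impl.hpp, and the variable names follow the snippets quoted above):

    #include <cmath>
    #include <cstddef>
    #include <iostream>

    int main()
    {
      const std::size_t numFunctions = 120000;   // data points
      const std::size_t maxIterations = 100000;  // fewer than numFunctions
      const double tolerance = 1e-5;
      double overallObjective = 0.0, lastObjective = 1.0;

      for (std::size_t currentFunction = 1;
           currentFunction <= maxIterations; ++currentFunction)
      {
        // ... an SGD step would update overallObjective here ...

        // The nan/inf and tolerance checks run only at an epoch boundary;
        // with maxIterations < numFunctions this branch is never entered,
        // so the loop can end mid-epoch without any convergence message.
        if ((currentFunction % numFunctions) == 0)
        {
          if (std::isnan(overallObjective) || std::isinf(overallObjective))
            break;
          if (std::abs(lastObjective - overallObjective) < tolerance)
            break;
          lastObjective = overallObjective;
        }
      }
      std::cout << "Stopped before completing a single epoch." << std::endl;
    }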

@shrit (Member) commented Mar 11, 2020

@gaurav-singh1998 I do not think the optimizer has converged in your case, but that is only because you have set optimizer.Tolerance() = -1, so the optimizer will never converge.
Let me try to clarify things as I understand them: the objective of this issue is to give better output specifically in the case when the optimizer converges. Therefore, it might be a good idea to increase optimizer.Tolerance() and wait until the optimizer converges.
That output would allow us to understand why the training is terminating partway; since we do not have any output here, we cannot understand the reason behind the stop.
Since the number of iterations in your code is less than the number of data points, it is normal for the optimization to stop before the end of the epoch. (I am not sure if you have used your pull request mlpack/mlpack#2238 in this case, but it should surely print the warning message.)

I am not familiar with progress_bar.hpp, so I do not know whether the epoch count is calculated correctly. I have also noticed ETA: 0 several times while the bar is progressing; that might be an issue as well. Thank you for bringing it up here! 👍

@gaurav-singh1998 (Contributor) commented Mar 11, 2020

Hi @shrit, thanks for commenting. As you suggested, I changed the optimizer.Tolerance() value from -1 to 120596 and let the optimizer run for three epochs until it converged; this was the output:

Epoch 1/0
1875/1875 [====================================================================================================] 100% - ETA: 0s - loss: 208.871
208.982
1875/1875 [====================================================================================================] 100% - 5s 34ms/step - loss: 208.982
Epoch 2/0
1875/1875 [====================================================================================================] 100% - ETA: 0s - loss: 1.4697
1.47048
1875/1875 [====================================================================================================] 100% - 7s 36ms/step - loss: 1.47048
Epoch 3/0
1875/1875 [====================================================================================================] 100% - ETA: 0s - loss: 0.52357
0.523849
1875/1875 [====================================================================================================] 100% - 4s 34ms/step - loss: 0.523849

As can be seen, there is no message stating that the optimizer has converged, which should have been printed according to this; but when I changed Info to std::cout, I got the expected result, i.e.:

Epoch 1/0
1875/1875 [====================================================================================================] 100% - ETA: 0s - loss: 208.871
208.982
1875/1875 [====================================================================================================] 100% - 6s 35ms/step - loss: 208.982
Epoch 2/0
1875/1875 [====================================================================================================] 100% - ETA: 0s - loss: 1.4697
1.47048
1875/1875 [====================================================================================================] 100% - 5s 34ms/step - loss: 1.47048
Epoch 3/0
1875/1875 [====================================================================================================] 100% - ETA: 0s - loss: 0.52357
0.523849
1875/1875 [====================================================================================================] 100% - 6s 35ms/step - loss: 0.523849
SGD: minimized within tolerance 120596; terminating optimization.

This is weird; I think the problem lies in ens::Info. @rcurtin @zoq, what do you think about this?

> (I am not sure if you have used your pull request mlpack/mlpack#2238 in this case, but it should surely print the warning message.)

Yes, I used it, and the warning message gets printed as expected. Kindly let me know if I am wrong somewhere. Thanks.

@shrit (Member) commented Mar 11, 2020

@gaurav-singh1998 I do not remember whether the ens::Info macros are ON by default in CMake. If not, you may need to add option(ENS_PRINT_INFO ON) in your CMakeLists file.
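(For anyone trying this without editing CMake files: the macros can also be defined directly before the include, or passed on the compiler command line. A minimal sketch, assuming the ENS_PRINT_INFO / ENS_PRINT_WARN macro names from the ensmallen documentation:)

    // Switch ensmallen's Info/Warn output on; these must be defined before
    // ensmallen.hpp is included (equivalently, pass -DENS_PRINT_INFO
    // -DENS_PRINT_WARN on the compiler command line).
    #define ENS_PRINT_INFO
    #define ENS_PRINT_WARN
    #include <ensmallen.hpp>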

@gaurav-singh1998 (Contributor) commented Mar 13, 2020

Hi @shrit, sorry for the late response; when I made the change in this line to define ENS_PRINT_INFO, as suggested by @zoq on Gitter, I got a richer result, i.e.:

Maximum number of iterations not defined (no limit), no progress bar shown.
Epoch 1/0
208.98275 [====================================================================================================] 100% - ETA: 0s - loss: 208.871
1875/1875 [====================================================================================================] 100% - 1s 32ms/step - loss: 208.982
SGD: iteration 120000, objective 2.50779e+07.
Epoch 2/0
1.4704875 [====================================================================================================] 100% - ETA: 0s - loss: 1.4697
1875/1875 [====================================================================================================] 100% - 1s 32ms/step - loss: 1.47048
SGD: iteration 240000, objective 176458.
Epoch 3/0
0.5238495 [====================================================================================================] 100% - ETA: 0s - loss: 0.52357
1875/1875 [====================================================================================================] 100% - 1s 32ms/step - loss: 0.523849
SGD: iteration 360000, objective 62861.9.
SGD: minimized within tolerance 120596; terminating optimization.

In my view the first line, "Maximum number of iterations not defined (no limit), no progress bar shown.", shouldn't be printed, but other than that everything looks fine to me. What do you think about this?

@gaurav-singh1998 (Contributor)

Hi @shrit, kindly have a look at #180. Thanks.
