
Pose2SLAMExample_g2o and Pose3SLAMExample_g2o, the GN solver cannot proceed on some common benchmark datasets #846

Closed
yiping opened this issue Aug 15, 2021 · 14 comments
Labels: bug Bug report


yiping commented Aug 15, 2021

Description

The shipped pose SLAM examples (both the 2D and 3D versions) seem to have trouble with some commonly used benchmark datasets that are typically considered "easy to solve". Is this expected? I understand these are all local solvers, so they can get stuck at local minima, but here the solver quits after the first iteration because the error increased. Could something be wrong with the line search algorithm?

Also, should the examples use the LM solver, or another flavor of trust-region algorithm, by default?

Two example datasets can be found on Dr. Luca Carlone's website.

input_INTEL_g2o.g2o (SE2 poses)

sphere_bignoise_vertex3.g2o (SE3 poses)

Steps to reproduce

Pose2SLAMExample_g2o uses the Gauss-Newton solver by default, and when tested with the INTEL dataset it gets stuck at the first iteration because the error increased. If I switch the example to the LM solver, it seems to converge after 42 iterations, but the result is garbage. The LM solver itself might be working just fine; the poor result is very likely due to the wild information matrices in this g2o file (values ranging in magnitude from 10^2 to 10^12). Such information matrices can produce a poorly conditioned Hessian and hence a bad optimization result. But I would like to hear some insights from other people.

Another widely used dataset, input_M3500_g2o.g2o, works just fine with this example using the default GN solver.

% ./Pose2SLAMExample_g2o input_INTEL_g2o.g2o intel_result.g2o
Adding prior on pose 0 
Optimizing the factor graph
Warning:  stopping nonlinear iterations because error increased
errorThreshold: 1.54678e+09 <? 0
absoluteDecrease: -1544205066.78 <? 1e-05
relativeDecrease: -599.723772745 <? 1e-05
iterations: 1 >? 100
Optimization complete
initial error=2574860.52239
final error=1546779927.31
Writing results to file: intel_result.g2o
done! 

Pose3SLAMExample_g2o also gets stuck at the first iteration on the sphere_bignoise_vertex3.g2o dataset (using the GN solver). Note this is not the sphere2500 dataset (it is similar, but has a lot more connections, so it is a bit more difficult).

% ./Pose3SLAMExample_g2o sphere_bignoise_vertex3.g2o sphere_result.g2o
Adding prior to g2o file 
Optimizing the factor graph
Warning:  stopping nonlinear iterations because error increased
errorThreshold: 8.96714e+08 <? 0
absoluteDecrease: -681084456.774 <? 1e-05
relativeDecrease: -3.15858546399 <? 1e-05
iterations: 1 >? 100
Optimization complete
initial error=215629579.93
final error=896714036.704
Writing results to file: sphere_result.g2o
done! 

Expectation

This may not be a bug; I just want to better understand why the examples don't quite work on some commonly used datasets. Is this expected given the solvers' limitations, is something set up or configured incorrectly, or is it a bug?

Environment

Latest GTSAM version compiled on Debian 10


ProfFan commented Aug 15, 2021

Thanks for the report!

I think you could look at the chordal initialization papers (by Dr. Carlone), and #248, and borglab/SwiftFusion#42. I am pretty sure this is caused by bad initialization, but it would be great if you could compare the result with other solvers so we know whether there is a problem :)

lucacarlone commented

@yiping: it is also worth adding "#define SLOW_BUT_CORRECT_BETWEENFACTOR" at the beginning of the file. That uses the correct Jacobians for the between factors, which should improve convergence. Let me know if this helps!

varunagrawal commented

@yiping please ask questions related to functionality on the Google Group. It should have been linked in the issue template.
Closing this since it doesn't seem to be an issue in GTSAM.

lucacarlone commented

@varunagrawal : this might be a bug since the dataset is supposed to converge easily. I suggest reopening and doing some investigation. I'll be happy to help post ICRA.

@varunagrawal varunagrawal reopened this Aug 21, 2021
@varunagrawal varunagrawal added the bug Bug report label Sep 1, 2021
lucacarlone commented

@yiping @varunagrawal : I carefully looked into this and everything looks good. The Jacobians for the Pose2 case match exactly my manual implementation in Matlab and converge to the same solution. The only issue is that with GN the cost increases after the first iteration on the intel dataset. This triggers GTSAM to terminate early. If that stopping condition is disabled, everything converges as expected.

To verify what I'm saying, try setting:

  params.setRelativeErrorTol(-1e+20);
  params.setAbsoluteErrorTol(-1e+20);
  params.setMaxIterations(10);

in the example. Then the optimization for the Intel dataset converges to a nice and small cost (~107) after 4-5 iterations. (The downside is that if the error tolerances are disabled, then GTSAM always performs the maximum number of iterations, but I guess that's a different issue).

For the Pose3 case, I'm already convinced that the Jacobians are correct as long as "SLOW_BUT_CORRECT_BETWEENFACTOR" is defined when including the between factors.

At this point, if @yiping is ok, I suggest closing this issue.

varunagrawal commented

Awesome! Thanks @lucacarlone

varunagrawal commented

On that note, we should consider making progress on #88


yiping commented Sep 21, 2021

@lucacarlone Thanks a lot for the investigation! (And apologies for the delayed response.) I confirm the result for the input_INTEL_g2o dataset using the GN solver (after applying the change you proposed) on a Debian 10 machine.

(GaussNewtonOptimizer)

Adding prior on pose 0 
Optimizing the factor graph
iterations: 30 >? 30
Terminating because reached maximum iterations
Optimization complete
initial error=2.57486e+06
final error=107.927

However, if I switch to the Levenberg-Marquardt optimizer (LevenbergMarquardtOptimizer), the solver does not converge.

(LevenbergMarquardtOptimizer)

Adding prior on pose 0 
Optimizing the factor graph
iterations: 30 >? 30
Terminating because reached maximum iterations
Optimization complete
initial error=2.57486e+06
final error=412341

I also tried the same dataset with the Ceres pose_graph_2d example (the example uses the Levenberg-Marquardt solver, which is the default in Ceres; I don't think Ceres has a vanilla version of GN).

./pose_graph_2d -input input_INTEL_g2o.g2o

The solver also does not converge after 100 iterations (see details below), and the optimized poses are off. The residual cost formulation in the Ceres 2D pose graph example seems pretty straightforward and correct, so I am guessing it's the solver? Do you think this is a case in which GN can perform better than LM?

Solver Summary (v 2.0.0-eigen-(3.3.4)-no_lapack-eigensparse-no_openmp)

                                     Original                  Reduced
Parameter blocks                         3684                     3681
Parameters                               3684                     3681
Residual blocks                          1483                     1483
Residuals                                4449                     4449

Minimizer                        TRUST_REGION

Sparse linear algebra library    EIGEN_SPARSE
Trust region strategy     LEVENBERG_MARQUARDT

                                        Given                     Used
Linear solver          SPARSE_NORMAL_CHOLESKY   SPARSE_NORMAL_CHOLESKY
Threads                                     1                        1
Linear solver ordering              AUTOMATIC                     3681

Cost:
Initial                          3.501161e+06
Final                            1.442424e+04
Change                           3.486737e+06

Minimizer iterations                      101
Successful steps                           82
Unsuccessful steps                         19 

varunagrawal commented

That shouldn't be the case: depending on the lambda value, LM interpolates between gradient descent and GN, so with the right lambda, LM would simply behave like GN.

lucacarlone commented

I think the initialization point here might be tricky: GN fails to decrease the cost from the initial point, but is then able to recover. LM, on the other hand, might end up taking smaller and smaller steps without getting very far from the initialization.

@ProfFan ProfFan added this to the GTSAM 4.2 milestone Apr 19, 2022
dellaert commented

I know this has been a long time, but I am looking to close outstanding issues as we converge on 4.2. From the discussion above I still have no idea why Ceres converges but GTSAM does not; they both just implement GN/LM. Any ideas?

lucacarlone commented

Hi Frank: see my comments above. The issue in the 2D problem is that GTSAM terminates too early (the stopping condition forces GN to stop if the cost increases); if that stopping condition is disabled, GN works as expected. In the 3D case, the Jacobians might be the problem, and defining SLOW_BUT_CORRECT_BETWEENFACTOR should solve it.

dellaert commented

Yeah, I am looking at that PR also.

varunagrawal commented

Quoting @lucacarlone: "At this point, if @yiping is ok, I suggest closing this issue."

Closing this based on @lucacarlone's recommendation.

5 participants