
Pose2SLAMExample_g2o and Pose3SLAMExample_g2o, the GN solver cannot proceed on some common benchmark datasets #846

Closed
yiping opened this issue Aug 15, 2021 · 14 comments
Labels: bug Bug report


yiping commented Aug 15, 2021

Description

The shipped pose SLAM examples (both the 2D and 3D versions) seem to have trouble with some commonly used benchmark datasets that are typically considered "easy to solve". Is this expected? I understand these are all local solvers, so they can get stuck at local minima, but here the solver quits after the first iteration because the error increased. Could something be wrong with the line search algorithm?

Also, should the examples use the LM solver, or another flavor of trust-region algorithm, by default?

Two example datasets can be found on Dr. Luca Carlone's website.

input_INTEL_g2o.g2o (SE2 poses)

sphere_bignoise_vertex3.g2o (SE3 poses)

Steps to reproduce

Pose2SLAMExample_g2o uses the Gauss-Newton solver by default, and when tested with the INTEL dataset it gets stuck at the first iteration because the error increased. If I switch the example to the LM solver, it seems to converge after 42 iterations, but the result is garbage. The LM solver itself might be working just fine; the poor result is very likely due to the wild information matrices in this g2o file (values ranging in magnitude from 10^2 to 10^12). Such information matrices can produce a poorly conditioned Hessian and hence a bad optimization result. But I would like to hear some insights from other people.

Another widely used dataset, input_M3500_g2o.g2o, works just fine with this example using the default GN solver.

% ./Pose2SLAMExample_g2o input_INTEL_g2o.g2o intel_result.g2o
Adding prior on pose 0 
Optimizing the factor graph
Warning:  stopping nonlinear iterations because error increased
errorThreshold: 1.54678e+09 <? 0
absoluteDecrease: -1544205066.78 <? 1e-05
relativeDecrease: -599.723772745 <? 1e-05
iterations: 1 >? 100
Optimization complete
initial error=2574860.52239
final error=1546779927.31
Writing results to file: intel_result.g2o
done! 

Pose3SLAMExample_g2o also gets stuck at the first iteration on the sphere_bignoise_vertex3.g2o dataset (using the GN solver). Note this is not the sphere2500 dataset (it is similar, but has a lot more connections, so it is a bit more difficult).

% ./Pose3SLAMExample_g2o sphere_bignoise_vertex3.g2o sphere_result.g2o
Adding prior to g2o file 
Optimizing the factor graph
Warning:  stopping nonlinear iterations because error increased
errorThreshold: 8.96714e+08 <? 0
absoluteDecrease: -681084456.774 <? 1e-05
relativeDecrease: -3.15858546399 <? 1e-05
iterations: 1 >? 100
Optimization complete
initial error=215629579.93
final error=896714036.704
Writing results to file: sphere_result.g2o
done! 

Expectation

This may not be a bug; I just want to better understand why the examples don't quite work on some commonly used datasets. Is this expected given the solvers' limitations, is something set up or configured incorrectly, or is it a bug?

Environment

Latest GTSAM version compiled on Debian 10


ProfFan commented Aug 15, 2021

Thanks for the report!

I think you could look at the chordal initialization papers (by Dr. Carlone), and #248, and borglab/SwiftFusion#42. I am pretty sure this is caused by bad initialization, but it would be great if you could compare the result with other solvers so we know whether there is a problem :)

lucacarlone commented

@yiping: it is also worth adding "#define SLOW_BUT_CORRECT_BETWEENFACTOR" at the beginning of the file. That uses the correct Jacobians for the between factors, which should improve convergence. Let me know if this helps!

varunagrawal commented

@yiping please ask questions related to functionality on the Google Group. It should have been linked in the issue template.
Closing this since it doesn't seem to be an issue in GTSAM.

lucacarlone commented

@varunagrawal : this might be a bug since the dataset is supposed to converge easily. I suggest reopening and doing some investigation. I'll be happy to help post ICRA.

@varunagrawal varunagrawal reopened this Aug 21, 2021
@varunagrawal varunagrawal added the bug Bug report label Sep 1, 2021
lucacarlone commented

@yiping @varunagrawal : I carefully looked into this and everything looks good. The Jacobians for the Pose2 case match exactly my manual implementation in Matlab and converge to the same solution. The only issue is that with GN the cost increases after the first iteration on the intel dataset. This triggers GTSAM to terminate early. If that stopping condition is disabled, everything converges as expected.

To verify what I'm saying, try setting:

  params.setRelativeErrorTol(-1e+20);
  params.setAbsoluteErrorTol(-1e+20);
  params.setMaxIterations(10);

in the example. Then the optimization for the Intel dataset converges to a nice and small cost (~107) after 4-5 iterations. (The downside is that if the error tolerances are disabled, then GTSAM always performs the maximum number of iterations, but I guess that's a different issue).

For the Pose3 case, I'm already convinced that the Jacobians are correct as long as "SLOW_BUT_CORRECT_BETWEENFACTOR" is defined when including the between factors.

At this point, if @yiping is ok, I suggest closing this issue.

varunagrawal commented

Awesome! Thanks @lucacarlone

varunagrawal commented

On that note, we should consider making progress on #88


yiping commented Sep 21, 2021

@lucacarlone Thanks a lot for the investigation! (And apologies for the delayed response.) I confirm the result for the input_INTEL_g2o dataset using the GN solver (after applying the change you proposed) on a Debian 10 machine.

(GaussNewtonOptimizer)

Adding prior on pose 0 
Optimizing the factor graph
iterations: 30 >? 30
Terminating because reached maximum iterations
Optimization complete
initial error=2.57486e+06
final error=107.927

However, if I switch to the Levenberg-Marquardt optimizer (LevenbergMarquardtOptimizer), the solver does not converge.

(LevenbergMarquardtOptimizer)

Adding prior on pose 0 
Optimizing the factor graph
iterations: 30 >? 30
Terminating because reached maximum iterations
Optimization complete
initial error=2.57486e+06
final error=412341

I also tried the same dataset with the Ceres pose_graph_2d example (the example uses the Levenberg-Marquardt solver, which is the default in Ceres; I don't think Ceres has a vanilla version of GN).

./pose_graph_2d -input input_INTEL_g2o.g2o

The solver also does not converge after 100 iterations (see details below), and the optimized poses are off. The residual cost formulation in the Ceres 2D pose graph example seems pretty straightforward and correct, so I am guessing it's the solver? Do you think this is a case in which GN can perform better than LM?

Solver Summary (v 2.0.0-eigen-(3.3.4)-no_lapack-eigensparse-no_openmp)

                                     Original                  Reduced
Parameter blocks                         3684                     3681
Parameters                               3684                     3681
Residual blocks                          1483                     1483
Residuals                                4449                     4449

Minimizer                        TRUST_REGION

Sparse linear algebra library    EIGEN_SPARSE
Trust region strategy     LEVENBERG_MARQUARDT

                                        Given                     Used
Linear solver          SPARSE_NORMAL_CHOLESKY   SPARSE_NORMAL_CHOLESKY
Threads                                     1                        1
Linear solver ordering              AUTOMATIC                     3681

Cost:
Initial                          3.501161e+06
Final                            1.442424e+04
Change                           3.486737e+06

Minimizer iterations                      101
Successful steps                           82
Unsuccessful steps                         19 

varunagrawal commented

That shouldn't be the case: depending on the lambda value, LM interpolates between gradient descent and GN, so with the right lambda, LM would simply behave like GN.

lucacarlone commented

I think the initialization point here might be tricky: GN fails to decrease the cost from the initial point, but is then able to recover. LM, on the other hand, might end up taking smaller and smaller steps without getting very far from the initialization.

@ProfFan ProfFan added this to the GTSAM 4.2 milestone Apr 19, 2022
dellaert commented

I know this has been a long time, but I am looking to close outstanding issues as we converge on 4.2. From the discussion above I still have no idea why Ceres converges but GTSAM does not; they both just implement GN/LM. Any ideas?

lucacarlone commented

Hi Frank: see my comments above. The issue in the 2D problem is that GTSAM terminates too early (the stopping condition forces GN to stop if the cost increases); if that stopping condition is disabled, GN works as expected. In the 3D case, the Jacobians might be the problem, and defining SLOW_BUT_CORRECT_BETWEENFACTOR should solve it.

dellaert commented

Yeah, I am looking at that PR also.

varunagrawal commented

Quoting @lucacarlone: "At this point, if @yiping is ok, I suggest closing this issue."

Closing this based on @lucacarlone's recommendation.

5 participants