Introducing a repair mechanism for corruptions in the ledger database #1174

hmoog · 2018-11-20T15:17:14Z

Description

This PR implements a repair mechanism for minor corruptions in the IRI database. Conflicting milestones are reverted and re-processed without having to reset the whole database.

(In addition, it removes some unused imports.)

Detailed Description

If the "snapshotIndex" of transactions gets corrupted in the IRI database, the same transaction can be processed as confirmed by two or more milestones. Its balance change is then booked multiple times, which leads to inconsistent balances. This causes nodes to fall out of sync and report "Skipping negative value for address: ..." in an endless loop. Without this fix, the only way to recover from this problem is to do a --rescan or, in some cases, to remove the database and start a complete resync.
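The double booking described above can be illustrated with a minimal sketch. It is not IRI code: the `Tx` class, the `applyMilestone` helper and the address "A" are hypothetical names invented for this example, and the ledger is reduced to a plain map of balances. The sketch only assumes what the description states, namely that a transaction is skipped when its snapshotIndex is already set.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, heavily simplified model; none of these names come from IRI.
final class DoubleBookingSketch {

    static final class Tx {
        final String address;
        final long value;
        int snapshotIndex; // 0 means "not yet confirmed by any milestone"

        Tx(String address, long value) {
            this.address = address;
            this.value = value;
        }
    }

    // Books every not-yet-confirmed transaction and marks it with the milestone index.
    static void applyMilestone(int milestoneIndex, List<Tx> approved, Map<String, Long> balances) {
        for (Tx tx : approved) {
            if (tx.snapshotIndex != 0) {
                continue; // already booked by an earlier milestone
            }
            balances.merge(tx.address, tx.value, Long::sum);
            tx.snapshotIndex = milestoneIndex;
        }
    }

    // Applies milestones 100 and 101, which both approve the same spend from address "A".
    static Map<String, Long> run(boolean loseSnapshotIndex) {
        Map<String, Long> balances = new HashMap<>();
        balances.put("A", 10L);
        Tx spend = new Tx("A", -10L);
        List<Tx> approved = Collections.singletonList(spend);

        applyMilestone(100, approved, balances);
        if (loseSnapshotIndex) {
            spend.snapshotIndex = 0; // simulates the corrupted (lost) snapshotIndex
        }
        applyMilestone(101, approved, balances);
        return balances;
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // {A=0}
        System.out.println(run(true));  // {A=-10}
    }
}
```

With an intact snapshotIndex the spend is booked once and the balance ends at 0. With the lost snapshotIndex the same spend is booked twice and the balance becomes negative, which corresponds to the kind of state the node complains about.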

There are numerous reasons why these corruptions in the snapshotIndex can appear:

Milestones got processed in the wrong order (already fixed, but existing databases might still have these corruptions).
IRI crashes or gets stopped before the modified snapshotIndex has been flushed to the database.
Race conditions between different threads that try to write to the same transaction at the same time (for example solid = true + snapshotIndex = xyz), so that one thread overwrites the changes of the other. Note: updating just a single property of the transaction causes the whole transaction to be serialized and written again. (This happens a lot during times of high load / network activity.)
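The race condition in the last point is a classic lost update. The following sketch replays one possible interleaving sequentially so that the outcome is deterministic. `TxRecord` and `Store` are hypothetical stand-ins invented for this example and are not IRI's persistence classes; the only assumption taken from the description is that saving one changed property rewrites the whole record.

```java
// Hypothetical model of whole-record writes; not IRI's storage layer.
final class LostUpdateSketch {

    static final class TxRecord {
        boolean solid;
        int snapshotIndex;

        TxRecord copy() {
            TxRecord c = new TxRecord();
            c.solid = solid;
            c.snapshotIndex = snapshotIndex;
            return c;
        }
    }

    // Every save replaces the complete record, as described in the note above.
    static final class Store {
        private TxRecord persisted = new TxRecord();

        TxRecord load() {
            return persisted.copy();
        }

        void save(TxRecord record) {
            persisted = record.copy();
        }
    }

    // Replays the interleaving: both writers load before either of them saves.
    static TxRecord interleavedWrites() {
        Store store = new Store();
        TxRecord seenBySolidifier = store.load();       // snapshotIndex == 0
        TxRecord seenByMilestoneTracker = store.load(); // solid == false

        seenByMilestoneTracker.snapshotIndex = 100;
        store.save(seenByMilestoneTracker);

        seenBySolidifier.solid = true;
        store.save(seenBySolidifier); // writes back the stale snapshotIndex 0

        return store.load();
    }

    public static void main(String[] args) {
        TxRecord result = interleavedWrites();
        // prints "solid=true snapshotIndex=0"
        System.out.println("solid=" + result.solid + " snapshotIndex=" + result.snapshotIndex);
    }
}
```

The second save succeeds without any error, yet the snapshotIndex written by the first writer is gone. That silent loss is what makes this kind of corruption hard to notice until a later milestone books the transaction again.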

Type of change

Enhancement (a non-breaking change which adds functionality)

Checklist:

My code follows the style guidelines for this project
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
New and existing unit tests pass locally with my changes

alon-e

Looks good; my comments are mostly about readability.

src/main/java/com/iota/iri/service/milestone/impl/MilestoneServiceImpl.java

src/main/java/com/iota/iri/service/milestone/impl/LatestSolidMilestoneTrackerImpl.java

src/main/java/com/iota/iri/service/milestone/impl/MilestoneServiceImpl.java

src/main/java/com/iota/iri/service/milestone/MilestoneService.java

GalRogozinski · 2018-11-22T10:45:57Z

src/main/java/com/iota/iri/service/milestone/impl/LatestSolidMilestoneTrackerImpl.java

+        for (int i = errorCausingMilestone.index(); i > errorCausingMilestone.index() - repairBackoffCounter; i--) {
+            milestoneService.resetCorruptedMilestone(i);
+        }
+    }


I just want to be clear on this:
Let's say indexes 100 and 101 are corrupt.
The first time you go into this method, you reset milestone 100.
The second time, you reset both milestones 100 and 101?

hmoog · 2018-11-22

Let's say 100 is corrupt because it didn't get its snapshotIndex set correctly, and 101 also approves some of the transactions that were already taken into account for the balances of 100. 101 can therefore not be applied.

We will first reset 101 and try to reapply it. If that fails, we reset 101 and 100 and try to reapply both. If that fails, we reset 101, 100 and 99 and try to reapply the three of them.

Depending on which transaction didn't get its snapshotIndex set correctly, we might need to go back a few milestones to "find" the one that wasn't processed correctly. Sometimes milestone 101 would reference a "broken" tx from milestone 97, for example.
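To make the widening window concrete, here is a small sketch that reuses the loop bounds from the diff quoted above and only collects the indexes instead of resetting anything. The class and method names are hypothetical. How the real tracker increases repairBackoffCounter between attempts is not shown in this excerpt, so the sketch assumes it grows by one per failed attempt, as the explanation suggests.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; only the loop bounds are taken from the quoted diff.
final class RepairBackoffSketch {

    // Returns the milestone indexes that one repair attempt would reset, newest first.
    static List<Integer> milestonesToReset(int errorCausingIndex, int repairBackoffCounter) {
        List<Integer> indexes = new ArrayList<>();
        for (int i = errorCausingIndex; i > errorCausingIndex - repairBackoffCounter; i--) {
            indexes.add(i);
        }
        return indexes;
    }

    public static void main(String[] args) {
        // Assumed: the counter is incremented by one after each failed attempt.
        for (int attempt = 1; attempt <= 3; attempt++) {
            System.out.println("attempt " + attempt + ": " + milestonesToReset(101, attempt));
        }
        // attempt 1: [101]
        // attempt 2: [101, 100]
        // attempt 3: [101, 100, 99]
    }
}
```

Under that assumption, reaching a broken transaction from milestone 97 while failing at milestone 101 would take five attempts.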

GalRogozinski

Due to current problems with Buildkite, we see a failure even though the build passes the current regression tests.

Feat: introducing a repair mechanism for corrupted ledger states

87a0b2e

GalRogozinski requested review from alon-e and GalRogozinski and removed request for GalRogozinski November 20, 2018 15:18

alon-e suggested changes Nov 21, 2018

alon-e and others added 6 commits November 21, 2018 17:26

Update src/main/java/com/iota/iri/service/milestone/impl/MilestoneSer…

0dc7420

…viceImpl.java Co-Authored-By: hmoog <hm@mkjc.net>

Refactor: refactored some utility methods to increase readability

451edca

Merge branch 'merge_inconsistentledgerfix' of https://github.com/iota…

7ed0f93

…development/iri into merge_inconsistentledgerfix

Update src/main/java/com/iota/iri/service/milestone/impl/LatestSolidM…

507f8a6

…ilestoneTrackerImpl.java Co-Authored-By: hmoog <hm@mkjc.net>

Update src/main/java/com/iota/iri/service/milestone/impl/MilestoneSer…

e131a05

…viceImpl.java Co-Authored-By: hmoog <hm@mkjc.net>

Update src/main/java/com/iota/iri/service/milestone/MilestoneService.…

309ec58

…java Co-Authored-By: hmoog <hm@mkjc.net>

alon-e approved these changes Nov 21, 2018

Hans Moog added 2 commits November 22, 2018 04:37

Refactor: removed unused parameter

b6882c6

Merge branch 'merge_inconsistentledgerfix' of https://github.com/iota…

ba98aaa

…development/iri into merge_inconsistentledgerfix

iotaledger deleted a comment Nov 22, 2018

GalRogozinski reviewed Nov 22, 2018

Merge branch 'dev-localsnapshots' into merge_inconsistentledgerfix

ac73dcc

GalRogozinski mentioned this pull request Nov 22, 2018

Race condition while updating transactions #1187

Closed

GalRogozinski approved these changes Nov 25, 2018

GalRogozinski merged commit 506e074 into iotaledger:dev-localsnapshots Nov 25, 2018

jakubcech mentioned this pull request Nov 27, 2018

IRI enters into a continuous loop when negative balance is encountered #827

Open

jakubcech mentioned this pull request Apr 29, 2019

Milestone repair mechanism regression test #1423

Open

oracle58 mentioned this pull request May 8, 2019

Use of helix.api/value-transactions throws node into infinite loop when making value txs HelixNetwork/pendulum#34

Closed

hmoog commented Nov 20, 2018 • edited by GalRogozinski
