Introducing a repair mechanism for corruptions in the ledger database #1174
Looks good; my comments are mostly about readability.
Resolved review threads:
src/main/java/com/iota/iri/service/milestone/impl/MilestoneServiceImpl.java
src/main/java/com/iota/iri/service/milestone/impl/LatestSolidMilestoneTrackerImpl.java
src/main/java/com/iota/iri/service/milestone/MilestoneService.java
…viceImpl.java Co-Authored-By: hmoog <hm@mkjc.net>
…development/iri into merge_inconsistentledgerfix
…ilestoneTrackerImpl.java Co-Authored-By: hmoog <hm@mkjc.net>
…viceImpl.java Co-Authored-By: hmoog <hm@mkjc.net>
…java Co-Authored-By: hmoog <hm@mkjc.net>
…development/iri into merge_inconsistentledgerfix
for (int i = errorCausingMilestone.index(); i > errorCausingMilestone.index() - repairBackoffCounter; i--) {
    milestoneService.resetCorruptedMilestone(i);
}
}
I just want to be clear on this:
Let's say indexes 100 and 101 are corrupt.
The first time you go into this method, you reset milestone 100.
The second time, you reset both milestones 100 and 101?
Let's say 100 is corrupt because it did not get its snapshotIndex set correctly, and 101 also approves some of the transactions that were already taken into account for the balances of 100. 101 therefore cannot be applied.
We first reset 101 and try to reapply it. If that fails, we reset 101 and 100 and try to reapply both. If that fails again, we reset 101, 100 and 99 and try to reapply all three.
Depending on which transaction did not get its snapshotIndex set correctly, we might need to go back a few milestones to "find" the one that was not processed correctly. Sometimes milestone 101 references a "broken" tx from milestone 97, for example.
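To make the backoff concrete, here is a minimal sketch of which milestone indexes each repair attempt would reset. It reuses the loop bounds from the diff above; the class name, the indexesToReset helper, and the assumption that repairBackoffCounter starts at 1 and grows by 1 per failed attempt are illustrative and not taken from the PR.

```java
import java.util.ArrayList;
import java.util.List;

class RepairBackoffSketch {
    /**
     * Returns the milestone indexes that one repair attempt resets, newest first.
     * Same bounds as the loop in the diff: start at the failing milestone and
     * walk back repairBackoffCounter steps.
     */
    static List<Integer> indexesToReset(int errorCausingIndex, int repairBackoffCounter) {
        List<Integer> indexes = new ArrayList<>();
        for (int i = errorCausingIndex; i > errorCausingIndex - repairBackoffCounter; i--) {
            indexes.add(i);
        }
        return indexes;
    }

    public static void main(String[] args) {
        // Assumption: the counter is 1 on the first attempt and grows by 1 per failed attempt.
        for (int attempt = 1; attempt <= 3; attempt++) {
            System.out.println("attempt " + attempt + " resets " + indexesToReset(101, attempt));
        }
    }
}
```

Under these assumptions, the first attempt resets only 101, the second resets 101 and 100, and the third resets 101, 100 and 99, which matches the sequence described in the reply.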
Due to current problems with Buildkite, the build is reported as failed even though it passes the current regression tests.
Description
This PR implements a repair mechanism for minor corruptions in the IRI database: conflicting milestones are reverted and re-processed without having to reset the whole database.
(In addition, it removes some unused imports.)
Detailed Description
If IRI encounters database corruptions in the "snapshotIndex" of transactions, the same transaction can be processed as confirmed by two or more milestones. Its balance is then booked multiple times, which leads to inconsistent balances. This causes nodes to fall out of sync and report "Skipping negative value for address: ..." in an endless loop. Without this repair mechanism, the only way to recover from this problem is to do a --rescan, or sometimes even to remove the database and start a complete resync.
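The double booking can be illustrated with a toy model. The sketch below is not IRI code: the Tx class, the confirm and run helpers, the addresses "A" and "B", and the convention that a snapshotIndex of 0 means "not yet confirmed" are all simplifying assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

class DoubleBookingSketch {
    /** Hypothetical, simplified transaction that moves value between two addresses. */
    static class Tx {
        final String from;
        final String to;
        final long value;
        int snapshotIndex = 0; // assumption: 0 means "not yet confirmed by any milestone"

        Tx(String from, String to, long value) {
            this.from = from;
            this.to = to;
            this.value = value;
        }
    }

    /** Books the balance change unless the transaction is already marked as confirmed. */
    static void confirm(Map<String, Long> balances, Tx tx, int milestoneIndex) {
        if (tx.snapshotIndex != 0) {
            return; // already confirmed by an earlier milestone, so skip it
        }
        balances.merge(tx.from, -tx.value, Long::sum);
        balances.merge(tx.to, tx.value, Long::sum);
        tx.snapshotIndex = milestoneIndex;
    }

    /** Confirms one transaction via milestones 100 and 101; optionally loses its snapshotIndex in between. */
    static Map<String, Long> run(boolean corruptSnapshotIndex) {
        Map<String, Long> balances = new HashMap<>();
        balances.put("A", 10L);
        Tx tx = new Tx("A", "B", 10L);
        confirm(balances, tx, 100);
        if (corruptSnapshotIndex) {
            tx.snapshotIndex = 0; // simulated corruption: the confirmation marker is lost
        }
        confirm(balances, tx, 101);
        return balances;
    }

    public static void main(String[] args) {
        System.out.println("intact:  " + run(false));
        System.out.println("corrupt: " + run(true));
    }
}
```

In the intact run, address A ends at 0 and B at 10. In the corrupt run, the same transfer is booked twice, so A ends at -10, which is the kind of negative balance the log message above complains about.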
There are numerous reasons why these corruptions in the snapshotIndex can appear:
Type of change
Checklist: