Skip to content
This repository has been archived by the owner on Aug 23, 2020. It is now read-only.

Introducing a repair mechanism for corruptions in the ledger database #1174

Conversation

hmoog
Copy link

@hmoog hmoog commented Nov 20, 2018

Description

This PR implements a repair mechanism for minor corruptions in the IRI database that causes conflicting milestones to be reverted and re-processed without having to reset the whole database.

(In addition it removes some unused imports)

Detailed Description

If IRI faces database corruptions in the "snapshotIndex" of transactions, it can happen that the same transaction gets processed as being confirmed by two or more milestones and therefore booking its balance multiple times leading to inconsistent balances. This causes the nodes to fall out of sync and report "Skipping negative value for address: ..." in an endless loop. The only way to recover from this problem is to do a --rescan or sometimes even remove the database and start a complete resync.

There are numerous reasons why these corruptions in the snapshotIndex can appear:

  • milestones got processed in the wrong order (fixed already but existing databases might still have these corruptions)
  • IRI crashes or gets stopped before the modified snapshotIndex was flushed to the database.
  • Race conditions between different threads that try to write to the same transaction at the same time (for example solid = true + snapshotIndex = xyz) and therefore overwriting the changes of the other thread. Note: Updating just a single property of the transaction causes the whole transaction to be serialized and written again. (THIS HAPPENS ALOT DURING TIMES OF HIGH LOADS / NETWORK ACTIVITY)

Type of change

  • Enhancement (a non-breaking change which adds functionality)

Checklist:

  • My code follows the style guidelines for this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • New and existing unit tests pass locally with my changes

@GalRogozinski GalRogozinski requested review from alon-e and GalRogozinski and removed request for GalRogozinski November 20, 2018 15:18
Copy link
Contributor

@alon-e alon-e left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, my comments are mostly about readability.

@iotaledger iotaledger deleted a comment Nov 22, 2018
for (int i = errorCausingMilestone.index(); i > errorCausingMilestone.index() - repairBackoffCounter; i--) {
milestoneService.resetCorruptedMilestone(i);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just want to be clear on this:
Let say indexes 100 and 101 are corrupt.
First time you go into this method you reset milestone 100.
Second time you reset both milestones 100 and 101?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets say 100 is corrupt because it didnt get its snapshotIndex set correctly and 101 also approves some of the transactions that were taken into account for the balances of 100 already. 101 can therefore not be applied.

We will first reset 101 and try to reapply it - if that fails we reset 101 and 100 and try to reapply both. If that fails we try to reapply 101, 100 and 99 and try to reapply the three of them.

Depending on which transaction didnt get its snapshotIndex correctly set, we might need to go a few milestones back to "find" the one that wasnt processed correctly. sometimes milestones 101 would reference a "broken" tx from milestone 97 for example.

Copy link
Contributor

@GalRogozinski GalRogozinski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due to current problems with buildkite, we see a fail even though the build passes current regression tests.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants