Splitstore: proposal for MaxHotBytesTarget and space adaptive GC #10388

Closed
5 of 15 tasks
ZenGround0 opened this issue Mar 3, 2023 · 0 comments
Labels
kind/enhancement Kind: Enhancement

Comments

ZenGround0 (Contributor) commented Mar 3, 2023

Checklist

  • This is not a new feature or an enhancement to the Filecoin protocol. If it is, please open an FIP issue.
  • This is not a new feature request. If it is, please file a feature request instead.
  • This is not brainstorming ideas. If you have an idea you'd like to discuss, please open a new discussion on the lotus forum and select the category as Ideas.
  • I have a specific, actionable, and well motivated improvement to propose.

Lotus component

  • lotus daemon - chain sync
  • lotus miner - mining and block production
  • lotus miner/worker - sealing
  • lotus miner - proving(WindowPoSt)
  • lotus miner/market - storage deal
  • lotus miner/market - retrieval deal
  • lotus miner/market - data transfer
  • lotus client
  • lotus JSON-RPC API
  • lotus message management (mpool)
  • Other

Improvement Suggestion

Breakdown of splitstore problems and needs

Here is a useful breakdown from @f8-ptrk categorizing splitstore problems:

three things are important for us:
a) we know its upper boundaries in resource usage
b) we can rely on it not causing the node to fall out of sync
c) we can rely on it to have everything available for sealing/deals

Point c) is about soundness of compaction. There is no evidence of problems with this today.

An attempt at fixing issues with point b) is being addressed in #10387 by limiting the worker goroutines running flatten operations. It's possible (likely?) that problems remain here. To the extent this is still a problem, the bottleneck is probably contention between GC and block-sync reads/writes at the badger level. We need to keep monitoring this issue, and if it persists we can revisit designing something that prioritizes block sync over GC. The easier solutions involve coarsely shutting off badger GC / compaction entirely after somehow detecting that the node is out of sync. Harder solutions might involve structurally doing compaction's badger access at a lower priority than chain-sync access; it's unclear whether this is possible.
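
For the coarse option, the gate could look something like the Go sketch below. The type, the lag heuristic, and the method names are illustrative assumptions, not existing Lotus APIs:

```go
package splitstore

import (
	"log"
	"time"
)

// Coarse gate sketch: skip hotstore/badger GC entirely while the node appears
// to be behind chain head. SplitStore, headTimestamp, and gcHotstore are
// illustrative stand-ins, not the real Lotus types or methods.
type SplitStore struct {
	headTimestamp func() time.Time // time of the heaviest tipset we have synced
	gcHotstore    func() error     // whatever online/moving GC we would normally run
}

const maxAllowedLag = 5 * time.Minute // assumed cutoff for "out of sync"

func (s *SplitStore) maybeRunGC() {
	if time.Since(s.headTimestamp()) > maxAllowedLag {
		log.Println("skipping hotstore GC: node appears to be out of sync")
		return
	}
	if err := s.gcHotstore(); err != nil {
		log.Printf("hotstore GC failed: %s", err)
	}
}
```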

Failures related to point a) present as disk usage exceeding the space available on the device and crashing the daemon. This proposal is about making that failure mode much harder to reach.

Fixing disk overflows

Measuring splitstore needs

The first thing we can do is make available information about the splitstore's last marked hotstore set so that users know the absolute minimum space requirement. We can also keep a record of the size of the purged set so users can get a sense of the high-water mark. This information can then be used along with the gc command in #10387 to know roughly how much garbage is in the store.

We can get the marked set size without increasing compaction load by tweaking the walk functions to measure the bytes of all loaded graphs and return a count. To get the full picture we would also need to load the sizes of the blocks in the purged graph, which would make the discard coldstore's post-walk processing approach the cost of the universal coldstore's post-walk processing.
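
As a rough illustration of the walk-side measurement, something like the accumulator below could be threaded through the mark walk. The hook signature and names are assumptions, not the actual splitstore code:

```go
package splitstore

import (
	"sync/atomic"

	blocks "github.com/ipfs/go-block-format"
)

// sizeCounter is a hypothetical accumulator threaded through the mark walk so
// that compaction can report the total bytes (and count) of the marked hot set
// without a second pass.
type sizeCounter struct {
	bytes int64
	count int64
}

// observe would be called once per block visited during the walk.
func (sc *sizeCounter) observe(blk blocks.Block) {
	atomic.AddInt64(&sc.bytes, int64(len(blk.RawData())))
	atomic.AddInt64(&sc.count, 1)
}

// MarkedSetBytes is the number that would be surfaced to operators (e.g. via a
// splitstore info command) as the minimum hotstore space requirement.
func (sc *sizeCounter) MarkedSetBytes() int64 { return atomic.LoadInt64(&sc.bytes) }
```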

Automating avoidance of a max value

The next thing we can do is use the measurement of the marked graph, together with a measurement of the existing total badger size, to determine whether moving GC (which temporarily needs the current size plus the marked hotstore size) will overload the disk. We can then modify hotstore GC behavior to never cross that boundary by 1) pre-emptively running moving GC when we are getting close to the point where a move would overload the disk, and 2) forcing a more aggressive online GC with a lower threshold if we pass the point where a move would overload the disk.

While we could use the filesystem's remaining space as our target, it will likely be friendlier for operators to let them specify a new config value, MaxHotBytesTarget, in the splitstore config.
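
A minimal sketch of the proposed decision rule, assuming a MaxHotBytesTarget byte budget; the names and the 80% headroom cutoff are placeholders, not a final design:

```go
package splitstore

// gcPlan enumerates the three behaviors described above.
type gcPlan int

const (
	onlineGC           gcPlan = iota // regular online GC at the default threshold
	movingGC                         // pre-emptive moving GC while there is still headroom
	aggressiveOnlineGC               // online GC with a lowered threshold once a move no longer fits
)

// chooseGC picks a strategy given the current badger size on disk, the size of
// the last marked hot set, and the operator-configured byte budget.
func chooseGC(currentBytes, markedBytes, maxHotBytesTarget int64) gcPlan {
	// A moving GC temporarily needs room for the current store plus a fresh
	// copy of the live (marked) set.
	moveFootprint := currentBytes + markedBytes

	switch {
	case moveFootprint > maxHotBytesTarget:
		// Moving GC would blow the budget: fall back to a more aggressive
		// online GC with a lower badger threshold to claw back space in place.
		return aggressiveOnlineGC
	case float64(moveFootprint) > 0.8*float64(maxHotBytesTarget):
		// Getting close to the point where a move would no longer fit:
		// do the moving GC now, pre-emptively.
		return movingGC
	default:
		return onlineGC
	}
}
```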

More ideas along these lines

Other ideas we could pursue later include 1) ramping down the online GC threshold if we are not collecting enough garbage during online GC, and 2) estimating the online GC threshold from the relation total live block size / total badger datastore size = 1 - g, where g is the average garbage fraction of badger vlogs. The second idea would benefit from a deeper investigation of how badger vlog garbage fractions are distributed. They're probably roughly normal, but the story may be more complicated. With a good understanding of this distribution we could work out what threshold value to set in order to delete a given percentage of garbage, what overhead in time / bytes loaded each threshold value would cause, and the expected size of a GC walk before we hit a vlog with a lower fraction and terminate.
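
A rough sketch of idea 2), assuming the live byte count comes from the mark walk and the total from badger's on-disk size; the clamping bounds and default are illustrative assumptions:

```go
package splitstore

// estimateGCThreshold derives an online GC threshold from the measured garbage
// fraction: with liveBytes/totalBytes = 1 - g, the average vlog garbage
// fraction is g = 1 - liveBytes/totalBytes, and passing something at or just
// below g to badger's value-log GC should, on average, make a vlog eligible
// for collection.
func estimateGCThreshold(liveBytes, totalBytes int64) float64 {
	if totalBytes <= 0 {
		return 0.5 // fall back to a conventional discard ratio
	}
	g := 1.0 - float64(liveBytes)/float64(totalBytes)

	// Keep the threshold in a sane range: very low values make GC rewrite
	// nearly-full vlogs, very high values collect almost nothing.
	switch {
	case g < 0.01:
		return 0.01
	case g > 0.9:
		return 0.9
	default:
		return g
	}
}
```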
