
refactor: fixup fvm IPLD flush logic #1810

Merged: 1 commit merged into master on Jul 10, 2023

Conversation

@Stebalien (Member) commented Jul 7, 2023

In preparation for reachability analysis (we're going to re-use this same code).

  • Make it less generic (for performance).
  • Remove blocks from the write buffer as we write them, to avoid duplicate writes.
  • Simplify some of the checks around what is allowed. For example, I'm now allowing CBOR + identity hash, which should have been allowed previously but wasn't (we don't use it, but it still should have been allowed).
  • Remove the explicit 100-byte CID length check. The Cid type already validates that the digest can be no longer than 64 bytes.
  • Be less strict on DagCBOR validation. Counter-intuitively, being overly strict here is dangerous because it gives us more points where implementations can disagree and fork. Instead, we enforce the following rules for DAG_CBOR:
    1. Blocks must have a valid CBOR structure, but values aren't validated. E.g., no UTF-8 validation, no float validation, no minimum encoding requirements, no canonical ordering requirements, etc.
    2. All CBOR values tagged with 42 must be valid CIDs. I.e., a CBOR byte string starting with a 0x00 byte followed by a valid CID with at most a 64-byte digest.
  • Remove recursion. We're likely going to implement a max IPLD graph depth anyway, but we might as well be safe. In terms of memory usage, this logic will allocate O(blocks) for blocks already in memory, so there should be no danger of running out of memory.

Working towards https://github.com/filecoin-project/fvm-pm/issues/715
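The link rule for DAG_CBOR (rule 2 above) can be sketched as a small standalone check. This is an illustration, not the FVM code: the real implementation parses the full CID with `Cid::try_from`, which also enforces the 64-byte digest cap; here we only encode the structural constraint on the tag-42 byte string.

```rust
// Illustrative sketch (not the FVM implementation) of rule 2 above: the
// contents of a CBOR byte string tagged with 42 must start with a 0x00
// multibase-identity prefix, followed by the CID bytes. The real code then
// parses those bytes with `Cid::try_from`, which enforces the 64-byte
// digest limit; that step is elided here.
fn check_dag_cbor_link(tag_42_bytes: &[u8]) -> Result<(), String> {
    match tag_42_bytes.split_first() {
        // 0x00 prefix present and at least one CID byte follows.
        Some((0x00, cid_bytes)) if !cid_bytes.is_empty() => Ok(()),
        Some((prefix, _)) => Err(format!("invalid multibase prefix: 0x{prefix:02x}")),
        None => Err("empty tag-42 byte string".to_string()),
    }
}
```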

```rust
// The actual CID is expected to be a byte string.
if maj != 2 {
    return Err(anyhow!("expected cbor type byte string in input"));
}
if extra > 100 {
```
@Stebalien (Member Author):
This was unnecessary. If the CID is invalid and/or too big, Cid::try_from will catch that.

Comment on lines +155 to 173
```rust
match k.codec() {
    // We ignore piece commitment CIDs.
    FIL_COMMITMENT_UNSEALED | FIL_COMMITMENT_SEALED => continue,
    // We allow raw, cbor, and dag cbor.
    IPLD_RAW | DAG_CBOR | CBOR => (),
    // Everything else is rejected.
    codec => return Err(anyhow!("cid {k} has unexpected codec ({codec})")),
}
```

```rust
// Ignore commitments (not even going to check the hash function).
(FIL_COMMITMENT_UNSEALED | FIL_COMMITMENT_SEALED, _, _) => return Ok(()),
// Fail on anything else. We usually want to continue on error, but there's really no going
// back from here.
(codec, hash, length) => {
    return Err(anyhow!(
        "cid {root} has unexpected codec ({codec}), hash ({hash}), or length ({length})"
    ))
}
```

```rust
// Check the hash construction.
match (k.hash().code(), k.hash().size()) {
    // Allow non-truncated blake2b-256 and identity hashes.
    (BLAKE2B_256, BLAKE2B_LEN) | (IDENTITY, _) => (),
    // Reject everything else.
    (hash, length) => {
        return Err(anyhow!(
            "cid {k} has unexpected multihash (code={hash}, len={length})"
        ))
    }
}
```
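As a self-contained illustration of the codec whitelist in the hunk above (this is not the FVM source; the constant values are the published multicodec codes for raw, cbor, and dag-cbor):

```rust
// Published multicodec codes (see the multicodec table); these stand in
// for the constants imported in the real code.
const IPLD_RAW: u64 = 0x55;
const CBOR: u64 = 0x51;
const DAG_CBOR: u64 = 0x71;

// A CID codec is allowed through the flush only if it is one of the three
// whitelisted codecs; commitments are handled separately, and everything
// else is an error.
fn codec_allowed(codec: u64) -> bool {
    matches!(codec, IPLD_RAW | CBOR | DAG_CBOR)
}
```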
@Stebalien (Member Author):
The previous logic was shorter but hard to reason about.

Contributor:
Yes, this is more readable


```rust
Ok(())

self.base
    .put_many_keyed(take_reachable(&mut self.write.borrow_mut(), root)?)
```
@Stebalien (Member Author):
I changed this to "take" blocks so we wouldn't end up writing them multiple times. The alternative would have been to keep some form of "written" set.

For some history: we used to clear the entire buffer on flush. However, that ended up causing some issues if we, e.g., wanted to flush some intermediate state. Now we take a middle path, deleting the written blocks but leaving everything else.
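A minimal sketch of the "take" behavior described here (hypothetical types and names, not the FVM API; real blocks are keyed by CID and their links are parsed out of DAG-CBOR):

```rust
use std::collections::HashMap;

// Hypothetical block: payload plus the keys of the blocks it links to
// (standing in for CIDs embedded in DAG-CBOR).
struct Block {
    data: Vec<u8>,
    links: Vec<u64>,
}

// Non-recursive, "taking" walk: collect everything reachable from `root`,
// removing each block from the write buffer as it is collected. A block
// that was already flushed (and thus removed) is simply skipped, which is
// what prevents duplicate writes across flushes.
fn take_reachable_sketch(buffer: &mut HashMap<u64, Block>, root: u64) -> Vec<(u64, Vec<u8>)> {
    let mut out = Vec::new();
    let mut worklist = vec![root];
    while let Some(key) = worklist.pop() {
        // `remove` doubles as the "seen" check: once taken, a key is gone,
        // so shared subtrees are emitted only once.
        if let Some(block) = buffer.remove(&key) {
            worklist.extend(block.links.iter().copied());
            out.push((key, block.data));
        }
    }
    out
}
```

Everything left in the buffer after the walk is intermediate state that was never reachable from the flushed root, matching the middle path described above.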

Contributor:
Will this then end up saving disk space?

@Stebalien (Member Author):
Lotus checks whether we have a block before writing it, so it shouldn't matter from a disk-space perspective. However, it should save some time:

  1. If we have to flush multiple times (although I think we only do this in some tools/tests).
  2. If we have identical subtrees and/or common blocks. I could have solved that by keeping a temporary "seen" set while flushing instead of actually removing the flushed blocks from the write buffer, but this way was faster/simpler.

@Stebalien Stebalien requested a review from fridrik01 July 7, 2023 01:08
@codecov-commenter:
Codecov Report

Merging #1810 (bdaf6a3) into master (8f33aa6) will increase coverage by 0.11%.
The diff coverage is 71.42%.

```
@@            Coverage Diff             @@
##           master    #1810      +/-   ##
==========================================
+ Coverage   75.20%   75.31%   +0.11%
==========================================
  Files         149      149
  Lines       14593    14558      -35
==========================================
- Hits        10975    10965      -10
+ Misses       3618     3593      -25
```

Impacted files:
  fvm/src/blockstore/buffered.rs: 82.52% <71.42%> (+8.25%) ⬆️

... and 1 file with indirect coverage changes

Comment on lines +70 to +77
```rust
let val = match low {
    ..=23 => low.into(),
    24 => read_fixed::<1>(br)?[0].into(),
    25 => u16::from_be_bytes(read_fixed(br)?).into(),
    26 => u32::from_be_bytes(read_fixed(br)?).into(),
    27 => u64::from_be_bytes(read_fixed(br)?),
    _ => return Err(anyhow!("invalid header cbor_read_header_buf")),
};
```
Contributor:

Great refactor for improving readability!

Regarding performance: IIUC, this is now doing 2 extra allocations per call instead of using a scratch buffer as before?

@Stebalien (Member Author) commented Jul 10, 2023:

Stack allocations, so they should be free (except for zeroing, but that's pretty cheap and Rust can likely optimize it away).
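A sketch of what a `read_fixed`-style helper looks like (the signature is assumed from the call sites above, not taken from the FVM source). The buffer is a const-generic stack array, which is the point being made here: no heap allocation per call, just a cheap zero-initialization.

```rust
use std::io::Read;

// Assumed shape of `read_fixed`, inferred from the call sites above: read
// exactly N bytes into a stack-allocated array. `[0u8; N]` lives on the
// stack, so there is no heap allocation; the only overhead is zeroing the
// array, which the optimizer can often elide.
fn read_fixed<const N: usize>(reader: &mut impl Read) -> std::io::Result<[u8; N]> {
    let mut buf = [0u8; N];
    reader.read_exact(&mut buf)?;
    Ok(buf)
}
```

In expressions like `u16::from_be_bytes(read_fixed(br)?)`, the array size `N` is inferred from the `from_be_bytes` argument type, so only the `read_fixed::<1>` call needs an explicit turbofish.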


@fridrik01 (Contributor) left a review:
LGTM

@Stebalien Stebalien merged commit 22ba961 into master Jul 10, 2023
@Stebalien Stebalien deleted the refactor/fvm-ipld-flush branch July 10, 2023 18:48