Implement new index type that also includes multihash code #217

Merged
masih merged 1 commit into master from masih/mh-index-sorted
Sep 7, 2021

Conversation

Member

@masih masih commented Sep 1, 2021

Implement a new CARv2 index that contains enough information to
reconstruct the multihashes of the data payload, since CarIndexSorted
only includes multihash digests. The new index builds on top of the
existing IndexSorted by adding a layer that groups the
multi-width indices by their multihash code.

Note that this index intentionally ignores any record whose CID uses
the multihash.IDENTITY hash.
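
To illustrate why a digest alone is not enough, here is a minimal sketch using only the Go standard library. It assumes the standard multihash layout (`<uvarint code><uvarint length><digest>`); the helper names `decodeMultihash`, `encodeMultihash` and `groupByCode` are hypothetical and are not taken from go-car. It groups digests by multihash code, skips the identity code (0x00), and shows that the full multihash can be rebuilt once the code is known.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// identityCode is the multihash code of the identity "hash" (0x00).
const identityCode = 0x00

// decodeMultihash splits a binary multihash into its code and digest.
// Hypothetical helper; assumed layout: <uvarint code><uvarint length><digest>.
func decodeMultihash(mh []byte) (uint64, []byte, error) {
	code, n := binary.Uvarint(mh)
	if n <= 0 {
		return 0, nil, errors.New("invalid multihash code")
	}
	length, m := binary.Uvarint(mh[n:])
	if m <= 0 {
		return 0, nil, errors.New("invalid multihash length")
	}
	digest := mh[n+m:]
	if uint64(len(digest)) != length {
		return 0, nil, errors.New("digest length mismatch")
	}
	return code, digest, nil
}

// encodeMultihash rebuilds the full multihash from a code and a digest.
// This is the step a digest-only index cannot perform.
func encodeMultihash(code uint64, digest []byte) []byte {
	buf := make([]byte, 2*binary.MaxVarintLen64, 2*binary.MaxVarintLen64+len(digest))
	n := binary.PutUvarint(buf, code)
	n += binary.PutUvarint(buf[n:], uint64(len(digest)))
	return append(buf[:n], digest...)
}

// groupByCode groups digests by multihash code and skips identity multihashes.
func groupByCode(mhs [][]byte) (map[uint64][][]byte, error) {
	groups := make(map[uint64][][]byte)
	for _, mh := range mhs {
		code, digest, err := decodeMultihash(mh)
		if err != nil {
			return nil, err
		}
		if code == identityCode {
			continue // identity records are intentionally not indexed
		}
		groups[code] = append(groups[code], digest)
	}
	return groups, nil
}

func main() {
	sha := encodeMultihash(0x12, make([]byte, 32)) // 0x12 is sha2-256
	id := encodeMultihash(identityCode, []byte("hi"))
	groups, err := groupByCode([][]byte{sha, id})
	if err != nil {
		panic(err)
	}
	for code, digests := range groups {
		fmt.Printf("code 0x%x: %d digest(s)\n", code, len(digests))
	}
}
```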

Add a test that asserts offsets for the same CID across sorted index and
new multihash sorted index are consistent.

Add tests that assert marshalling and unmarshalling of the new index type
work as expected, and that it does not load records with a multihash.IDENTITY digest.

Relates to:

Fixes #214

@willscott willscott requested a review from mvdan September 1, 2021 13:30
@masih
Member Author

masih commented Sep 1, 2021

@willscott This index implementation uses a nested map of length -> multihash.Code -> singleWidthIndex. I am not sure if we ever need to support hash functions that produce variable length digests. IDENTITY is an obvious one but that's a special case which we will be skipping over as captured in #215. If not the storage structure could be simplified.

I'd love to get your feedback whenever you get a chance. 🍻
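
The nested layout described in this comment can be sketched as follows. This is an illustration under assumptions, not the code in the PR: `record`, `bucket` and `nestedIndex` are hypothetical names, and `bucket` merely stands in for `singleWidthIndex` as a slice of records sorted by digest.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// record pairs a digest with the offset of its block in the CAR payload.
type record struct {
	digest []byte
	offset uint64
}

// bucket is a hypothetical stand-in for singleWidthIndex:
// records of one digest width, kept sorted by digest.
type bucket []record

// nestedIndex maps digest length -> multihash code -> bucket.
type nestedIndex map[int]map[uint64]bucket

// add stores a record under its digest length and multihash code.
func (idx nestedIndex) add(code uint64, digest []byte, offset uint64) {
	byCode, ok := idx[len(digest)]
	if !ok {
		byCode = make(map[uint64]bucket)
		idx[len(digest)] = byCode
	}
	byCode[code] = append(byCode[code], record{digest, offset})
}

// sortAll sorts every bucket by digest so that get can binary-search it.
func (idx nestedIndex) sortAll() {
	for _, byCode := range idx {
		for _, b := range byCode {
			sort.Slice(b, func(i, j int) bool {
				return bytes.Compare(b[i].digest, b[j].digest) < 0
			})
		}
	}
}

// get looks up the offset recorded for a (code, digest) pair.
func (idx nestedIndex) get(code uint64, digest []byte) (uint64, bool) {
	b := idx[len(digest)][code] // indexing a missing key yields a nil bucket
	i := sort.Search(len(b), func(i int) bool {
		return bytes.Compare(b[i].digest, digest) >= 0
	})
	if i < len(b) && bytes.Equal(b[i].digest, digest) {
		return b[i].offset, true
	}
	return 0, false
}

func main() {
	idx := nestedIndex{}
	idx.add(0x12, []byte{2, 2}, 20)
	idx.add(0x12, []byte{1, 1}, 10)
	idx.add(0x14, []byte{1, 1}, 30)
	idx.sortAll()

	off, ok := idx.get(0x14, []byte{1, 1})
	fmt.Println(off, ok)
}
```

If every hash function produced digests of a single fixed length, the outer length key would be redundant and a plain code -> bucket map would suffice, which is the simplification being asked about.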

@masih masih marked this pull request as ready for review September 1, 2021 13:30
Review comment on v2/index/indexmhsorted.go (outdated, resolved)
@willscott
Member

> @willscott This index implementation uses a nested map of length -> multihash.Code -> singleWidthIndex. I am not sure if we ever need to support hash functions that produce variable length digests. IDENTITY is an obvious one but that's a special case which we will be skipping over as captured in #215. If not the storage structure could be simplified.
>
> I'd love to get your feedback whenever you get a chance. 🍻

Unless we can think of cases where the simpler structure wouldn't work, let's see if we can get away with it.

@ribasushi

@masih @willscott one of these days when I carve some time to build what I was planning to build before joining PL, I am definitely going to use multihash-level truncated full-sha3-512 ( can go into the weeds why if interested ). So you won't be able to index my content e.g. ( whether this is a showstopper or not: 🤷 )

@ribasushi

> multihash-level truncated full-sha3-512

To be precise: 0x01551420{{32bytes}}
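
For readers unfamiliar with the byte layout, the prefix above can be unpacked with a short sketch. It assumes the usual binary CIDv1 layout of four leading varints (version, content codec, multihash code, digest length); `cidPrefix` and `parsePrefix` are hypothetical names used only for this illustration.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// cidPrefix holds the four leading varints of a binary CIDv1.
type cidPrefix struct {
	Version, Codec, HashCode, DigestLen uint64
}

// parsePrefix reads the four varints and returns how many bytes they used.
func parsePrefix(b []byte) (cidPrefix, int, error) {
	var vals [4]uint64
	pos := 0
	for i := range vals {
		v, n := binary.Uvarint(b[pos:])
		if n <= 0 {
			return cidPrefix{}, 0, fmt.Errorf("bad varint at byte %d", pos)
		}
		vals[i] = v
		pos += n
	}
	return cidPrefix{vals[0], vals[1], vals[2], vals[3]}, pos, nil
}

func main() {
	p, n, err := parsePrefix([]byte{0x01, 0x55, 0x14, 0x20})
	if err != nil {
		panic(err)
	}
	// 0x55 is the raw codec and 0x14 is sha3-512 in the multicodec table.
	fmt.Printf("version=%d codec=0x%x hash=0x%x digestLen=%d (prefix bytes: %d)\n",
		p.Version, p.Codec, p.HashCode, p.DigestLen, n)
}
```

The declared digest length is 32 bytes, while a full sha3-512 digest is 64 bytes. That mismatch is what makes this a truncated digest, and it is why an index that assumes one digest width per hash code would not cover such content.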

@willscott
Member

that means the car will have to be read through block by block to calculate the list of full multi-hashes to provide for indexing, rather than being able to crib it off the index of the car. Eventually, it sounds like we should be able to store the computed announcement for indexer nodes if we do have to do the extra work of computation to generate the announcement. that'll probably happen after MVP (although @masih already wrote the walking code once 😅 )

@masih masih marked this pull request as draft September 1, 2021 14:02
@masih masih marked this pull request as ready for review September 1, 2021 20:52
@willscott
Member

@rvagg anything we should be aware of in putting together a PR for the additional index multicodec format over in multicodecs?

@rvagg
Member

rvagg commented Sep 2, 2021

nope, should be pretty straightforward, just add a 0x0401 and it should get merged in short order

@masih
Member Author

masih commented Sep 2, 2021

> nope, should be pretty straightforward, just add a 0x0401 and it should get merged in short order

thanks @rvagg, going to open a PR for it just now if we are happy with the 0x0401 value
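
As a small illustration of what the 0x0401 value looks like on the wire, the sketch below encodes it as an unsigned varint, the usual encoding for multicodec codes. Whether the serialized index is prefixed in exactly this way is an assumption here, and `encodeCodec` and `decodeCodec` are hypothetical helper names.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// multihashIndexSortedCodec is the multicodec value proposed in this thread.
const multihashIndexSortedCodec = 0x0401

// encodeCodec returns the unsigned-varint encoding of a multicodec code.
func encodeCodec(code uint64) []byte {
	buf := make([]byte, binary.MaxVarintLen64)
	n := binary.PutUvarint(buf, code)
	return buf[:n]
}

// decodeCodec reads a varint-encoded codec from the start of b.
func decodeCodec(b []byte) (uint64, bool) {
	v, n := binary.Uvarint(b)
	return v, n > 0
}

func main() {
	enc := encodeCodec(multihashIndexSortedCodec)
	fmt.Printf("% x\n", enc) // prints "81 08"

	code, ok := decodeCodec(enc)
	fmt.Printf("0x%04x %v\n", code, ok) // prints "0x0401 true"
}
```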

masih added a commit to multiformats/multicodec that referenced this pull request Sep 2, 2021
Define a new codec for CARv2 `MultihashIndexSorted`.

See:
- ipld/go-car#217
- ipld/go-car#214
@masih masih force-pushed the masih/mh-index-sorted branch 2 times, most recently from c28358b to ac2b95a Compare September 2, 2021 10:03
Implement a new CARv2 index that contains enough information to
reconstruct the multihashes of the data payload, since `CarIndexSorted`
only includes multihash digests. The new index builds on top of the
existing `IndexSorted` by adding a layer that groups the
multi-width indices by their multihash code.

Note that this index intentionally ignores any record whose CID uses
the `multihash.IDENTITY` hash.

Add a test that asserts offsets for the same CID across sorted index and
new multihash sorted index are consistent.

Add tests that assert marshalling and unmarshalling of the new index type
work as expected, and that it does not load records with a `multihash.IDENTITY` digest.

Relates to:
- multiformats/multicodec#227

Fixes:
- #214
Member Author

@masih masih left a comment

Self-review

// This implementation decodes multihashes twice: once here to group by code, and once in the
// internals of multiWidthIndex to group by digest length. The code can be optimized by
// combining the grouping logic into one step where the multihash of a CID is only decoded once.
// The optimization would need refactoring of the IndexSorted compaction logic.
Member Author

Postponing the optimisation here to reduce the size of this PR
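
The postponed optimisation, decoding each multihash once and grouping by code and digest width in the same step, might look roughly like the sketch below. It is not the refactoring the comment refers to; `groupKey` and `groupOnce` are hypothetical names, and the multihash layout (`<uvarint code><uvarint length><digest>`) is assumed.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// groupKey combines both grouping dimensions so one decode serves both.
type groupKey struct {
	code  uint64 // multihash code
	width int    // digest length in bytes
}

// groupOnce decodes each multihash exactly once and groups its digest
// by (code, width) in a single pass.
func groupOnce(mhs [][]byte) (map[groupKey][][]byte, error) {
	groups := make(map[groupKey][][]byte)
	for _, mh := range mhs {
		code, n := binary.Uvarint(mh)
		if n <= 0 {
			return nil, fmt.Errorf("invalid multihash code")
		}
		length, m := binary.Uvarint(mh[n:])
		if m <= 0 || uint64(len(mh)-n-m) != length {
			return nil, fmt.Errorf("invalid multihash length")
		}
		digest := mh[n+m:]
		k := groupKey{code, len(digest)}
		groups[k] = append(groups[k], digest)
	}
	return groups, nil
}

func main() {
	mhs := [][]byte{
		{0x12, 0x02, 0xaa, 0xbb},
		{0x12, 0x02, 0xcc, 0xdd},
		{0x14, 0x02, 0xaa, 0xbb},
	}
	groups, err := groupOnce(mhs)
	if err != nil {
		panic(err)
	}
	for k, digests := range groups {
		fmt.Printf("code=0x%x width=%d count=%d\n", k.code, k.width, len(digests))
	}
}
```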

@masih masih merged commit 42b9e28 into master Sep 7, 2021
@masih masih deleted the masih/mh-index-sorted branch September 7, 2021 10:12
masih added a commit to ipld/ipld that referenced this pull request Sep 20, 2021
Update CARv2 specification to include a format for MultihashIndexSorted
and fully-indexed characteristic bitfield.

Relates to:
- ipld/go-car#239
- ipld/go-car#217
masih added a commit to ipld/ipld that referenced this pull request Sep 21, 2021
Update CARv2 specification to include a format for MultihashIndexSorted
and fully-indexed characteristic bitfield.

Relates to:
- ipld/go-car#239
- ipld/go-car#217
masih added a commit to ipld/ipld that referenced this pull request Sep 24, 2021
Update CARv2 specification to include a format for MultihashIndexSorted
and fully-indexed characteristic bitfield.

Relates to:
- ipld/go-car#239
- ipld/go-car#217
masih added a commit to ipld/ipld that referenced this pull request Sep 27, 2021
Update CARv2 specification to include a format for MultihashIndexSorted
and fully-indexed characteristic bitfield.

Relates to:
- ipld/go-car#239
- ipld/go-car#217
Development

Successfully merging this pull request may close these issues.

New Index format that contains the hashing algorithm information
4 participants