Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support byte range for raw block requests #101

Merged
merged 4 commits into from
May 15, 2024

Conversation

alanshaw
Copy link
Member

@alanshaw alanshaw commented May 13, 2024

Pulls in latest gateway-lib and dagula to allow byte range requests to raw blocks (?format=raw).

supersedes #100

@alanshaw alanshaw marked this pull request as ready for review May 13, 2024 14:43
if (searchParams.get('format') === 'car' || headers.get('Accept')?.includes('application/vnd.ipld.car')) {
return await handleCar(request, env, ctx)
}
return handler(request, env, ctx) // pass to other handlers
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this single handler split into 2 (well, 3 really) to allow the ?format=raw handler to run before the middleware that rejects requests that have HTTP Range headers.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add this as a code comment. I'm also not sure I fully understand what you mean just yet

Copy link

@Gozala Gozala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, I do feel however that following two changes would be a recommended, but not required:

  1. Location claims should probably use raw CIDs in the URL as opposed to CAR CIDs. We already had mistagged CAR content which this would avoid. I do have to admit that I don't fully understanding if such a change would invalidate assumptions of other components, in which case it will not be worth it, yet it would be good idea to document that in the code comments.
  2. I think if we have raw CID we should not require additional signals in form of query param or headers. This probably could be a followup improvement

return async (request, env, ctx) => {
const { headers } = request
const { searchParams } = ctx
if (!searchParams) throw new Error('missing URL search params')
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why require search params ? Code comment answering this would be helpful.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, so this middleware requires a searchParams: URLSearchParams on the context that some other middleware has extracted from the URL prior to it running.

const { headers } = request
const { searchParams } = ctx
if (!searchParams) throw new Error('missing URL search params')
if (searchParams.get('format') === 'raw' || headers.get('Accept')?.includes('application/vnd.ipld.raw')) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand why default content type of application/octet-stream would be refused here ? I would personally expect that specifying raw codec in the CID should be enough signal not to require anything in addition

Copy link
Member Author

@alanshaw alanshaw May 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This middleware is designed to only handle the request if ?format=raw is in the query string or Accept: application/vnd.ipld.raw is in the HTTP headers.

I will make this handler also handle requests when the CID codec is raw when I can, but it requires some other changes to keep parity with what the unixfs handler does when a raw block is requested (mainly content type sniffing).

if (searchParams.get('format') === 'car' || headers.get('Accept')?.includes('application/vnd.ipld.car')) {
return await handleCar(request, env, ctx)
}
return handler(request, env, ctx) // pass to other handlers
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add this as a code comment. I'm also not sure I fully understand what you mean just yet

if (IndexEntry.isLocated(entry)) {
// if host is "w3s.link" then content can be found in CARPARK
const url = entry.site.location.find(l => l.hostname === 'w3s.link')
// TODO: allow fetch from _any_ URL
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think of location commitments not just as locations but as permissions authorizing the read, meaning you may not be able to simply read from that http endpoint, you may have to check if commitment is delegated to you (the node) and are able to read it.

Might be worth adding some comment to that end here.

Comment on lines 47 to 49
const link = Link.parse(url.pathname.split('/')[2])
const digestString = base58btc.encode(link.multihash.bytes)
const key = `${digestString}/${digestString}.blob`
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I would recommend factoring out this piece of code into a separate function, perhaps something like parseSite(url: URL): { key: string, range: Range }

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or alternatively something like IndexEntry.toBucketGet(entry: IndexEntry): [key:string, options?:GetOptions]

}
}

export type IndexEntry = NotLocatedIndexEntry | LocatedIndexEntry
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I can't stop thinking of two as lost & found as in type IndexEntry = LostIndexEntry | FoundIndexEntry

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤷 if you're ok with that I can get behind it. I was thinking lost implies that you had it at some point...

}

export type IndexEntry = NotLocatedIndexEntry | LocatedIndexEntry

export interface Index {
get (c: UnknownLink): Promise<IndexEntry|undefined>
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: As alluded earlier I think Result is a much better option because:

  1. Interface tells you exactly why you did not got anything - "could not find entry"
  2. There is a room to grow, as in if we need to do some IO (which can fail) we can distinguish not found from failed to read and potentially more things like unsupported input link etc...
Suggested change
get (c: UnknownLink): Promise<IndexEntry|undefined>
get (c: UnknownLink): Promise<Result<IndexEntry, NotFound>>

Copy link
Member Author

@alanshaw alanshaw May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I mean I wrote all this before I even knew about Result. I don't disagree (I like the type information Result gives us for errors) but I think we should do a transition to it separately to this PR.

@@ -136,7 +133,11 @@ export class ContentClaimsIndex {

const entries = await decodeIndex(content, block.bytes)
for (const entry of entries) {
this.#cache.set(Link.create(raw.code, entry.multihash), entry)
const entryCid = Link.create(raw.code, entry.multihash)
// do not overwrite an existing LocatedIndexEntry
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Do we just drop it ? Can we have different entry for the same key ? I think it would really help to address in the comment above why dropping entries after we have one in cache is desired behavior

/** @param {import('multiformats').MultihashDigest} digest */
export const toBlobKey = digest => {
const digestString = base58btc.encode(digest.bytes)
return `${digestString}/${digestString}.blob`
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I've been thinking whether doing something like blob/${digestSstring}.blob would be a better option ? At least visually it would be easier to distinguish that looking at car vs blob hashes and potentially other things in the future.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vasco-santos WDYT? We originally used cid/cid.car so that we could put indexes next to the data, like cid/cid.car.idx but we never did that and used a different bucket instead. So I'm not too bothered about having the extra "directory" per upload.

This would also be a good way to segregate new blob data from the old.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also do this keying with pathing to better optimize S3 Rate limits. Re R2 Rate limits, this was not rate limited like S3, but last time we talked (6 months + ago), they would be looking into having that possibility on their end.

So, I would push back on doing this change, and prioritize to finish what we need on the reads interface to swap the bucket

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we have different tables in Dynamo for Blobs and Store allocations. Therefore, if we ever need to go through the whole list, we can rely on dynamo instead of a list with prefix from bucket

const blocks = claims.get(cid) ?? []
// @ts-expect-error
blocks.push(await encode(invocation))
const location = new URL(`https://w3s.link/ipfs/${carCid}?format=raw`)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should use raw CIDs here as opposed to CAR CIDs with format=raw should not we ?

package.json Outdated
@@ -41,7 +41,7 @@
"@ipld/dag-json": "^10.1.7",
"@ipld/dag-pb": "^4.0.8",
"@web3-storage/content-claims": "^4.0.5",
"@web3-storage/gateway-lib": "^4.1.1",
"@web3-storage/gateway-lib": "github:web3-storage/gateway-lib#5a027e05be62f985407b46bce70748f543d302b7",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs update

/** @param {import('multiformats').MultihashDigest} digest */
export const toBlobKey = digest => {
const digestString = base58btc.encode(digest.bytes)
return `${digestString}/${digestString}.blob`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also do this keying with pathing to better optimize S3 Rate limits. Re R2 Rate limits, this was not rate limited like S3, but last time we talked (6 months + ago), they would be looking into having that possibility on their end.

So, I would push back on doing this change, and prioritize to finish what we need on the reads interface to swap the bucket

@alanshaw alanshaw merged commit 1ff3bad into main May 15, 2024
1 check passed
@alanshaw alanshaw deleted the feat/support-byte-range-reqs-for-raw-block branch May 15, 2024 10:06
alanshaw pushed a commit that referenced this pull request May 15, 2024
🤖 I have created a release *beep* *boop*
---


##
[2.17.0](v2.16.0...v2.17.0)
(2024-05-15)


### Features

* support byte range for raw block requests
([#101](#101))
([1ff3bad](1ff3bad))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants