feat: support byte range for raw block requests #101

alanshaw · 2024-05-13T11:39:01Z

Pulls in latest gateway-lib and dagula to allow byte range requests to raw blocks (?format=raw).

supersedes #100

alanshaw · 2024-05-13T14:45:32Z

src/index.js

+    if (searchParams.get('format') === 'car' || headers.get('Accept')?.includes('application/vnd.ipld.car')) {
+      return await handleCar(request, env, ctx)
+    }
+    return handler(request, env, ctx) // pass to other handlers


Note: this single handler split into 2 (well, 3 really) to allow the ?format=raw handler to run before the middleware that rejects requests that have HTTP Range headers.

Can you please add this as a code comment. I'm also not sure I fully understand what you mean just yet

Gozala

Looks good, I do feel however that following two changes would be a recommended, but not required:

Location claims should probably use raw CIDs in the URL as opposed to CAR CIDs. We already had mistagged CAR content which this would avoid. I do have to admit that I don't fully understanding if such a change would invalidate assumptions of other components, in which case it will not be worth it, yet it would be good idea to document that in the code comments.
I think if we have raw CID we should not require additional signals in form of query param or headers. This probably could be a followup improvement

Gozala · 2024-05-13T17:09:02Z

src/index.js

+  return async (request, env, ctx) => {
+    const { headers } = request
+    const { searchParams } = ctx
+    if (!searchParams) throw new Error('missing URL search params')


Why require search params ? Code comment answering this would be helpful.

Oh, so this middleware requires a searchParams: URLSearchParams on the context that some other middleware has extracted from the URL prior to it running.

Gozala · 2024-05-13T17:12:04Z

src/index.js

+    const { headers } = request
+    const { searchParams } = ctx
+    if (!searchParams) throw new Error('missing URL search params')
+    if (searchParams.get('format') === 'raw' || headers.get('Accept')?.includes('application/vnd.ipld.raw')) {


I'm not sure I understand why default content type of application/octet-stream would be refused here ? I would personally expect that specifying raw codec in the CID should be enough signal not to require anything in addition

This middleware is designed to only handle the request if ?format=raw is in the query string or Accept: application/vnd.ipld.raw is in the HTTP headers.

I will make this handler also handle requests when the CID codec is raw when I can, but it requires some other changes to keep parity with what the unixfs handler does when a raw block is requested (mainly content type sniffing).

Gozala · 2024-05-13T17:15:17Z

src/index.js

+    if (searchParams.get('format') === 'car' || headers.get('Accept')?.includes('application/vnd.ipld.car')) {
+      return await handleCar(request, env, ctx)
+    }
+    return handler(request, env, ctx) // pass to other handlers


Can you please add this as a code comment. I'm also not sure I fully understand what you mean just yet

Gozala · 2024-05-13T17:19:18Z

src/lib/blockstore.js

+    if (IndexEntry.isLocated(entry)) {
+      // if host is "w3s.link" then content can be found in CARPARK
+      const url = entry.site.location.find(l => l.hostname === 'w3s.link')
+      // TODO: allow fetch from _any_ URL


Nit: I think of location commitments not just as locations but as permissions authorizing the read, meaning you may not be able to simply read from that http endpoint, you may have to check if commitment is delegated to you (the node) and are able to read it.

Might be worth adding some comment to that end here.

Gozala · 2024-05-13T17:23:01Z

src/lib/blockstore.js

+      const link = Link.parse(url.pathname.split('/')[2])
+      const digestString = base58btc.encode(link.multihash.bytes)
+      const key = `${digestString}/${digestString}.blob`


Nit: I would recommend factoring out this piece of code into a separate function, perhaps something like parseSite(url: URL): { key: string, range: Range }

Or alternatively something like IndexEntry.toBucketGet(entry: IndexEntry): [key:string, options?:GetOptions]

Gozala · 2024-05-13T17:42:05Z

src/lib/dag-index/api.ts

+  }
+}
+
+export type IndexEntry = NotLocatedIndexEntry | LocatedIndexEntry


Nit: I can't stop thinking of two as lost & found as in type IndexEntry = LostIndexEntry | FoundIndexEntry

🤷 if you're ok with that I can get behind it. I was thinking lost implies that you had it at some point...

Gozala · 2024-05-13T17:45:52Z

src/lib/dag-index/api.ts

+}
+
+export type IndexEntry = NotLocatedIndexEntry | LocatedIndexEntry
+
 export interface Index {
  get (c: UnknownLink): Promise<IndexEntry|undefined>


Nit: As alluded earlier I think Result is a much better option because:

Interface tells you exactly why you did not got anything - "could not find entry"

There is a room to grow, as in if we need to do some IO (which can fail) we can distinguish not found from failed to read and potentially more things like unsupported input link etc...

Suggested change

get (c: UnknownLink): Promise<IndexEntry|undefined>

get (c: UnknownLink): Promise<Result<IndexEntry, NotFound>>

Right, I mean I wrote all this before I even knew about Result. I don't disagree (I like the type information Result gives us for errors) but I think we should do a transition to it separately to this PR.

Gozala · 2024-05-13T17:51:56Z

src/lib/dag-index/content-claims.js

@@ -136,7 +133,11 @@ export class ContentClaimsIndex {

        const entries = await decodeIndex(content, block.bytes)
        for (const entry of entries) {
-          this.#cache.set(Link.create(raw.code, entry.multihash), entry)
+          const entryCid = Link.create(raw.code, entry.multihash)
+          // do not overwrite an existing LocatedIndexEntry


Nit: Do we just drop it ? Can we have different entry for the same key ? I think it would really help to address in the comment above why dropping entries after we have one in cache is desired behavior

Gozala · 2024-05-13T18:10:27Z

test/helpers/builder.js

+/** @param {import('multiformats').MultihashDigest} digest */
+export const toBlobKey = digest => {
+  const digestString = base58btc.encode(digest.bytes)
+  return `${digestString}/${digestString}.blob`


Nit: I've been thinking whether doing something like blob/${digestSstring}.blob would be a better option ? At least visually it would be easier to distinguish that looking at car vs blob hashes and potentially other things in the future.

@vasco-santos WDYT? We originally used cid/cid.car so that we could put indexes next to the data, like cid/cid.car.idx but we never did that and used a different bucket instead. So I'm not too bothered about having the extra "directory" per upload.

This would also be a good way to segregate new blob data from the old.

We also do this keying with pathing to better optimize S3 Rate limits. Re R2 Rate limits, this was not rate limited like S3, but last time we talked (6 months + ago), they would be looking into having that possibility on their end.

So, I would push back on doing this change, and prioritize to finish what we need on the reads interface to swap the bucket

Also, we have different tables in Dynamo for Blobs and Store allocations. Therefore, if we ever need to go through the whole list, we can rely on dynamo instead of a list with prefix from bucket

Gozala · 2024-05-13T18:12:36Z

test/helpers/content-claims.js

        const blocks = claims.get(cid) ?? []
-        // @ts-expect-error
-        blocks.push(await encode(invocation))
+        const location = new URL(`https://w3s.link/ipfs/${carCid}?format=raw`)


I think we should use raw CIDs here as opposed to CAR CIDs with format=raw should not we ?

vasco-santos · 2024-05-13T15:26:14Z

package.json

@@ -41,7 +41,7 @@
    "@ipld/dag-json": "^10.1.7",
    "@ipld/dag-pb": "^4.0.8",
    "@web3-storage/content-claims": "^4.0.5",
-    "@web3-storage/gateway-lib": "^4.1.1",
+    "@web3-storage/gateway-lib": "github:web3-storage/gateway-lib#5a027e05be62f985407b46bce70748f543d302b7",


needs update

vasco-santos · 2024-05-15T05:44:05Z

test/helpers/builder.js

+/** @param {import('multiformats').MultihashDigest} digest */
+export const toBlobKey = digest => {
+  const digestString = base58btc.encode(digest.bytes)
+  return `${digestString}/${digestString}.blob`


We also do this keying with pathing to better optimize S3 Rate limits. Re R2 Rate limits, this was not rate limited like S3, but last time we talked (6 months + ago), they would be looking into having that possibility on their end.

So, I would push back on doing this change, and prioritize to finish what we need on the reads interface to swap the bucket

🤖 I have created a release *beep* *boop* --- ## [2.17.0](v2.16.0...v2.17.0) (2024-05-15) ### Features * support byte range for raw block requests ([#101](#101)) ([1ff3bad](1ff3bad)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Alan Shaw added 3 commits May 13, 2024 12:38

feat: support byte range for raw block requests

aba6c0a

fix: type erroras

58e5821

test: add tests

fdaa475

alanshaw marked this pull request as ready for review May 13, 2024 14:43

alanshaw requested review from vasco-santos and Gozala May 13, 2024 14:49

alanshaw commented May 13, 2024

View reviewed changes

Gozala approved these changes May 13, 2024

View reviewed changes

vasco-santos approved these changes May 15, 2024

View reviewed changes

refactor: address review comments

679d792

alanshaw merged commit 1ff3bad into main May 15, 2024
1 check passed

alanshaw deleted the feat/support-byte-range-reqs-for-raw-block branch May 15, 2024 10:06

github-actions bot mentioned this pull request May 15, 2024

chore(main): release 2.17.0 #102

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: support byte range for raw block requests #101

feat: support byte range for raw block requests #101

alanshaw commented May 13, 2024 •

edited

Loading

alanshaw May 13, 2024

Gozala May 13, 2024

Gozala left a comment

Gozala May 13, 2024

alanshaw May 15, 2024

Gozala May 13, 2024

alanshaw May 15, 2024 •

edited

Loading

Gozala May 13, 2024

Gozala May 13, 2024

Gozala May 13, 2024

Gozala May 13, 2024

Gozala May 13, 2024

alanshaw May 14, 2024

Gozala May 13, 2024

alanshaw May 14, 2024 •

edited

Loading

Gozala May 13, 2024

Gozala May 13, 2024

alanshaw May 14, 2024

vasco-santos May 15, 2024

vasco-santos May 15, 2024

Gozala May 13, 2024

vasco-santos May 13, 2024

vasco-santos May 15, 2024

	get (c: UnknownLink): Promise<IndexEntry\|undefined>
	get (c: UnknownLink): Promise<Result<IndexEntry, NotFound>>

feat: support byte range for raw block requests #101

feat: support byte range for raw block requests #101

Conversation

alanshaw commented May 13, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Gozala left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alanshaw May 15, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alanshaw May 14, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alanshaw commented May 13, 2024 •

edited

Loading

alanshaw May 15, 2024 •

edited

Loading

alanshaw May 14, 2024 •

edited

Loading