This repository has been archived by the owner on Dec 6, 2022. It is now read-only.

[DRAFT] File data representation. #13

Merged 2 commits on Nov 26, 2018

Conversation

@mikeal (Contributor) commented Sep 21, 2018

Here's a new take on the file data representation that attempts to resolve many of the issues from previous discussions.

This structure is pretty compact, quite simple, and makes seeking into file ranges rather trivial.

SPEC.md Outdated

- 0: Array. Tuple containing two integers, the `start` and `end` offsets of the content.
  - 0: Integer: start offset.
  - 1: Integer: end offset.
@mikeal (Contributor, Author):

An alternative that would be less error prone for implementations would be to give the size of each chunk rather than the end offset.
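
To make the comparison concrete, here's a rough sketch of the two shapes (TypeScript, with made-up type and field names, purely illustrative):

```ts
// Hypothetical shapes for a single part entry, purely for comparison.

// As currently drafted: absolute start and end offsets into the content.
type PartByOffsets = {
  range: [start: number, end: number] // end offset
  link: string                        // placeholder for the chunk's CID
}

// The alternative: start offset plus the size of the chunk.
type PartBySize = {
  start: number
  size: number
  link: string
}

// Converting between the two is trivial; the size form just can't
// express an inconsistent range where end < start, which is the
// "less error prone" property being suggested.
const toBySize = (p: PartByOffsets): PartBySize => ({
  start: p.range[0],
  size: p.range[1] - p.range[0],
  link: p.link,
})
```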

@mikeal mentioned this pull request Sep 21, 2018
@mikeal (Contributor, Author) commented Oct 4, 2018

If there are no objections in the next week I'm going to merge this into the draft branch.

@warpfork commented Oct 4, 2018

So do I read correctly that if we compose a tree of file-data... the topmost node in the tree will have increasingly large lengths for each part entry, because each entry's length will be the sum of all its children's lengths?

That sounds pretty simple and seek friendly indeed. 👍

@mikeal (Contributor, Author) commented Oct 5, 2018

topmost node in the tree will have increasingly large lengths for each part entry, because each entry's length will be the sum of all its children's lengths?

Correct, but there is actually no requirement that the tree be balanced symmetrically.

For instance, it's probably more performant to always keep the first few parts in the root of the tree so that you can start reading a file from the beginning without another branch lookup in the tree. In general, I think the most efficient algorithm for building the tree will be to compact it backwards.
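
To illustrate why that stays seek-friendly regardless of balance, here's a rough sketch (TypeScript, hypothetical structures rather than the exact encoding in SPEC.md, and assuming each node's ranges are expressed within that node's own span):

```ts
// Hypothetical node shapes, not the exact encoding in SPEC.md.
type Part = { start: number; end: number; link: string }
type FileNode =
  | { leaf: true; data: Uint8Array }
  | { leaf: false; parts: Part[] }

// Find the leaf containing `offset` and return its bytes from that point
// on; each level is a binary search over the part ranges, so lookups stay
// cheap no matter how the tree is balanced.
async function seek(
  node: FileNode,
  offset: number,
  load: (link: string) => Promise<FileNode> // e.g. a block-store lookup
): Promise<Uint8Array> {
  if (node.leaf) return node.data.subarray(offset)
  let lo = 0
  let hi = node.parts.length - 1
  while (lo < hi) {
    const mid = (lo + hi) >> 1
    if (node.parts[mid].end <= offset) lo = mid + 1
    else hi = mid
  }
  const part = node.parts[lo]
  // Ranges are assumed relative to this node, so rebase before recursing.
  return seek(await load(part.link), offset - part.start, load)
}
```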

@warpfork commented Oct 6, 2018

We're going to need to be really careful when speccing any tree-layout behaviors like that, to make sure there's one non-contentious Right Answer, for the sake of reproducibility/convergence when multiple uncoordinated users upload the same content.

(True regardless of any tree balancing choice. Just wanted to say it out loud. 😬 )

@warpfork self-requested a review October 6, 2018 10:07
@mikeal (Contributor, Author) commented Oct 6, 2018 via email

@warpfork commented Oct 6, 2018

That's slightly different from what I suggested.

Logging a bunch of meta info does not give reproducibility/convergence when multiple uncoordinated users upload the same content. The "uncoordinated" part is important.

But this is probably something to be hashed out in not-this-PR.

@Stebalien:

Relevant: #18

Otherwise, do we want to support offsets? That is, [start, stop, link, (offset in link)]? That would allow for some pretty nice transformations.

Second, do we implicitly support holes?

Also, do we want to make this its own spec? That is, a sharded byte array spec? (kind of like HAMT being a generic sharded map spec)

@mikeal (Contributor, Author) commented Oct 31, 2018

Also, do we want to make this its own spec? That is, a sharded byte array spec? (kind of like HAMT being a generic sharded map spec)

Eventually, but we should probably punt on it for now in order to get this out quickly. We still don't even have a HAMT spec.

Otherwise, do we want to support offsets? That is, [start, stop, link, (offset in link)]? That would allow for some pretty nice transformations.

I could see how this would be useful in theory but I struggle to imagine the tooling that would end up creating it. Also, you'd want another property for the end of the range in the link.

Second, do we implicitly support holes?

Do unix filesystems support "holes"? If they don't, then I'd argue we should try not to either. I can't articulate the exact issues, but my gut says this could lead to some odd security issues.

@Stebalien:

I could see how this would be useful in theory but I struggle to imagine the tooling that would end up creating it. Also, you'd want another property for the end of the range in the link.

I assume the range would be implied by stop-start.

One use-case would be @warpfork's "alternative dag" thing. That is, we'd be able to, e.g., take an existing file (in IPFS) and transform it into a TAR file while keeping the original blocks. With the current system, we can do this if we start out knowing we want to represent both a tar and a file, but we can't necessarily start with one and convert to the other.

Another use-case is slicing video (although the keyframes make this tricky).
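
To make the offset idea above a bit more concrete (a hypothetical shape, not proposed spec text):

```ts
// Hypothetical extended part entry: [start, stop, link, (offset in link)].
type PartWithOffset = {
  start: number   // position of this slice in the new byte stream (e.g. the tar)
  stop: number    // end of that range; length is implied by stop - start
  link: string    // CID of an existing block from the original file
  offset?: number // where to start reading inside the linked block, default 0
}

// e.g. a tar built around an existing file could interleave newly created
// header blocks with slices of the file's original blocks, referenced by
// link + offset instead of being re-chunked.
const example: PartWithOffset = { start: 512, stop: 1024, link: 'Qm...', offset: 128 }
```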

@Stebalien:

Do unix filesystems support "holes"? If they don't, then I'd argue we should try not to either. I can't articulate the exact issues, but my gut says this could lead to some odd security issues.

They usually do. They're called sparse files.

@Stebalien mentioned this pull request Nov 1, 2018
@mikeal (Contributor, Author) commented Nov 1, 2018

I assume the range would be implied by stop-start.

Ah, yes, that makes sense. What kind of error condition do we want to have when the linked data is smaller than the described range?
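
For example, one option (just a sketch, nothing decided) would be to fail hard rather than pad or truncate silently:

```ts
// One possible rule: a part whose linked block is shorter than the
// range it claims to cover makes the whole node invalid.
function checkPartSize(declared: number, actual: number): void {
  if (actual < declared) {
    throw new Error(
      `invalid part: linked data is ${actual} bytes but the entry ` +
      `describes a range of ${declared} bytes`
    )
  }
}

// e.g. checkPartSize(end - start, linkedBlock.byteLength)
```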

@warpfork commented Nov 2, 2018

I actually also struggle to imagine using the offset-in-link part of the tuple. Tooling that generates such a thing would have to find those reusable bits mid-chunk somehow, and I don't know how one would ever do that efficiently. The general-case solution for doing the right thing without advance coordination or oracles of knowledge is to throw Rabin chunking at it and be done with it... which doesn't produce anything that would use offsets into chunks.

@warpfork commented Nov 2, 2018

"holes" / sparse files

I'd like to exclude these from our definition of unixfs, and I'd dispute that they're usually supported by unix filesystems.

Sparse files are supported by some filesystems, but they're not generally exposed in the POSIX APIs. You can't generally ask a filesystem whether a file has sparse hunks in it. You could infer it from a long run of zeros, but A) that's something I don't think we should really be doing, and B) I don't think most filesystems even do that; you have to use a truncate syscall to allocate a large range of empty file in order to get sparseness effects.

And we don't need to special case this anyway. Long ranges of zeros in a file will... turn into a set of chunks of zeros... which will all dedup with each other.
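
Tiny illustration of the dedup point (Node's crypto module used only as a stand-in for whatever multihash we actually use):

```ts
import { createHash } from 'crypto'

// Every zero-filled chunk in a "sparse" region hashes identically, so a
// content-addressed block store only ever keeps one copy of it.
const chunkSize = 256 * 1024
const digest = (bytes: Uint8Array) =>
  createHash('sha256').update(bytes).digest('hex')

const a = digest(new Uint8Array(chunkSize)) // a chunk of zeros
const b = digest(new Uint8Array(chunkSize)) // another chunk of zeros
console.log(a === b) // true: same block, stored once
```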
