This repository has been archived by the owner on Dec 6, 2022. It is now read-only.

[DRAFT] File data representation. #13

Merged 2 commits on Nov 26, 2018

Conversation

@mikeal (Contributor) commented Sep 21, 2018

Here's a new take on the file data representation that attempts to resolve many of the issues from previous discussions.

This structure is pretty compact, quite simple, and makes seeking into file ranges rather trivial.

SPEC.md Outdated

- 0: Array. Tuple containing two integers, the `start` and `end` offsets of the content.
  - 0: Integer: start offset.
  - 1: Integer: end offset.
@mikeal (Contributor, Author):

An alternative that would be less error prone for implementations would be to give the size of each chunk rather than the end offset.
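
To make the comparison concrete, here's a rough sketch of the two shapes (TypeScript, with made-up type and field names, purely illustrative):

```ts
// Hypothetical shapes for a single part entry, purely for comparison.

// As currently drafted: absolute start and end offsets into the content.
type PartByOffsets = {
  range: [start: number, end: number] // end offset
  link: string                        // placeholder for the chunk's CID
}

// The alternative: start offset plus the size of the chunk.
type PartBySize = {
  start: number
  size: number
  link: string
}

// Converting between the two is trivial; the size form just can't
// express an inconsistent range where end < start, which is the
// "less error prone" property being suggested.
const toBySize = (p: PartByOffsets): PartBySize => ({
  start: p.range[0],
  size: p.range[1] - p.range[0],
  link: p.link,
})
```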

@mikeal mentioned this pull request Sep 21, 2018
@mikeal (Contributor, Author) commented Oct 4, 2018

If there are no objections in the next week I'm going to merge this into the draft branch.

@warpfork commented Oct 4, 2018

So do I read correctly that if we compose a tree of file-data... the topmost node in the tree will have increasingly large lengths for each part entry, because each entry's length will be the sum of all its children's lengths?

That sounds pretty simple and seek friendly indeed. 👍

@mikeal (Contributor, Author) commented Oct 5, 2018

topmost node in the tree will have increasingly large lengths for each part entry, because each entry's length will be the sum of all its children's lengths?

Correct, but there is actually no requirement that the tree be balanced symmetrically.

For instance, it's probably more performant to always keep the first few parts in the root of the tree so that you can start reading a file from the beginning without another branch lookup in the tree. In general, I think the most efficient algorithm for building the tree will be to compact it backwards.
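
To illustrate why that stays seek-friendly regardless of balance, here's a rough sketch (TypeScript, hypothetical structures rather than the exact encoding in SPEC.md, and assuming each node's ranges are expressed within that node's own span):

```ts
// Hypothetical node shapes, not the exact encoding in SPEC.md.
type Part = { start: number; end: number; link: string }
type FileNode =
  | { leaf: true; data: Uint8Array }
  | { leaf: false; parts: Part[] }

// Find the leaf containing `offset` and return its bytes from that point
// on; each level is a binary search over the part ranges, so lookups stay
// cheap no matter how the tree is balanced.
async function seek(
  node: FileNode,
  offset: number,
  load: (link: string) => Promise<FileNode> // e.g. a block-store lookup
): Promise<Uint8Array> {
  if (node.leaf) return node.data.subarray(offset)
  let lo = 0
  let hi = node.parts.length - 1
  while (lo < hi) {
    const mid = (lo + hi) >> 1
    if (node.parts[mid].end <= offset) lo = mid + 1
    else hi = mid
  }
  const part = node.parts[lo]
  // Ranges are assumed relative to this node, so rebase before recursing.
  return seek(await load(part.link), offset - part.start, load)
}
```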

@warpfork commented Oct 6, 2018

We're going to need to be really careful when speccing any tree-layout behaviors like that, to make sure there's one non-contentious Right Answer, for the sake of reproducibility/convergence when multiple uncoordinated users upload the same content.

(True regardless of any tree balancing choice. Just wanted to say it out loud. 😬 )

@warpfork self-requested a review October 6, 2018 10:07
@mikeal (Contributor, Author) commented Oct 6, 2018 via email

@warpfork commented Oct 6, 2018

That's slightly different from what I suggested.

Logging a bunch of meta info does not give reproducibility/convergence when multiple uncoordinated users upload the same content. The "uncoordinated" part is important.

But this is probably something to be hashed out in not-this-PR.

@Stebalien:

Relevant: #18

Otherwise, do we want to support offsets? That is, [start, stop, link, (offset in link)]? That would allow for some pretty nice transformations.

Second, do we implicitly support holes?

Also, do we want to make this its own spec? That is, a sharded byte array spec? (kind of like HAMT being a generic sharded map spec)

@mikeal (Contributor, Author) commented Oct 31, 2018

Also, do we want to make this its own spec? That is, a sharded byte array spec? (kind of like HAMT being a generic sharded map spec)

Eventually, but we should probably punt on it for now in order to get this out quickly. We still don't even have a HAMT spec.

Otherwise, do we want to support offsets? That is, [start, stop, link, (offset in link)]? That would allow for some pretty nice transformations.

I could see how this would be useful in theory but I struggle to imagine the tooling that would end up creating it. Also, you'd want another property for the end of the range in the link.

Second, do we implicitly support holes?

Do unix filesystems support "holes"? If they don't, then I'd argue we should try not to either. I can't articulate the exact issues, but my gut says this could lead to some odd security issues.

@Stebalien:

I could see how this would be useful in theory but I struggle to imagine the tooling that would end up creating it. Also, you'd want another property for the end of the range in the link.

I assume the range would be implied by stop-start.

One use-case would be @warpfork's "alternative dag" thing. That is, we'd be able to, e.g., take an existing file (in IPFS) and transform it into a TAR file while keeping the original blocks. With the current system, we can do this if we start out knowing we want to represent both a tar and a file, but we can't necessarily start with one and convert to the other.

Another use-case is slicing video (although the keyframes make this tricky).
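
To make the offset idea above a bit more concrete (a hypothetical shape, not proposed spec text):

```ts
// Hypothetical extended part entry: [start, stop, link, (offset in link)].
type PartWithOffset = {
  start: number   // position of this slice in the new byte stream (e.g. the tar)
  stop: number    // end of that range; length is implied by stop - start
  link: string    // CID of an existing block from the original file
  offset?: number // where to start reading inside the linked block, default 0
}

// e.g. a tar built around an existing file could interleave newly created
// header blocks with slices of the file's original blocks, referenced by
// link + offset instead of being re-chunked.
const example: PartWithOffset = { start: 512, stop: 1024, link: 'Qm...', offset: 128 }
```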

@Stebalien:

Do unix filesystems support "holes"? If they don't, then I'd argue we should try not to either. I can't articulate the exact issues, but my gut says this could lead to some odd security issues.

They usually do. They're called sparse files.

@Stebalien mentioned this pull request Nov 1, 2018
@mikeal (Contributor, Author) commented Nov 1, 2018

I assume the range would be implied by stop-start.

Ah, yes, that makes sense. What kind of error condition do we want to have when the linked data is smaller than the described range?
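
For example, one option (just a sketch, nothing decided) would be to fail hard rather than pad or truncate silently:

```ts
// One possible rule: a part whose linked block is shorter than the
// range it claims to cover makes the whole node invalid.
function checkPartSize(declared: number, actual: number): void {
  if (actual < declared) {
    throw new Error(
      `invalid part: linked data is ${actual} bytes but the entry ` +
      `describes a range of ${declared} bytes`
    )
  }
}

// e.g. checkPartSize(end - start, linkedBlock.byteLength)
```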

@warpfork commented Nov 2, 2018

I actually also struggle to imagine using the offset-in-link part of the tuple. Tooling that generates such a thing would have to find those reusable bits mid-chunk somehow, and I don't know how one would ever do that efficiently. The general-case solution for doing the right thing without advance coordination or oracles of knowledge is to throw Rabin chunking at it and be done with it... which doesn't produce anything that would use offsets into chunks.

@warpfork commented Nov 2, 2018

"holes" / sparse files

I'd like to exclude these from our definition of unixfs, and I'd dispute that they're usually supported by unix filesystems.

Sparse files are supported by some filesystems, but they're not generally exposed in the POSIX APIs. You can't generally ask a filesystem whether a file has sparse hunks in it. You could infer it from a long run of zeros, but A) that's something I don't think we should really be doing, and B) I don't think most filesystems even do that; you have to use a truncate syscall to allocate a large range of empty file in order to get sparseness effects.

And we don't need to special case this anyway. Long ranges of zeros in a file will... turn into a set of chunks of zeros... which will all dedup with each other.
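
Tiny illustration of the dedup point (Node's crypto module used only as a stand-in for whatever multihash we actually use):

```ts
import { createHash } from 'crypto'

// Every zero-filled chunk in a "sparse" region hashes identically, so a
// content-addressed block store only ever keeps one copy of it.
const chunkSize = 256 * 1024
const digest = (bytes: Uint8Array) =>
  createHash('sha256').update(bytes).digest('hex')

const a = digest(new Uint8Array(chunkSize)) // a chunk of zeros
const b = digest(new Uint8Array(chunkSize)) // another chunk of zeros
console.log(a === b) // true: same block, stored once
```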
