
Validate size in the DagReaders #4680

Closed
wants to merge 12 commits into from

Conversation

kevina
Contributor

@kevina kevina commented Feb 11, 2018

If a node is too large, it is truncated; if it is too small, it is zero extended.

Closes #4540. Closes #4667.
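A minimal sketch of the truncate/zero-extend behaviour described above (illustrative only; sizeAdjReader is a hypothetical name, not the PR's actual implementation, and it assumes only the standard io package):

// sizeAdjReader wraps a reader so that it never returns more than the
// declared size (truncate) and pads with zeros if the underlying data
// runs out early (zero extend).
type sizeAdjReader struct {
    base   io.Reader // underlying node data
    size   int64     // declared size, e.g. a blocksize or filesize
    offset int64     // bytes returned so far
}

func (r *sizeAdjReader) Read(p []byte) (int, error) {
    if r.offset >= r.size {
        return 0, io.EOF
    }
    if max := r.size - r.offset; int64(len(p)) > max {
        p = p[:max] // too large: cut the read off at the declared size
    }
    n, err := r.base.Read(p)
    if err == io.EOF && r.offset+int64(n) < r.size {
        // too small: fill the rest of the buffer with zeros
        for i := n; i < len(p); i++ {
            p[i] = 0
        }
        n, err = len(p), nil
    }
    r.offset += int64(n)
    return n, err
}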

@ghost ghost assigned kevina Feb 11, 2018
@ghost ghost added the status/in-progress In progress label Feb 11, 2018
@ivan386
Contributor

ivan386 commented Feb 11, 2018

>ipfs get -LD QmUeec6DpWz2CA9uZ7GgPAB5nc6QYDg8yax4Wapxs1LSQL
2018-02-11 14:33:45.265625 DEBUG blockservice blockservice.go:183: BlockService GetBlock: 'QmS7RSorsfu1MJpBjcu7L1W2QfsoFDKSrvqAiHzLY4EUH4'
2018-02-11 14:33:45.281250 DEBUG blockservice blockservice.go:183: BlockService GetBlock: 'QmWWC1GHCa7at4K6oMK7Wm7odYCphhRvXYXKAJBreMhRLu'
2018-02-11 14:33:45.281250 DEBUG blockservice blockservice.go:183: BlockService GetBlock: 'QmSKboVigcD3AY4kLsob117KJcMHvMUu6vNFqk1PQzYUpp'
2018-02-11 14:33:45.281250 DEBUG blockservice blockservice.go:183: BlockService GetBlock: 'QmUNLLsPACCz1vLxQVkXqqLX5R1X345qqfHbsf67hvA3Nn'
2018-02-11 14:33:45.296875 DEBUG path resolver.go:55: Resolve: 'QmUeec6DpWz2CA9uZ7GgPAB5nc6QYDg8yax4Wapxs1LSQL'
2018-02-11 14:33:45.296875 DEBUG path resolver.go:144: resolve dag get
2018-02-11 14:33:45.296875 DEBUG blockservice blockservice.go:183: BlockService GetBlock: 'QmUeec6DpWz2CA9uZ7GgPAB5nc6QYDg8yax4Wapxs1LSQL'
Saving file(s) to QmUeec6DpWz2CA9uZ7GgPAB5nc6QYDg8yax4Wapxs1LSQL
 0 B / 3.05 KB [------------------------------------------------------------------------------------------------------------------]   0.00%
2018-02-11 14:33:45.296875 DEBUG blockservice blockservice.go:239: Blockservice: Got data in datastore
2018-02-11 14:33:45.312500 DEBUG blockservice blockservice.go:239: Blockservice: Got data in datastore
2018-02-11 14:33:45.312500 DEBUG blockservice blockservice.go:239: Blockservice: Got data in datastore
2018-02-11 14:33:45.312500 DEBUG blockservice blockservice.go:239: Blockservice: Got data in datastore
2018-02-11 14:33:45.312500 DEBUG blockservice blockservice.go:239: Blockservice: Got data in datastore
 3.05 KB / 3.05 KB [===========================================================================================================] 100.00% 0s
Error: unexpected EOF
2018-02-11 14:33:45.328125 INFO command request.go:105: Shutting down node...
2018-02-11 14:33:45.328125 DEBUG core core.go:544: core is shutting down...
2018-02-11 14:33:45.328125 DEBUG blockservice blockservice.go:274: blockservice is shutting down...

There are no blocksizes. The file size is bigger than all of the data combined. All of the links are written to the file.

@Stebalien
Member

Stebalien commented Feb 11, 2018

Nice!

As @ivan386 says, we should check if filesize == sum(blocksize). Personally, I'd just fail the entire chunk if that check fails (if you want a hole at the beginning of a file, you can always point to the empty node and give it a non-empty size).

Edit: Alternatively, we could do the same zero-extend/truncate thing.

Member

@Stebalien Stebalien left a comment

I wish there were a simpler way to do this but I'm pretty sure there isn't.

r.offset += int64(n)
return n, err
}
for ; n < len(p) && r.offset+int64(n) < r.size; n++ { // pad
Member

Nit: This may confuse the optimizer. I'd compute the bounds up-front. Possibly something like:

remaining := r.size - (r.offset + int64(n))
if remaining > int64(len(p)-n) {
    remaining = int64(len(p)-n)
}
for i := range p[n : n+int(remaining)] {
    p[n+i] = 0
}
n += int(remaining)

Contributor Author

Ok.

Contributor Author

I can still do this, but I am reluctant to due to the increase in code size.

Member

Entirely up to you. I seriously doubt this will be a bottleneck.

}
n, err := r.base.Read(p)
if err == nil {
_, err = r.base.Read(nil)
Member

I'd just allow the short read and force the outer reader to loop. Unfortunately, a lot of readers don't like read attempts with empty buffers (and I'd rather not do two reads per outer read).
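For context, a short read is legal for an io.Reader; a caller that needs a full buffer can loop over short reads itself, for example with io.ReadFull (illustrative usage only, not code from this PR; wantSize is a hypothetical variable):

buf := make([]byte, wantSize) // e.g. the expected blocksize
if _, err := io.ReadFull(r, buf); err != nil {
    // io.ErrUnexpectedEOF here means the reader genuinely ran out of data
    return err
}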

Contributor Author

I did try that initially and it broke some code; I will look into it more closely.

Member

That's not good...

}
// It's easier just to always use io.SeekStart rather than
// correctly adjust offset for io.SeekCurrent and io.SeekEnd.
return r.base.Seek(r.offset, io.SeekStart)
Member

Shouldn't this return r.offset? base.Seek could return a short offset.
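A sketch of the suggested fix, assuming the Seek method shown in the diff above (illustrative, not the final code):

// seek the underlying reader, but report the wrapper's own logical offset
if _, err := r.base.Seek(r.offset, io.SeekStart); err != nil {
    return 0, err
}
return r.offset, nil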

Contributor Author

Yes.

r.offset = r.size + offset
}
if r.offset < 0 {
return -1, errors.New("Seek will result in negative position")
Member

Shouldn't this rollback the offset (technically)?

Contributor Author

Yes, will fix.

@kevina
Contributor Author

kevina commented Feb 11, 2018

@Stebalien, two questions

  1. If there are links defined but no corresponding blocksizes, should we consider the block ill-formed and fail with an error? After all, seeking is impossible. (This came up in one of the test cases, on a dag that was likely constructed manually: QmTTDNxNJUqF5PZicJqjDjsUed1bHaPWwFdEMvb1iY93Ur in the t0110-gateway.sh test "Add compact blocks".)

  2. If filesize != sum(blocksize), which should take precedence? I think it would be better to consider the node ill-formed and fail with an error. Is that what you meant when you said "I'd just fail the entire chunk if that check fails"?

@Stebalien
Member

If there are links defined but no corresponding blocksizes, should we consider the block ill-formed and fail with an error...

IMO, yes.

If filesize != sum(blocksize), which should take precedence? I think it would be better to consider the node ill-formed and fail with an error. Is that what you meant when you said "I'd just fail the entire chunk if that check fails"?

Yes. I consider this situation to be the same as (1).
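A minimal sketch of the two rules agreed on above (illustrative only; validateSizes and its error messages are hypothetical, not the PR's code, and the standard errors package is assumed):

// validateSizes treats a node as ill-formed if it has links without
// matching blocksizes, or if the declared filesize does not equal the
// inline data plus the sum of the blocksizes.
func validateSizes(pbdata *pb.Data, numLinks int) error {
    if numLinks > 0 && len(pbdata.Blocksizes) != numLinks {
        return errors.New("unixfs: node has links but no matching blocksizes")
    }
    total := uint64(len(pbdata.GetData())) // inline data counts toward the filesize
    for _, bs := range pbdata.Blocksizes {
        total += bs
    }
    if total != pbdata.GetFilesize() {
        return errors.New("unixfs: filesize does not equal the sum of the blocksizes")
    }
    return nil
}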

@ivan386
Contributor

ivan386 commented Feb 11, 2018

if you want a hole at the beginning of a file, you can always point to the empty node and give it a non-empty size

For test: QmRqh1ckCLG6nXoE1aJQT2CHrByk5DWji8v3s4bHcmN975

>ipfs dag get QmRqh1ckCLG6nXoE1aJQT2CHrByk5DWji8v3s4bHcmN975
{"data":"CAIS4AMwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAYwBYg4AMg4AMg4AMg4AMg4AM=",
"links":[
{"Cid":{"/":"zb2rhmy65F3REf8SZp7De11gxtECBGgUKaLdiDj7MCGCHxbDW"},"Name":"1","Size":0},
{"Cid":{"/":"zb2rhb3bbxnrqKS2QopcDvbCz5F9nRW7MFzRkhH7LoqhG8AyD"},"Name":"2","Size":480},
{"Cid":{"/":"zb2rhahcgW9tVL7cbPnbdf8y5DzYdbLqCKh2AeBto1JnyCErj"},"Name":"3","Size":480},
{"Cid":{"/":"zb2rhbo1XsFWFP3LCF14NHDikCBn1o8hXVZ6BtR5hvntU5PRp"},"Name":"4","Size":480},
{"Cid":{"/":"zb2rhhkerjTtMxmCwrT9UZC3z7vVuLK8nPnMJzBFRwrEQtiyL"},"Name":"5","Size":480}
]}

Here is a zero-size raw link: zb2rhmy65F3REf8SZp7De11gxtECBGgUKaLdiDj7MCGCHxbDW
We need to detect such zero-data links and not try to download them; we only waste time on them.

I think that aligning blocks to the end of the file is the best solution. It means less data in the block.

ipfs add does not allow adding a zero-sized file as a raw link. It returns: QmbFMke1KXqnYyBBWxB74N4c5SBnJMVAiMNRcGu6x1AwQH

ipfs dag get QmbFMke1KXqnYyBBWxB74N4c5SBnJMVAiMNRcGu6x1AwQH
{"data":"CAIYAA==","links":[]}

@@ -188,7 +188,7 @@ test_expect_success "Add compact blocks" '
printf "foofoo" > expected
'

-test_expect_success "GET compact blocks succeeds" '
+test_expect_failure "GET compact blocks succeeds" '
Member

These sorts of changes make me nervous. What's the technical reason for the swap?

Contributor Author

@kevina kevina Feb 12, 2018

Because the test uses a unixfs node that does not define any blocksizes. Per earlier discussion:

@Stebalien, two questions

  1. If there are links defined but no corresponding blocksizes, should we consider the block ill-formed and fail with an error? After all, seeking is impossible. (This came up in one of the test cases, on a dag that was likely constructed manually: QmTTDNxNJUqF5PZicJqjDjsUed1bHaPWwFdEMvb1iY93Ur in the t0110-gateway.sh test "Add compact blocks".)

IMO, yes.

Member

That's not good... I assumed that all of the files we produced did have blocksizes listed. @whyrusleeping did we change this at some point?

Member

@Stebalien I think this test is testing a specifically constructed block.

Member

I think we should add a decent number of unit tests for stuff like this. Testing this sort of specific detailed logic via sharness feels wrong.

Contributor Author

@whyrusleeping thoughts?

Contributor

@kevina here is the background on this testcase: #4286 (comment)

Specifically:

Additionally - the way my project chunks data, my average blocksize is around 4096 bytes ( with a hard-limit on block sizes at 64k ), so forgoing any optional metadata is critical for me in order to reduce the size of my objects ( as the overhead adds up incredibly fast when one deals with hundreds of millions of objects in a DAG ).

@whyrusleeping's response:

The sizes are optional

@kevina please do not break this relied-upon feature ( neither in PB- nor in the future CBOR-UnixFS )

Member

So, this is a problem. Unix programs expect files to be seekable and will break horribly if they aren't (and I'm not going to break every single unix program ever written). However, I also don't want to break existing uses if I can help it and this problem doesn't seem insurmountable.

We should be able to fall back on dumb seeking (although we may want to disable this from the gateway). Basically, the logic would be as follows:

  1. If blocksizes are defined, use them.
  2. If a block with children has no blocksizes defined (none at all, not some, not half, not incorrect ones, none), dumb seek through that portion of the file.

This actually gives us the best of both worlds. Files with lots of tiny blocks can put blocksizes at the top layers of the DAG and omit them in the lower levels.


In terms of implementation, we can punt on dumb seeking for now. Simply validate blocksizes if present and error when seeking when they're not present.
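A minimal sketch of that "error when seeking" fallback (canSeek is a hypothetical helper; this is not code from the PR):

// canSeek reports whether precise seeking is possible: either the node has
// no child links, or every link has a corresponding blocksize.
func canSeek(numLinks int, blocksizes []uint64) bool {
    return numLinks == 0 || len(blocksizes) == numLinks
}

// inside the DagReader's Seek, something like:
if !canSeek(len(links), pbdata.Blocksizes) {
    return 0, errors.New("unixfs: cannot seek in a file without blocksizes")
}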

@kevina how hard would this be?

Contributor Author

I will see what I can do.

Contributor Author

Done.

curLinks := getLinkCids(n)
data := pb.GetData()

// validate
Member

I'd like to have this logic broken out separately and have a few unit tests for it

Contributor Author

@kevina kevina Feb 12, 2018

Can you clarify what you mean here?

Member

The validation logic that has just been added should live in its own function, and then we should add a few unit tests for that function.

Contributor Author

@kevina kevina Feb 12, 2018

I'll see what I can do.

Contributor Author

Okay this is done now. Sorry for the delay.

@Stebalien
Member

Stebalien commented Feb 12, 2018

@ivan386

I think that aligning blocks to the end of the file is the best solution. Less data in block.

It adds complexity. I don't want to have one case where we zero-pad the end and another where we zero-pad the beginning.

ipfs add does not allow adding a zero-sized file as a raw link. It returns: QmbFMke1KXqnYyBBWxB74N4c5SBnJMVAiMNRcGu6x1AwQH

That's a bug. Would you mind reporting it? Also, we technically have an "identity" hash function so, eventually, we should be able to encode this empty block as 4 bytes: \x01\x37\x00\x00 (CIDv1, DagRaw, Multihash-ID, Length 0).

Note: The empty block is zb2rhWm2M1wXXqtqU6pHfovz3DZQ7D54ZD2xN3ynwankHCBCn (can add it with ipfs block put --format=raw <<<'').

@ivan386
Contributor

ivan386 commented Feb 12, 2018

@Stebalien

That's a bug. Would you mind reporting it?

Reported: #4688

Note: The empty block is zb2rhWm2M1wXXqtqU6pHfovz3DZQ7D54ZD2xN3ynwankHCBCn (can add it with ipfs block put --format=raw <<<'').

zb2rhWm2M1wXXqtqU6pHfovz3DZQ7D54ZD2xN3ynwankHCBCn is a 1-byte block containing a line feed (0A).

The real zero-length raw block is: zb2rhmy65F3REf8SZp7De11gxtECBGgUKaLdiDj7MCGCHxbDW

Also, we technically have an "identity" hash function so, eventually, we should be able to encode this empty block as 4 bytes: \x01\x37\x00\x00 (CIDv1, DagRaw, Multihash-ID, Length 0).

Converted to base58: z2oTmZ

ipfs dag put -f"protobuf" test_dag_identity.json
Error: multihash too short. must be > 3 bytes

test_dag_identity.json:

{"data":"CAIS4AMwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAYwBYg4AMg4AMg4AMg4AMg4AM=",
"links":[
{"Name":"1","Size":0,"Cid":{"/":"z2oTmZ"}},
{"Name":"2","Size":480,"Cid":{"/":"zb2rhb3bbxnrqKS2QopcDvbCz5F9nRW7MFzRkhH7LoqhG8AyD"}},
{"Name":"3","Size":480,"Cid":{"/":"zb2rhahcgW9tVL7cbPnbdf8y5DzYdbLqCKh2AeBto1JnyCErj"}},
{"Name":"4","Size":480,"Cid":{"/":"zb2rhbo1XsFWFP3LCF14NHDikCBn1o8hXVZ6BtR5hvntU5PRp"}},
{"Name":"5","Size":480,"Cid":{"/":"zb2rhhkerjTtMxmCwrT9UZC3z7vVuLK8nPnMJzBFRwrEQtiyL"}}
]}

Report new bug?

PS:
01 37 00 03 00 00 00 -> z3fsQ7Xw35

>ipfs get z3fsQ7Xw35
Error: merkledag: not found
>ipfs block get z3fsQ7Xw35
Error: blockservice: key not found

PPS:
01 55 00 03 00 00 00 -> z3vosKrgsh

Same error

ipfs get QmPwC8o8wrTisWNbBijSL4A9JJTW4UavZNo93NqF4Mwgz2
Error: failed to fetch all nodes

@Stebalien
Member

Stebalien commented Feb 12, 2018

Real zero length raw block is...

Good catch.

Report new bug?

The fact that we choke on a zero-length identity multihash is a bug in the multihash library. The fact that we don't currently support "inline blocks" is more of a missing feature (we only recently added the identity multihash and haven't added support for extracting blocks from it yet).

However, feel free to report both (we want inline blocks eventually anyways).


0x55 versus 0x37: You're right, the code is 0x55. I converted 55 into hex (but it's already hex).

@whyrusleeping
Member

progress here?

@kevina
Contributor Author

kevina commented Feb 16, 2018

Assuming the tests pass, I think this should be good to go. We can clean up the failed test case later.

@whyrusleeping
Member

As @Stebalien has done most of the review here, I'd like to get his 👍 on merging

@magik6k
Member

magik6k commented Mar 2, 2018

@whyrusleeping see #4680 (comment)

~Kubuxu

@Stebalien
Member

Stebalien commented Mar 6, 2018

@whyrusleeping, @kevina see my comment here (unless I missed some change and this is no longer an issue).

(Also, @kevina, sorry for giving you bad information. I was unaware of @whyrusleeping's promise).

@kevina
Contributor Author

kevina commented Mar 16, 2018

@Stebalien I pushed a commit which I think should address your concern.

@kevina kevina requested a review from Stebalien March 16, 2018 04:04
@Kubuxu Kubuxu self-requested a review March 16, 2018 09:20
@kevina kevina requested a review from magik6k May 31, 2018 05:08
@kevina
Contributor Author

kevina commented May 31, 2018

@Stebalien I fixed the test case, in case that was what you were waiting for. @magik6k, a review from you would also be helpful.

}
// It's easier just to always use io.SeekStart rather than
// correctly adjust offset for io.SeekCurrent and io.SeekEnd.
_, err := r.base.Seek(newOffset, io.SeekStart)
Member

What happens if r.size == 1000, but r.base.Size() == 500, and we try to call r.Seek(700)?

It seems like this would error out, when it should succeed and subsequent reads should return up to 300 bytes of zeros. I guess I'm not sure what errors Seek returns when you try to seek past its bounds.

Contributor Author

Actually, it won't error out. From the documentation (https://golang.org/pkg/io/#Seeker):

Seeking to an offset before the start of the file is an error. Seeking to any positive offset is legal, but the behavior of subsequent I/O operations on the underlying object is implementation-dependent.

func (w *truncWriter) Write(p []byte) (int, error) {
truncC := 0
if int64(len(p)) > w.size-w.offset {
truncC = int(int64(len(p)) - w.size - w.offset)
Member

I think your order of operations here is wrong:
let's say we try to write 10 bytes to a thing that has a size of 15, and a current offset of 12:

10 - 15 - 12 == -17

you want: int64(len(p)) - (w.size - w.offset)

truncC := 0
if int64(len(p)) > w.size-w.offset {
truncC = int(int64(len(p)) - w.size - w.offset)
p = p[:w.size]
Member

I think you mean p = p[:w.size - w.offset] here
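Putting both corrections together, the method could look roughly like this (a sketch only; the field name w.base and the choice to count dropped bytes as written are assumptions, not the PR's final code):

func (w *truncWriter) Write(p []byte) (int, error) {
    truncC := 0
    if avail := w.size - w.offset; int64(len(p)) > avail {
        truncC = int(int64(len(p)) - avail) // bytes beyond the declared size
        p = p[:avail]                       // keep only what still fits
    }
    n, err := w.base.Write(p) // w.base: assumed underlying io.Writer
    w.offset += int64(n)
    // report the dropped bytes as written so callers don't see a short write
    return n + truncC, err
}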

}

// Write implements the Write method as defined by io.Writer
func (w *truncWriter) Write(p []byte) (int, error) {
Member

Please add tests for this function once the issues here are fixed.

Contributor Author

Okay, done. Sorry about that. I wrote this code several months ago and I thought it was ready for review; I guess I was wrong, as I did not double-check things carefully enough.

@kevina kevina force-pushed the kevina/validate-size branch 2 times, most recently from 14856b3 to 77149fd on June 1, 2018 22:02
unixfs/unixfs.go Outdated
for _, blocksize := range pb.Blocksizes {
total += blocksize
}
if total != pb.GetFilesize() {
Member

So, one case here is pb.Filesize == nil. Is Filesize required?

Contributor Author

I'll double-check, and if not I will add a check for that.

Contributor Author

Note: the function pb.GetFilesize() returns a uint64; if pb.Filesize == nil it will return 0.

Member

Sorry, my question was more: If Filesize is nil, that means it wasn't specified. Do we require Filesize to be specified? Should we calculate it from the blocksizes if left unspecified?

Contributor

@Stebalien given there must be a source of size, and given that blocksizes are optional, my hunch would be to just make Filesize hard-required.

Member

Hm. I'd like @whyrusleeping and @diasdavid to 👍 this.

@whyrusleeping, @diasdavid:

Will all IPFS files include a Filesize?

Member

I think it's probably fine to assume that any correctly formatted ipfs/unixfs file will contain a filesize. It seems pretty hard to do a lot of things without it.
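Since pb.GetFilesize() returns 0 when the field is unset, a hard requirement can only be checked against the pointer itself; a sketch (not necessarily the PR's exact code):

if pb.Filesize == nil { // pb here is the *pb.Data node, as in the diff above
    return errors.New("unixfs: node does not specify a Filesize")
}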

Contributor Author

Okay I added a test to make sure it is not nil.

unixfs/unixfs.go Outdated
// ValidatePB validates a unixfs protonode.
func ValidatePB(n *dag.ProtoNode, pb *pb.Data) error {
if len(pb.Blocksizes) == 0 { // special case
return nil
Member

There's also the case where there are no blocks but we still have inline data.

kevina added 11 commits June 12, 2018 04:24
If a node is too large, it is truncated; if it is too small,
it is zero extended.

License: MIT
Signed-off-by: Kevin Atkinson <k@kevina.org>
License: MIT
Signed-off-by: Kevin Atkinson <k@kevina.org>
One failed test due to the use of an ill-formed unixfs node.

License: MIT
Signed-off-by: Kevin Atkinson <k@kevina.org>
License: MIT
Signed-off-by: Kevin Atkinson <k@kevina.org>
License: MIT
Signed-off-by: Kevin Atkinson <k@kevina.org>
License: MIT
Signed-off-by: Kevin Atkinson <k@kevina.org>
License: MIT
Signed-off-by: Kevin Atkinson <k@kevina.org>
License: MIT
Signed-off-by: Kevin Atkinson <k@kevina.org>
License: MIT
Signed-off-by: Kevin Atkinson <k@kevina.org>
License: MIT
Signed-off-by: Kevin Atkinson <k@kevina.org>
License: MIT
Signed-off-by: Kevin Atkinson <k@kevina.org>
License: MIT
Signed-off-by: Kevin Atkinson <k@kevina.org>
}
if len(dr.pbdata.Blocksizes) > 0 {
// if blocksizes are set, ensure that we read exactly that amount of data
dr.buf, err = newSizeAdjReadSeekCloser(dr.buf, dr.pbdata.Blocksizes[linkPos])
Member

You know, I think we can do something simpler here. Given that we know the Filesize of the buffer, we can require that buf export a Size() method. Then, we can just have a check here:

  • buf.Size() > blocksize -> make a limit seeker
  • == -> return as-is
  • < -> zero extend

That way, we can do all of these checks up-front. Thoughts? Too late?
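A sketch of that up-front dispatch (newLimitSeeker and newZeroExtendReader are hypothetical names; this is not code from the PR):

blocksize := int64(dr.pbdata.Blocksizes[linkPos])
switch {
case buf.Size() > blocksize:
    dr.buf = newLimitSeeker(buf, blocksize) // more data than declared: truncate
case buf.Size() == blocksize:
    dr.buf = buf // sizes agree: use as-is
default:
    dr.buf = newZeroExtendReader(buf, blocksize) // too little data: zero extend
}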

@Stebalien
Member

What currently happens if there are no blocksizes but the filesize doesn't match? Both with and without inline data? We may need to truncate/zero extend in this case as well.

Current rules:

  1. Filesize is required.
  2. If blocksizes are specified:
    1. They must sum to the filesize.
    2. There must be one for each block.
    3. If a blocksize is greater than the size of a block, zero extend.
    4. If smaller, truncate.
  3. If blocksizes are not specified:
    1. Fail (or truncate?) if the inline data < Filesize? NEW
    2. Truncate/zero extend the entire file to Filesize NEW

One thing is a bit inconsistent. If no blocksizes are specified, the filesize is used to truncate/zero extend. If they are specified, the filesize can't do this. We could also just truncate/extend always (i.e., don't check that sum(blocksizes) == filesize)...


So, I know this is kind of cold feet at the 11th hour. However, after seeing all the weird corner cases that come up in practice, I'm starting to wonder if we should fail more. That is:

  1. Filesize is required.
  2. If blocksizes are specified:
    1. They must sum to the filesize.
    2. There must be one for each block.
    3. Each block's blocksize must match the filesize specified in the block (or the implied filesize for raw leaves). Basically, when we load a block, we call GetFilesize on the block and check it against the size.
  3. If blocksizes are not specified: Keep a running tally of how much data we've read so far. If we hit an EOF before we hit the filesize, return a read error. Same if we hit the filesize before reading the end of the file. We can even do this at the block level: If we load a block that would cause us to exceed the filesize, fail; if the last block isn't big enough, fail. (See the sketch after the questions below.)

The downsides are:

  1. We lose this truncation/extension technique.
  2. We change much of this PR (although, really, it should mostly just be code deletions).

  • Q: Ditch the truncate/extend and be strict?
    • Yes.
    • No. Q: Truncate/zero extend on sum(blocksize)/filesize mismatch?
      • Yes.
      • No. Q: Truncate or fail if inline data < filesize?
        • Truncate.
        • Fail.
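A minimal sketch of the strict running-tally check from point 3 above (strictReader is an illustrative name; this is not code from the PR):

// strictReader fails on any size mismatch instead of truncating or
// zero-extending, per the stricter rules proposed above.
type strictReader struct {
    base     io.Reader
    filesize int64
    read     int64 // running tally of bytes returned so far
}

func (r *strictReader) Read(p []byte) (int, error) {
    n, err := r.base.Read(p)
    r.read += int64(n)
    switch {
    case r.read > r.filesize:
        return n, errors.New("unixfs: file is larger than its declared filesize")
    case err == io.EOF && r.read < r.filesize:
        return n, errors.New("unixfs: file ended before reaching its declared filesize")
    }
    return n, err
}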

@magik6k
Member

magik6k commented May 14, 2019

I'm not sure what the status on this currently is, but:

  • This code now lives in go-unixfs
  • DagReader received a major refactor since this was opened

@magik6k magik6k closed this May 14, 2019
@Stebalien
Member

This needs to be worked out at the spec level first. We jumped into coding too quickly, IMO.

Labels
need/review, status/in-progress

Successfully merging this pull request may close these issues:
Error: archive/tar: write too long
Validate size in the DagReaders

7 participants