Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

codec/dagcbor: add DecodeOptions.ExperimentalDeterminism #390

Merged
merged 1 commit into from
Apr 5, 2022
Merged

codec/dagcbor: add DecodeOptions.ExperimentalDeterminism #390

merged 1 commit into from
Apr 5, 2022

Conversation

mvdan
Copy link
Contributor

@mvdan mvdan commented Mar 17, 2022

(see commit message)

@mvdan mvdan requested review from warpfork and rvagg March 17, 2022 16:51
@mvdan
Copy link
Contributor Author

mvdan commented Mar 17, 2022

We did apply #386 by default and forcefully, but I imagine that one is extremely unlikely to break any users, whereas this one isn't.

@mvdan
Copy link
Contributor Author

mvdan commented Mar 17, 2022

The bad input is {"c": -17, "0": -17}, i.e. the keys are out of order.

@warpfork
Copy link
Collaborator

warpfork commented Mar 17, 2022

Yeah, I don't think we could do this by default.

In general, Postel's principle seems to still be a good rule of thumb for us to follow, even in IPLD in contexts where we do care about canonicality more than most folks do.

  • Strict modes available, when used as a library: good;
  • Strict modes being the default, esp being what happens when e.g. looking up a thing by multicodec indicator code: somewhere between "risky choice" and "nonviable". (In DAG-CBOR's case: probably the latter.)

(Edited: to put the more general comments first, for ease of readability as future "case law".)

In DAG-CBOR in practice in particular:

There's way too much data in the wild with noncanonical orders of map keys. (The genesis block of filecoin is one of them... not that that particular example should have overwhelming weight -- it's most certainly one of many -- but it's certainly amusing.)

Noncanonical order is also extra-likely-in-practice for DAG-CBOR just because of the bizarreness of the CBOR RFC's choice of sorting algorithms; it catches people off-guard all the time. That contributes significantly to the amount of data in the wild with noncanonical order, as well as the odds of new such data appearing in the future when other people implement new libraries and systems.

We have also already found other places where a noncanonical serial data in the wild became sufficiently common that we ended up deciding to accept it, for backwards compatibility's sake: for example, the CBOR undef byte, which we now quietly coerce to a null in the data model. (Decisions in: go; js; and even older go.) With any such secession, the horses have fully bolted the barn already, and there's very little utility on trying to get some of them back in.

If we were looking at a codec that we had fully specified from scratch, and from its first day of being it had a single canonical form and all of its implementations (and any not-quite implementations, e.g. the relationship CBOR has to DAG-CBOR) were also strict from their first day... we'd be able to consider aggressively defending that ground. But unfortunately, that's not the scenario we have with DAG-CBOR. (Nor with DAG-JSON, if another similar example is useful to compare with.)

@warpfork
Copy link
Collaborator

(edited previous comment, to make it more legible as general "case law" for future reference.)

@mvdan
Copy link
Contributor Author

mvdan commented Mar 20, 2022

FWIW I fully expected this default to be rejected, hence the failing tests and the "DO NOT MERGE".

Tomorrow I'll update the PR to keep the strict mode opt-in on decoding, and update the fuzzer to not require the codecs to follow their own specs by default.

One open question is what should the option be: "enforce spec strictness on decode" to do everything at once (map key order, canonical form for floats and integers, etc), or one option for each one of those separately, just like MapSortMode? I lean towards the former, because I can't imagine anyone would want a decoder that is "a little stricter but not fully strict". cc @rvagg

@mvdan
Copy link
Contributor Author

mvdan commented Mar 20, 2022

Also, I know we've talked about this before, but: I think it's time to reconsider what the specs dictate. It's not good to have the dag-cbor core spec dictate strictness on decode when all current implementations do the opposite. The spec should be updated to reflect reality, given that it's not possible to update the defaults to reflect the spec.

@BigLep
Copy link

BigLep commented Mar 23, 2022

@mvdan : will create the spec update PR?

@rvagg
Copy link
Member

rvagg commented Mar 28, 2022

2c:

  • I pushed toward strictness everywhere for a while but we ended up at this static state of accepting that there's just too much data in the wild that doesn't conform, as Eric outlined (this is even true of dag-pb, we can't even be completely strict with that!).
  • The spec might lean toward suggesting strictness should be the default because it might be reflecting the directionality I was working toward and might need backing off a bit to reflect reality.
  • I still would like an opt-in strictness in all our codecs--there will be systems that people build that are isolated enough that they can enforce strictness and get the benefits from that (there's tradeoffs of course). I worked toward that in the JS codecs too, they're about as close as the Go ones to being able to enable some kind of switch.
  • The approach I took in JS re options was to put a verbose suite of options in the base codec handling code (cborg) but then take strong opinions in the interface for our IPLD codecs (see https://github.com/ipld/js-dag-cbor/blob/b34bf67543c91975d0afe051e50fa1d6eb66a35a/index.js#L86-L98) and I think I'd go a similar direction for JS if we exposed a strict-mode of some kind, just one switch that turns on all the decode strictness. But then, since it's layered, if people wanted to get picky, then they can pull it apart and do it themselves. We might be able to find similar opportunities here, or maybe we bundle it into the one suite of options--have options for all the different bits, but then one additional option "strict decode" that overrides them all. The API might get a bit too noisy and confusing though if there's a flat options space.

@mvdan mvdan changed the title codec/dagcbor: require map key order on decode codec/dagcbor: add DecodeOptions.StrictDeterminism Mar 29, 2022
@mvdan
Copy link
Contributor Author

mvdan commented Mar 29, 2022

I've rebased this and updated the change to keep the same default and make the option a general "be strict" boolean.

PTAL :)

@mvdan
Copy link
Contributor Author

mvdan commented Mar 29, 2022

@mvdan : will create the spec update PR?

Filed ipld/ipld#196 so we don't forget. I haven't sent a PR as it's not a trivial change, and it's also not clear to me e.g. whether we want to apply the same thinking to other codecs.

@rvagg
Copy link
Member

rvagg commented Mar 30, 2022

I've now got mixed feelings about this because it's so incomplete for something called "strict determinism".

Reviewing the list from the top of my head to see how deep this can go:

  • floats must be 64 bits long
    • there may be a case for doing a round-trip on floats to ensure they use the canonical representation .. I'm not sure about this, it might be unnecessary but ieee754 has lots of weird spaces
  • ints and negints must be small-as-possible
  • all length-prefixed items must use small-as-possible ints - strings, bytes, maps, arrays, tag (and only 42 of course - must be <tag><int 42>)

Then there's all the little "you can't put that in there!" items, some of which refmt I think might reject already but we'd need to check:

  • no novel major 7 (float) items, including whatever our latest position on NaN is (did we include it? if so, we should only allow the single correct form of it)
  • no indefinite length items, and no break tokens

Most of this reaches down into refmt.. but you get the idea that having something called "strict determinism" that only does map key ordering and nothing else is a bit weak.

@mvdan
Copy link
Contributor Author

mvdan commented Mar 30, 2022

This is definitely incomplete, but a first step adding the API. I see two options:

  1. Merge the API, documenting that it is incomplete.
  2. Merge the change but keep the API hidden until finished.

2 is the most correct option, as you point out, but I also worry that it might be one of those things in go-ipld-prime that doesn't quite get finished in a long time, for reasons like having to reach into refmt. So my personal take is that having something as an end user is better than nothing :)

Perhaps a middle ground is to call it ExperimentalDeterminism, documenting that it is incomplete and will eventually be replaced by StrictDeterminism.

@rvagg
Copy link
Member

rvagg commented Mar 30, 2022

ExperimentalDeterminism - yes, I like this idea, sends the right message. Maybe link back to this thread in the docs even to note what still needs to be done.

This is simialr to Rod Vagg's EncodeOptions.MapSortMode,
which taught the encoder to canonically sort map keys as per the spec,
but now we're doing the decode side.

We add a single "strictness" knob, as it needs to deal with multiple
bits like integers and floats, and not just map key order.

Update the fuzzer to be aware of this setting.
@mvdan mvdan changed the title codec/dagcbor: add DecodeOptions.StrictDeterminism codec/dagcbor: add DecodeOptions.ExperimentalDeterminism Apr 4, 2022
@mvdan
Copy link
Contributor Author

mvdan commented Apr 4, 2022

Done, PTAL

@mvdan
Copy link
Contributor Author

mvdan commented Apr 4, 2022

I didn't add a link to this thread because the TODO already links to #389

@mvdan mvdan merged commit 81f4ec6 into ipld:master Apr 5, 2022
@mvdan mvdan deleted the dagcbor-decode-strictness-map-keys branch April 7, 2022 01:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: 🎉 Done
Development

Successfully merging this pull request may close these issues.

5 participants