Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: representation of full uint64 range in & out of cbor #413

Merged
merged 1 commit into from
Jun 10, 2022

Conversation

rvagg
Copy link
Member

@rvagg rvagg commented May 12, 2022

I don't think I'd dare do this without Eric's OK so it's a draft for now. This should allow us to access the full positive uint64 range of integers, although you can only get to it by type checking and accessing some custom methods.

I'm mainly thinking about this for bindnode, when you have a uint64 you can actually encode and decode the full range rather than overflowing the signed bit when casting to int64 for the highest range of integers. The awkward cast to the uintNode shaped type wouldn't be a big deal tucked away in bindnode and it's not really something we have to advertise publicly unless folks complain and want their full range.

This doesn't handle the negative uint64 range, which cbor can do, that would require some plumbing through refmt which currently errors when it encounters a negint below int64 range.

@rvagg
Copy link
Member Author

rvagg commented May 12, 2022

A better approach might be to leave plainInt unmolested and make a new plainUint that does what we need. It could even be reserved for the overflow cases only in the codec, with ints within the int64 range just getting an AssignInt(). So this is just a special case that can be sniffed for where it's really needed.

The existing TODO would still be a concern though, calling AsInt() will be an overflow. So maybe it should return an error in that case. Silently overflowing is probably not a good feature to keep baked in.

@rvagg
Copy link
Member Author

rvagg commented May 13, 2022

As per @hannahhoward's suggestion, I added a public interface like LargeBytes, called UintNode, with an AsUint() method. I've also refactored as per my last comment and I've put plainInt back to how it was (a simple boxed int64) but given it UintNode support. I've also added a uintNode type that implements the full uint path, with an internal uint64 and a positive bool. I've also used this type exclusively for the >int64 case, so a dagcbor decode will mostly still use a simple AssignInt() but above that range will AssignNode(NewUint(v, pos)). So in terms of perf, the main overhead now is the type check for UintNode on marshalling integers; otherwise existing code paths are mostly the same; we're just adding additional functionality.

I also did dagjson, but had to shunt that to a branch because I discovered refmt's json handling doesn't deal with TUint token types. This should be a relatively straightforward fix in refmt, but also a low priority I think. https://github.com/ipld/go-ipld-prime/tree/rvagg/uint64-dagjson.

@rvagg
Copy link
Member Author

rvagg commented May 30, 2022

I'm considering removing the possibility of negative uints from this - CBOR enables this, you can encode a negative uint and get it out safely, but of course at the language level we need a second signal to say that this value should be negative, and it's a little bit weird. If you're dealing with negative numbers and need to resort to this kind of thing, then you're probably already doing this at an application level and are unlikely to need it at the encoding level I suspect. You probably either already have a "is positive" boolean hanging around (and are likely even going to encode it in your data structure adjacent to the uint value), or it's always known to be negative and therefore no signal is needed.

In most languages we're used to not thinking about negative uints, we work around that in other ways (like just using signed ints).

@hannahhoward
Copy link
Collaborator

My primary comment was going to be about the support of negative uint64s, so I guess +1 to removing them. It seems odd.

Especially given we use refmt -- as I read the code, we don't actually even have access to the underlying CBOR in the marshal/unmarshal code, just the token stream, which also appears, by my read, to not really support the negative 2^64 range.

@hannahhoward
Copy link
Collaborator

finally I'm pretty sure cbor-gen doesn't support 2^64 negative ints, and compatibility with cbor-gen is the primary thing driving our use case here anyway.

@rvagg rvagg marked this pull request as ready for review May 31, 2022 06:49
@rvagg
Copy link
Member Author

rvagg commented May 31, 2022

I removed the positive bool from this, so it can only work on positive uints. One side effect of this is that a plain int node can't act as a UintNode—it could, but it would error every time it's negative, which becomes a problem in the codec which first needs to check for UintNode, then call AsUnit() and then for negative integers it has to deal with the error case; which is a bit too inefficient.

And now I have mixed feelings about this—it's not as universal as I hoped this would be, I can't treat any integer as a uint and just use the sign to cover the full range. There's now a much clearer bifurcation between integer types, even when you're dealing with a positive int that's within the uint64 range.

There remains the option of interpreting every positive integer out of the codec into a UintNode, but I don't know if that buys us much.

@rvagg
Copy link
Member Author

rvagg commented Jun 8, 2022

@hannahhoward @willscott want to give a (hopefully final?) review here?

I've made a trade-off choice here for decoding: we AssignNode(UintNode{v}) only when we encounter integers >MaxInt64. And the Int nodes that are within int64 range don't get to conform to the UintNode interface. One reason for this is that we get to take advantage of direct AssignInt() calls on builders for most positive integers, which should be more efficient in most cases. But it leads to asymmetries and quicks, but we have to pick a side on this:

  • only a the top part of the uint64 range will be able to act as a UintNode
  • something can be a Kind_Int but AsInt() errors
  • you can use a type check for UintNode to see whether these top-end values are there and available

An alternative is that a Int node implementer, such as basicnode.plainInt, could conform to the UintNode interface and error on AsUint(), but you'd lose the ability to use the type check to see if it's available, you have to call it an error instead.

So we get this set of conditions if we went the other way:

  • all integer nodes (from basicnode anyway) get to act as a UintNode (simple change to basicnode.plainInt).
  • something can be a UintNode but AsUint() errors
  • you can't use a type check for UintNode to see whether these top-end values are there and available, you have to check error on AsUint()

For now, I'm considering this mostly an internal concern, for bindnode, so keeping current code paths mostly unmolested is ideal. For other cases you're going to encounter errors out of range, at some point. e.g. in a codegen TypedPrototype build, the AssignNode(UintNode{}) is going to find its way to an AsInt() call somewhere which should error as being out of range (which is better than the current state because it overflows silently).

@rvagg
Copy link
Member Author

rvagg commented Jun 10, 2022

Too many branches stacking up on each other and this is at the bottom of them so I'm going to pull the trigger and merge 🤞

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: 🎉 Done
Development

Successfully merging this pull request may close these issues.

3 participants