Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cid implementation variations++ #72

Merged
merged 1 commit into from
Aug 30, 2018
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
26 changes: 25 additions & 1 deletion _rsrch/cidiface/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,7 @@ There are quite a few different ways to go:
- Option C: CIDs as an interface with multiple implementors.
- Option D: CIDs as a struct; multihash also as a struct or string.
- Option E: CIDs as a struct; content as strings plus offsets.
- Option F: CIDs as a struct wrapping only a string.

The current approach on the master branch is Option A.

Expand All @@ -42,6 +43,10 @@ the binary format of the cid internally, and so could yield it again without
malloc, while still potentially having faster access to components than
Option B since it wouldn't need to re-parse varints to access later fields.

Option F is actually a varation of Option B; it's distinctive from the other
struct options because it is proposing *literally* `struct{ x string }` as
the type, with no additional fields for components nor offsets.

Option C is the avoid-choices choice, but note that interfaces are not free;
since "minimize mallocs" is one of our major goals, we cannot use interfaces
whimsically.
Expand Down Expand Up @@ -72,6 +77,7 @@ When using `*Cid`, the nil value is a clear sentinel for 'invalid';
when using `type Cid string`, the zero value is a clear sentinel;
when using `type Cid struct` per Option A or D... the only valid check is
for a nil multihash field, since version=0 and codec=0 are both valid values.
When using `type Cid struct{string}` per Option F, zero is a clear sentinel.

### usability as a map key is important

Expand All @@ -82,6 +88,7 @@ We already covered this in the criteria section, but for clarity:
- Option C: ~ (caveats, and depends on concrete impl)
- Option D: ✔
- Option E: ✔
- Option F: ✔

### living without offsets requires parsing

Expand All @@ -106,7 +113,7 @@ How much this overhead is significant is hard to say from microbenchmarking;
it depends largely on usage patterns. If these traversals are a significant
timesink, it would be an argument for Option D/E.
If these traversals are *not* a significant timesink, we might be wiser
to keep to Option B, because keeping a struct full of offsets will add several
to keep to Option B/F, because keeping a struct full of offsets will add several
words of memory usage per CID, and we keep a *lot* of CIDs.

### interfaces cause boxing which is a significant performance cost
Expand All @@ -129,6 +136,23 @@ if a situation is at the scale where it's become important to mind whether
or not pointers are a performance impact, then that situation also
is one where you have to think twice before using interfaces.

### struct wrappers can be used in place of typedefs with zero overhead

See `TestSizeOf`.

Using the `unsafe.Sizeof` feature to inspect what the Go runtime thinks,
we can see that `type Foo string` and `type Foo struct{x string}` consume
precisely the same amount of memory.

This is interesting because it means we can choose between either
type definition with no significant overhead anywhere we use it:
thus, we can choose freely between Option B and Option F based on which
we feel is more pleasant to work with.

Option F (a struct wrapper) means we can prevent casting into our Cid type.
Option B (typedef string) can be declared a `const`.
Are there any other concerns that would separate the two choices?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The struct wrapper allows us to add additional metadata/offsets later if we want.

Copy link
Member Author

@warpfork warpfork Aug 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's "Option D" or "Option E" though, respectively.

Tbh, I think basically all of these are equal in terms of how easy they are to switch to any of the others, later. Switching from typedef-string to struct later would be a "breaking change" in terms of how a linker things about the raw assembly... but it will not be a breaking change syntactically‡, and that's what matters to us as humans estimating refactor costs.

[‡] Presuming we make all accesses outside the cid package itself via public functions, and define a cid.Zero. Which would be the plan, I think. The only thing that makes this first big refactor a pain is the syntactic change of removing the * and fixing all the == nil occurances.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

People would still compare against "" and cast to a string/bytes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The former should be rejected in code review based on the existence of cid.Zero and introduced universally when we do the refactor; the latter would be a bug.

If we're seriously worried about these, then we should 100% hands-down prefer Option F over B, because more than precisely zero instances of casting to Cid from outside the cid package would strike me as acceptable code quality.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The former should be rejected in code review based on the existence of cid.Zero and introduced universally when we do the refactor; the latter would be a bug.

That's not an assumption we can make, generally. Users (not in our projects) will cast. It's really the primary reason to use a type alias over a struct.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the choice of the constant I am going with Nil. My second choice would be Empty, and Zero the third. Nil to me is neutral and also implies an invalid value, of which an empty Cid{} struct is (it can't be used for anything). Empty to me implies a valid but empty container, but I am okay with it if Nil is causing that much of a problem. Zero to me implies a numeric value, of which a Cid is not. I understand the notion of a zero-value in compiler terminology, but Zero still strikes me as an odd choice, ZeroValue may be a bit better, but if Nil is no good I still rather go with Empty.

I didn't realize that using struct initializers in the middle of equality was so verbose. However, it was never my intent to use them in equality. Instead I provided an IsNil method as c.IsNil() is nicer than c == cid.Nil or c == (cid.Cid{}). However elected to still provide var Nil = Cid{} so it can be used if desired.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to make speculations about whether comparison to a var is going to be a performance issue, then shouldn't proposing an equality method have at least as much burden to prove itself a non-issue before we consider that a confirmed design choice?

I disagree completely about naming a variable of non-pointer type "Nil". That is extremely unidiomatic Go because it conflicts with the language's own definitions of the term, and in discussion elsewhere I've already linked the part of the language spec which defines zero versus nil. But this PR is probably not the topical place to relitigate this naming, either, seeing as how the term is not used anywhere in the diff body we're discussing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm strongly in favor of Nil as a constant. While Zero is technically correct, it's confusing.

However, as a method, I'd be fine with IsEmpty (or IsZero but I'd rather not).

As for the method, golang is pretty good about inlining one-line methods so it should be fine. After doing a bit of decompiling, it looks like we should go with:

func (c Cid) IsNil() bool {
    return c.string == ""
}

c == (Cid{}) seems to cost us an extra xor zeroing something out. c == Nil costs us a mov.

See: https://go.godbolt.org/z/CHyhZl

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Stebalien the IsNilString one looks wrong. The compiler is getting really smart there and looks like its comparing the empty string to itself, and calling that the output.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(discussed on slack)

The assembly was correct. The generated instruction was testq AX, AX which tests AX & AX to see if it's non-nil.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to speculate about if one would be easier/faster for refmt to munge than the other, because struct-containing-string seems like it going to require at least a few more reflection calls...

But on second thought that probably doesn't matter at all; CIDs are going to be using transform funcs either way because the serial type is either bytes (which is a different reflect.Kind than string, so needs transform) or base-$n$ encoded string (which... needs to be de-base-$n$'d, so that's a transform). So it's the same.

Are there any other things that would make B/F distinct? 🤔


### one way or another: let's get rid of that star

We should switch completely to handling `Cid` and remove `*Cid` completely.
Expand Down