Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

selectors vs selector dmt vs selector builders: confusing constructor proliferation #192

Closed
warpfork opened this issue Jun 9, 2021 · 5 comments

Comments

@warpfork
Copy link
Collaborator

warpfork commented Jun 9, 2021

There are several different types in circulation when working with selectors in go-ipld-prime. Some of this is essential complexity; some of it is not.

Of these types, they do very slightly different things. Some can be converted to others, but the conversations are not bidirectional. It's not clear which are meant to compose (and how) and which aren't. The ease of use is lower than it should be as a result of all this.

First, let's enumerate these different types and talk about them a bit. Afterwards, let's think about which of these we should emphasize or deemphasize when planning new APIs and writing documentation.

Enumerating

  • 1: the "executable" selector (selector.Selector)
  • 2: the "dmt" (has no type, it's just ipld.Node that has to have a certain data convention inside)
  • 3: the builder.SelectorSpec, which is a wrapper for # 2 and also has a builder method for # 1.

You'll instantly notice that # 2 isn't much of a type. It's pretty much our IPLD equivalent of golang's interface{} -- it's practically a wildcard. Nonetheless, this should count here, because selector.Parse()'s input parameter is... ipld.Node.

Why did we end up with all these?

  • # 1 (the "executable" selector) exposes the functions needed while actually executing.

    • Some preprocessing happens while constructing these, which saves time when executing.
    • The act of constructing these and getting a handle to this type in golang means it's been validated!
    • These are relatively lightweight to allocate, compared to making a general-purpose ipld.Node tree out of basicnode, which is what the "dmt" form (# 2) defacto is. This is important because Selectors sometimes return other Selectors during execution (e.g. recursion does this a lot).
      • This might be an outdated reason. Selectors were written before we had codegen, and way before we had the bindnode option available. We might be able to revisit this now.
  • # 2 (the "dmt", the declarative Data Model Tree defining the selector) exists... because it has to.

    • This is our "thin waist". This is how Selectors are actually defined, specified, serialized, and communicated.
      • The first step of getting a selector ready to go is you write it in JSON (or any other format there's an IPLD Codec for!), then parse that into Data Mode (aka ipld.Node), and then... you're here; now you call selector.Parse(node) with it.
      • ... or, you use # 3, below, but that does indeed still bounce through here, if briefly.
    • You actually have to go through here to get to # 1. We made a bunch of the fields in # 1 unexported, and the # 1 types are immutable in golang (and relateldy also have a lot of validity guarantees), because we routed all construction through selector.Parse.
  • # 3 (selector/builder.*) exists as a golang DSL to help create selector values.

    • It accumulates data, provides some helper methods that are specific to Selectors for doing it, and has functions that yield either # 2, or jumps through that quietly, calls selector.Parse on it, and returns you # 1.
    • It's a very golang-specific thing. It does reduce the syntactic volume required to create selectors. But if you're parsing a serialized selector spec, you'd skip this abstraction entirely.
    • (this may be a tangent) These also potentially have some historical flavoring which is worth being aware of
      • it predates a lot of our other developments on syntactic smoothness for wrangling Data Model data. It predates almost everything in the fluent/* package family. We might be able to do this better now.
        • (this is extremely dissentable, as you'll see in the replies.)
      • it also predates the whole NodeAssembler revamp (that's way back in v0.0.3!), and correspondingly it's... a little funny shaped, compared to how we'd probably do it now. And it's pretty allocation-heavy, too (in ways that NodeAssembler approaches were designed to fix).
      • maybe some golang DSL like this is useful. I just wanna be clear that this one has a lot of historical baggage at this point, so it's probably worth taking it with a grain of salt. It's full of choices that made sense at the time, and "the time" is now a good while ago.

What should we be emphasizing?

if we can unify 1 & 2...

... then we probably should do that. It makes a whole lot of complexity disappear in a puff.

Also, as mentioned earlier, this might be a lot more tractable now than it was when all of our Selectors code was written. We have codegen now, and/or bindnode, both of which provide ways to use memory more efficiently, making unification of the executable selector and the dmt form potentially viable. So, there's good reason to do a mental cachebust and reconsider if this is possible.

I suspect the need for # 3 (the builder.SelectorSpec stuff) would disappear entirely too, if we could unify # 1 and # 2. There'd just be no more demand for it.

A couple of things to watch out for in this, though:

  • Would there be performance regressions? (It's possible the answer is "no" now, but it's not 100% certain without checking.)
  • Would we lose the "selector.Selector means it parsed" semantics? We do currently have features like ExploreRecursive will check to make sure it contains an ExploreRecursiveEdge while we're in the process of going from ipld.Node->selector.Selector, and that's an example of something we can't replicate with Schema-based validation alone. We'd have to decide how critical this is. (I tend to lean pretty far towards "make the compiler help you enforce invariants about when things are known to be valid", but people may disagree about the cost/benefit here.)
  • Is there a chance we're going to increase the future defect rate in the Selector implementations by putting the executable form and the serializable form in the same types? Maybe. (I think this is probably a low probably problem in this scenario, because the selector.Selector types are already very immutable in their usage. If that wasn't the case, this would be a much bigger concern. I comment on this mostly for the sake of anyone reading this in the future and looking for guidance on their own designs for other systems; it may be a concern for you, future hypothetical reader.)

... if we can't unify them:

... then we should still emphasize # 2 (the "dmt" form) as the central focus for most of our APIs.

The transition from dmt->executable is still one way, so composability is best served by emphasizing the dmt form.

(And the # 1 / executable form is also still not actually composible, itself, anyway -- again, on purpose -- because we made a point of disallowing ExploreRecursiveEdge without an appropriate parent, which has the additional affect of making it impossible to construct these from the leaves up.)

but we should still get a golang type, so this feels better.

I think a problem that I'm observing from how folks are using these in the wild is... people are very reticent to hold onto or pass around the "dmt" form -- the ipld.Node value, in golang -- because it feels icky. It feels untyped. It's not obvious how useful the ipld.Node value with this data is, and it's not very discoverable how to use it by just looking at library API type signatures.

That's a bummer. It also makes total sense. I'd get similar feelings looking at this. But it's a bummer, because it's nudging people away from using the thing we'd most like for them to use.

We can observe that sometimes this results in worse code when it's not obvious that the "dmt" form might be worth hanging onto: here is an example where selector builder code was copy-pasted between projects, because the upstream only kept the fully compiled selector (which is noncomposable).

But we have solutions for this. Either codegen, or use of the new bindnode system, would let us have a named golang type for the selector dmt form. That should then help with the obviousness problem -- library docs get much more discoverable when there's a more specific golang type name available.

(We could also just use type aliases to make the library API documentation more discoverable... but this would not result in a lot of (scratch that: would not result in any) actual compile-time safety, so it's much less desirable than the codegen or bindnode options, which do give some compile-time safety in addition to a named type.)

what about selector/builder.*?

I think we should avoid using these in the signatures of library functions that compose selector spec hunks.

It's perfectly innocuous to continue to use these in applications. But libraries are a different matter.

It's not that the builder system is bad, it's... just that the thing it is is a golang DSL. Building more library layers above it runs an increasing risk of making design errors where a design choice is nontranslatable to other languages, or starts to significantly diverge from the serial form. Building more library layers above this would also likely just become too many layers of indirection (DSLs can be great, but only when user-facing; something wrapping them is usually the beginning of a nasty complexity flowering).


Sidenote: I think if we wrote all this today, I'd probably have called selector.Parse(ipld.Node) (selector.Selector, error) as selector.Compile instead. That evokes a lot more of the right connotations (such as it moving you towards "ready to run", and away from a "source" form, and, making that move in a probably not-entirely-reversable way).

Ah well. Hindsight helps everything. Naming is no exception to that rule.


I'm not sure if this is an issue I wrote up with the expectation of immediate action in mind. But I thought it would be a useful hunk of thought to have as other discussions about selector ease-of-use are started. It was also fuelled in part by conversations today about APIs which have tried to use builder.SelectorSpec.

@hannahhoward
Copy link
Collaborator

If we're going to de-empahsizing the builders, we need to provide something equally user friendly for making selectors. The main thing I can see producing that is something built on bindnode once it's done.

There is a general misconception here that working with qp and fluent get you to a point of usability that is actually workable for everyday programmers building IPLD data structures in code. In fact, what I see in practice is that people sidestep the builders entirely to work with go structures inside codegen'd nodes inside the package they're generated in, and provide public methods to make things easier. Example:
-- https://github.com/filecoin-project/dealbot/blob/main/tasks/storage_deal.go#L404
-- https://github.com/filecoin-project/dealbot/blob/main/tasks/client.go#L6
-- https://github.com/filecoin-project/dealbot/blob/main/tasks/client.go#L35

I understand the reasons for the builder interfaces, but we need to be realistic that we're several steps away from something that can be used to build data structures easily in everyday code.

@mvdan
Copy link
Contributor

mvdan commented Jun 9, 2021

I agree with Hannah - Node and Assembler are powerful interfaces, but just don't compare to plain Go expressions for most use cases. I also sidestepped the interfaces in the HAMT ADL: https://github.com/ipld/go-ipld-adl-hamt/blob/b2c8ddbaab060c3deb13863eb9f37b9542992f64/node.go#L175

@warpfork
Copy link
Collaborator Author

warpfork commented Jun 9, 2021

I really want to be clear that I don't think the builders are bad.

I'm just really worried that building more library things on top of them is putting DSLs on a DSL, and I'm concerned that in turn will cause us to spend lots of effort polishing and refining something that, because of the additional layers of indirection, will be dangerously likely to evolve further and further away from the declarative, serializable stuff at the core.

And it's the declarative serializable stuff at the core that we should care about the most, because that's what's relevant to:

  • any other language's implementation,
  • and any end users who author these things.
    • ... at least, until we get some other end-user-facing tools which take yet other DSLs, and transpiles them into Selector DMT. Which, while a perfectly beautiful idea, does also continue at present to be hypothetical.
      • and this story still routes through the declarative serializable DMT, so that's still what remains most critical.

If an application author wants to use golang DSLs for writing their selectors -- fine.

But what about applications where users supply the selector from the command line (and we eschew #189 for now, or say they want the "full power" mode so they're using the full DMT JSON)? That's a totally different story. And I think in this story, the DSLs shouldn't be in the processing path at all, because they just don't add any value. selector.Parse(ipld.Node) (Selector, error) is sufficient.

And most of the helpful functions I'd want to add are probably in the territory of selector.ParseJsonSpec(io.Reader) (Selector, error) (which would just glue a json decoder and the selector parse method together, and call it a day -- half the point of the function is just to increase the obviousness that that can be done).

I'll edit the original post a bit to try to be clearer on this.


w.r.t. the "Either codegen, or use of the new bindnode system, would let us have a named golang type for the selector dmt form." part of the original post: I should clarify a bit more why I'm concerned about that:

I'm not necessarily claiming it'll add all the helper methods one would ever want, or that it negates any possible value of having a builder DSL. (Ya'll probably right, there's still journey to go there.)

I'm claiming having a named type would make it easier to make library APIs more self-documenting and navigable, and encourage people to hold onto and pass around the right values.

The reason I'm concerned about this is another piece of code @ribasushi brought up yesterday: https://github.com/filecoin-project/lotus/blob/efeaea76b97fd52203122ee33a7633da74ba9631/node/impl/client/client.go#L738-L739 -- what happened here is that it was insufficiently obvious to folks that storing the ipld.Node might actually be viable and meaningful, and that resulted in copypasta. Now, there's a couple of different ways this could've gone which would've avoided the copypasta, but that "just hang onto the data model tree" seemed to have a SEP field around it is scary. (I'll edit the original post to include this link, that part is a bit vague without the example.)

@warpfork
Copy link
Collaborator Author

After thinking about this for a while: One of the things I want to do to help guide users to the right entrypoints for all these APIs is probably a package layout refactor. Notes on that in #236 .

@BigLep
Copy link

BigLep commented May 10, 2022

2022-05-10 conversation: we don't expect this to get done unless Bedrock takes it on. Bedrock work for selectors will happen in #236 (if it happens at all).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants