State that implementations SHOULD accept a schema retrieval IRI / initial base IRI #1299

Closed
handrews opened this issue Sep 19, 2022 · 31 comments

Comments

@handrews
Contributor

handrews commented Sep 19, 2022

While we discuss initial base IRIs, at no point do we make it clear that it's advantageous to allow an application to supply a retrieval IRI or other IRI for use as a base IRI in the absence of an absolute $id. Implementations vary with respect to how they handle this (or don't).

This strikes me as a SHOULD requirement rather than a MUST. It's advantageous to allow, but JSON Schema is usable without it, and the exact mechanism is not something we should specify. It's also conceivable that a specialized implementation would have a reason to skip this on the grounds of minimizing code or knowledge that external IRIs of this sort will not be useful in its intended execution context.

CLARIFICATION: "Initial base URI" is not the same thing as RFC 3986 §5.1.4's "Default base URI". An initial base URI can come from any of the sources described in RFC 3986 §5.1.2 - 5.1.4. This issue doesn't have anything to do with setting a broader §5.1.4 default base URI. Several comments related to that concept have been marked off-topic.

@gregsdennis
Member

gregsdennis commented Sep 19, 2022

For an example of this implemented in the wild, see the playground at https://json-everything.net/json-schema. There is an option that allows the user to specify a default base URI.

@mwadams

mwadams commented Sep 19, 2022

Agreed. I have been toying with this thought.

I have a "rebase as root" option which allows you to take a JSON island and rebase it as if it were an isolated document root; and I also have a "don't rebase this but treat this island as a base schema embedded in a non-schema document" (which is essentially the "OpenAPI" case).

But I don't have a "use this IRI as the base for a schema, instead of the (potentially local filesystem-based) IRI you actually came from" option.

@gregsdennis
Member

I definitely think that we need to include language that states this is only a fallback for when $id is absent. $id MUST NOT be overridden.

@handrews
Contributor Author

@gregsdennis good point. Should be obvious, but I am well aware that hardly anyone actually reads RFC 3986 no matter how much I xref it 😛

@handrews
Contributor Author

handrews commented Sep 21, 2022

[EDIT: The requirements I wrote here were wrong, better ones forthcoming in a new comment.]

@handrews
Contributor Author

OK, thinking this through again and hopefully correctly this time, I think the requirements are:

  • SHOULD support accepting a retrieval/initial base IRI (RIB IRI) for a schema resource
  • MUST consider the retrieval IRI of an embedded resource to be the JSON Pointer fragment IRI using the base IRI of the enclosing resource
  • MUST set the RIB IRI as the base IRI for a resource prior to processing its root schema object
  • If the resource root has $id, MUST set the base IRI (and canonical IRI if we keep that distinction) for the resource to the resolved $id <-- Greg, this is the "do not override" rule

Note that with this approach, embedded schema resources and schema documents work the same way in terms of retrieval IRIs. Technically, this would mean that we don't need to specify that part of the behavior as part of the $id keyword. It's just normal retrieval-IRI-as-initial-base behavior.
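As a rough illustration of the requirements listed above, here is a minimal sketch of how a loader might combine a caller-supplied initial base IRI with a root `$id`. The function name `resolve_base_iri` and its `initial_base_iri` parameter are hypothetical, not taken from any implementation or from the spec, and Python's `urllib.parse` is used as a stand-in for a full RFC 3986/3987 resolver.

```python
from urllib.parse import urljoin, urlsplit


def resolve_base_iri(schema, initial_base_iri=None):
    """Return the base IRI for a schema resource root.

    `initial_base_iri` is a hypothetical caller-supplied retrieval/initial
    base IRI. It is only a starting point and never overrides `$id`.
    """
    # Set the supplied IRI as the base before looking at the root object.
    base = initial_base_iri or ""
    schema_id = schema.get("$id") if isinstance(schema, dict) else None
    if schema_id is not None:
        # A relative $id is resolved against the initial base IRI;
        # an absolute $id simply replaces it ("do not override" rule).
        base = urljoin(base, schema_id)
    if not urlsplit(base).scheme:
        # Neither the caller nor $id produced an absolute IRI.
        raise ValueError("no absolute base IRI could be determined")
    return base
```

With this sketch, a schema without `$id` keeps the supplied IRI as its base, a relative `$id` is resolved against it, and a schema with a relative `$id` and no supplied IRI is rejected rather than silently left with an unresolvable base.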

The point I'm less clear on is whether the retrieval IRI needs to be associated with the schema resource root. This would allow it to be used in $ref even if it is not the canonical IRI (or otherwise setting the base IRI for the resource contents).

I would say that it is a bad practice to use such an IRI in $ref in general, but if an implementation does on-demand reference resolution with network fetching, then it's going to be using a retrieval IRI by definition. So I guess there are three cases of resources with differing $id and retrieval IRIs:

  • schema document provided to the implementation with a retrieval IRI
  • embedded schema resource, in which case the single case of a "shadowed" (resource boundary-crossing) JSON Pointer IRI pointing to the location where the resource is embedded could be usable alongside the canonical IRI, even if shadowed JSON Pointers are otherwise unsupported. This case is really more overlapping resource boundaries rather than crossing them.
  • schema document retrieved on demand from a schema that by definition knows the retrieval IRI, but might not be aware of the internal $id value to use in references otherwise

@jdesrosiers

This comment was marked as off-topic.

@handrews

This comment was marked as off-topic.

@handrews

This comment was marked as off-topic.

@jdesrosiers

This comment was marked as off-topic.

@handrews

This comment was marked as off-topic.

@handrews

This comment was marked as off-topic.

@jdesrosiers

This comment was marked as off-topic.

@handrews
Contributor Author

handrews commented Sep 26, 2022

Nothing in this issue changes any functionality or has anything at all to do with how RFC 3986 is applied.

§9.1.1. Initial Base IRI describes that an initial base IRI (an IRI against which a relative IRI-reference $id will be resolved, or which will be used as the resource's base IRI if the resource root does not contain an $id) can come from various RFC 3986-defined sources.

It does not have any normative language indicating an implementation requirement for accepting such an external initial base IRI. There is an implicit requirement in the description of how such a thing can be used. All this issue says is that we should make the existing implicit requirement explicit and normative.

@jdesrosiers has informed me that he considers such a change to contribute to bloat in the spec, and that §9.1.1 repeats too much from RFC 3986, but has agreed that his concerns about bloat can be addressed separately from this issue. I happen to agree with him that §9.1.1 can be reduced substantially, but that is not the topic for this particular issue.

All of our implementation requirements need clear and explicit normative language. Nothing in this proposal changes anything about how RFC 3986 is applied. Therefore, I am marking all comments that assume that this issue proposes such changes to be off-topic. I have updated the original comment so that it is more clear what is within the scope of this issue.

@handrews
Contributor Author

Thinking further on this, can anyone come up with a reason that this needs to be a SHOULD and can't be a MUST? What are valid reasons for not being able to accept an initial base IRI (retrieval or otherwise)?

An implementation that cannot accept an initial base IRI will be unable to process schemas that use non-fragment-only relative references, at least in general. Figuring out whether relative references can be resolved without a base IRI can only really be done within a single schema document.

I'm inclined to do the following:

  • MUST (was SHOULD) support accepting an initial base IRI (RIB IRI) for a schema resource. This initial base IRI should be determined per RFC 3986 rules, but we can't enforce that (and it's occasionally useful to do something different anyway).
  • MUST consider the retrieval IRI of an embedded resource to be the JSON Pointer fragment IRI using the base IRI of the enclosing resource. This just clarifies what is already true, in case an implementation wants to break apart a compound document.
  • MUST set the RIB IRI as the base IRI for a resource prior to processing its root schema object
  • If the resource root has $id, MUST set the base IRI (and canonical IRI if we keep that distinction) for the resource to the resolved $id. This is already covered by $id; it's not a new requirement at all.
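To make the embedded-resource rule in the list above more concrete, the following sketch builds a JSON Pointer fragment IRI from the base IRI of the enclosing resource. The helper name `json_pointer_fragment_iri` and the example IRIs are assumptions for illustration only. The escaping follows RFC 6901, and `urllib.parse.quote` is used as an approximation of fragment percent-encoding.

```python
from urllib.parse import quote, urldefrag


def json_pointer_fragment_iri(enclosing_base_iri, path_tokens):
    """Build the JSON Pointer fragment IRI for a location in a resource.

    `path_tokens` is the list of object keys / array indices leading from
    the enclosing resource root to the embedded resource.
    """
    # RFC 6901 escaping: "~" becomes "~0" first, then "/" becomes "~1".
    escaped = [str(t).replace("~", "~0").replace("/", "~1") for t in path_tokens]
    pointer = "".join("/" + t for t in escaped)
    # Percent-encode anything that is not allowed in a URI fragment.
    fragment = quote(pointer, safe="/~!$&'()*+,;=:@?")
    # Drop any fragment already present on the enclosing base IRI.
    base, _ = urldefrag(enclosing_base_iri)
    return f"{base}#{fragment}"
```

For an enclosing base of `https://example.com/root.json` and the tokens `["$defs", "item"]`, this yields `https://example.com/root.json#/$defs/item`, which would then serve as the initial base IRI against which the embedded resource's own `$id` is resolved.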

@awwright
Member

awwright commented Oct 6, 2022

at no point do we make it clear that it's advantageous to allow an application to supply a retrieval IRI or other IRI for use as a base IRI in the absence of an absolute $id.

can anyone come up with a reason that this needs to be a SHOULD and can't be a MUST?

Technically, the requirement to use a base URI/IRI, even if it must be provided out-of-band, is already specified in RFC 3986/3987, and is incorporated by normative reference; so I think MUST or SHOULD would technically over-constrain/redundantly constrain the specification. However, some implementation guidance would be warranted here (with lowercase "should", and also pointing out that if you don't provide this option, some schemas cannot be parsed, as you point out).

@handrews
Contributor Author

handrews commented Oct 6, 2022

so I think MUST or SHOULD would technically over-constrain/redundantly constrain the specification

No, RFC 3986 tells you how the base URI is to be calculated. It does not impose any requirements on how an implementation of anything does or does not have to accept a base URI. As currently written, it is entirely compliant with both RFC 3986 and JSON Schema to not accept a base URI/IRI at all.

JSON Schema is the only specification that controls what JSON Schema implementations are required to offer. If we don't require this, it's not required. There is no reason to make it a lowercase "should." An uppercase SHOULD is correct as it facilitates interoperability by ensuring that changing from one implementation to another won't result in a loss of base URI-setting functionality.

It's debatable as to whether it ought to be a SHOULD or a MUST, and the question there is what is the downside of making it a MUST? What are the conditions under which it is advantageous to not accept a base URI? If there are significant situations where it is advantageous, that makes this a SHOULD. If there are not, then it is better as a MUST, again to ensure that different implementations offer a consistent feature set.

@handrews
Contributor Author

handrews commented Oct 6, 2022

And yes, the consistency issue is real. I filed this after surveying all implementations listed on our implementations page and realizing that implementations are horribly inconsistent in this area, and often not even well-documented. It's clear that there are real, practical problems caused by the under-specification of this behavior.

@awwright
Member

awwright commented Oct 6, 2022

Isn't this implicit? A base URI is required to resolve a URI Reference; and this base URI might be a document URI that's only known out-of-band; therefore, if a validator doesn't accept a default base URI, some URI References will be unresolvable, or will use the application-specific base URI.

If we want to avoid using the application-specific URI, then that would be a good excuse to use a SHOULD, I think.

@handrews
Contributor Author

handrews commented Oct 6, 2022

if a validator doesn't accept a default base URI, some URI References will be unresolvable, or will use the application-specific base URI.

The point is that people have implemented the spec without the ability to set an external base URI (e.g. a retrieval URI or application-defined base URI), which leads to failure to resolve relative references, which leads to user confusion when they can't get their schemas to work. All of which is 100% compatible with RFC 3986 and JSON Schema as written.

This makes the implementation less usable. The simple solution is to allow specifying such a base URI. It is objectively verifiable fact that implementers do not consistently understand this, and therefore do not offer the ability consistently. That is an interoperability problem if you are relying on this totally reasonable RFC 3986-compliant process.

The simple solution to this interoperability problem is normative language. There is no normative language in either RFC 3986 or JSON Schema to guide implementations. So let's put it in there. Why is this such a problem? Please explain to me the actual, measurable downsides.

@jdesrosiers
Member

I just filed #1322 about dropping the "initial base URI" concept in favor of standard RFC-3986 terms. I think that should be resolved before this can be decided since this is based on the "initial base URI" concept.

@handrews
Contributor Author

handrews commented Oct 6, 2022

@jdesrosiers for this issue the only thing that matters is whether a JSON Schema implementation can accept a base URI to use or not.

I am still waiting for anyone to explain to me why it is advantageous to maintain the status quo that has resulted in implementations not accepting such a base URI, and apparently not being aware that it's a thing. Or why it would be harmful to throw in a SHOULD for this.

Folks, please give me literally anything to work with here. "RFC 3986 implies this" is not enough, as that does not clarify implementation requirements for JSON Schema. RFC 3986 defines a process, not implementation requirements.

@jdesrosiers
Member

for this issue the only thing that matters is whether a JSON Schema implementation can accept a base URI to use or not.

I disagree and I explain why in #1322. It makes a difference whether we suggest implementations take an initial base URI vs a retrieval URI vs a default base URI.

I am still waiting for anyone to explain to me why it is advantageous to maintain the status quo that has resulted in implementations not accepting such a base URI, and apparently not being aware that it's a thing.

I addressed this in #1299 (comment). You insisted that it was "off-topic". We are not maintaining the status quo. This issue was addressed in the UJS documentation a year ago. A year is very new on these timescales. I think we just need to give it enough time to have an effect. Most implementations were written before that was released.

"RFC 3986 implies this" is not enough, as that does not clarify implementation requirements for JSON Schema.

Yes, RFC-3986 puts constraints on what you can implement, but doesn't tell you what you should implement. I just don't think it's necessary for the spec to prescribe or even suggest that implementations provide this feature. I agree with your assertion that the issue is largely implementers "not being aware that it's a thing". That problem is now being addressed in UJS. I think that's enough.

@handrews
Contributor Author

handrews commented Oct 7, 2022

That problem is now being addressed in UJS. I think that's enough.

Only normative requirements are testable. UJS is important as education and guidance (of which I agree we have too much in the spec), but from a spec compliance perspective, it's irrelevant.

Let's assume the test suite was expanded to explicitly test the input interface of JSON Schema implementations. It doesn't matter how (I'm aware that there would be many difficulties), let's just assume it works.

I want a test case that tests whether a JSON Schema implementation can accept a base IRI and use it to resolve a relative $id in the document root. This is needed to support base URIs in accordance with RFC 3986 §5.1.2-5.1.4, and I want to know which implementations support that feature. I am aware that we have some test cases that involve this sort of behavior, but they do not explicitly test that you can supply a base URI for a given schema evaluation.

I want that test case to either be in the required set, or the "should" set, depending on whether we think the requirement is absolute or whether it can be disregarded under some circumstances.

@Julian, would the mention in Understanding JSON Schema that @jdesrosiers notes be sufficient for you to agree to such a required or "should" case? Would the current language in §9.1.1 of the core spec be sufficient?

My expectation would be that no, it would not. There is nothing in the spec that states such a requirement. Between RFC 3986 and JSON Schema Core, it's clear how a retrieval URI, default URI, etc. is intended to be used if it can be used. But there is nothing that requires an implementation to be able to use it.

We can't consider something a testable requirement when it's not even a requirement at all. UJS does not create requirements, and §9.1.1 doesn't state one either.

@Julian
Member

Julian commented Oct 7, 2022

(Hi hi. Will review the relevant section but may take me till Monday to do so, as I'm flying the next few days.)

@jdesrosiers
Member

Of course a UJS page doesn't imply any requirement on implementations. It just helps people understand the concepts so they can make better choices about what they support in their implementation. You can still have tests in the test suite for these features to help understand what implementations support, but they would be optional. Implementations that don't support those features don't need to pass those tests to be in compliance.

@handrews
Contributor Author

handrews commented Oct 7, 2022

You can still have tests in the test suite for these features to help understand what implementations support, but they would be optional. Implementations that don't support those features don't need to pass those tests to be in compliance.

The whole point of this issue is to clarify this as a SHOULD or MUST requirement. You are welcome to argue that the requirement ought to be MAY or unstated, but despite repeated requests you have not explained what problems would be caused by a SHOULD or MUST requirement.

Your statement that such a test case would be optional proves my point that this needs normative language. If you think that SHOULD is too high of a requirement, please explain what problems a SHOULD requirement would create. Likewise for MUST. I've asked this several times without a response. I already understand that your opinion is that the status quo is fine, but I have given examples of why I think this is an important use case to support, and therefore worth at least a SHOULD.

What problems would a SHOULD requirement cause? Who would be harmed, and how? As an example, if we were to state a MUST requirement to support uniqueItems for arbitrarily large arrays with arbitrarily large items, that would cause harm by placing unrealistic memory requirements on JSON Schema implementations. If we were to state that $schema MUST always be present, that would cause harm by making casual schema use in closed systems more cumbersome.

What is the harm caused by a requirement that implementations SHOULD accept a base URI?

@handrews
Contributor Author

handrews commented Oct 7, 2022

To re-state the use case:

There are many use cases for a non-RFC3986-compliant base URI. RFC 3986 tells us how to determine the base URI correctly, but sometimes what is technically incorrect is more appropriate. A set of schemas that can be hosted at different locations (because it is part of a system that is expected to run behind firewalls and therefore not have a globally accessible URL) will have relative $ids. During development and testing, it would make more sense to load them from a filesystem, which means the retrieval URIs would be file:// URIs. But instead we want https:// URIs because that's what will be used in production. So we override RFC 3986 and supply the base URI for the test environment we need.
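The use case above can be sketched in a few lines. The file path, the `https://` host, and the helper name `resolved_id` are all hypothetical. The point is only to show how the same relative `$id` resolves differently depending on which base URI the implementation is given.

```python
from urllib.parse import urljoin


def resolved_id(schema, base_uri):
    """Resolve the schema's relative $id against the given base URI."""
    return urljoin(base_uri, schema["$id"])


schema = {"$id": "schemas/address.json", "type": "object"}

# Base URI that RFC 3986 would give when loading from disk (hypothetical path).
retrieval_uri = "file:///work/project/root.json"
# Base URI the test environment supplies instead (hypothetical production URL).
supplied_base = "https://schemas.internal.example/v1/root.json"

print(resolved_id(schema, retrieval_uri))
print(resolved_id(schema, supplied_base))
```

An implementation that can only use the retrieval URI would identify the schema under `file://`, while one that accepts a supplied base URI lets the test environment see the same `https://` identifiers that production will use.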

@Julian
Member

Julian commented Oct 10, 2022

To tie the loop on

@Julian, would the mention in Understanding JSON Schema that @jdesrosiers notes be sufficient for you to agree to such a required or "should" case? Would the current language in §9.1.1 of the core spec be sufficient?

I'll share my opinion just on the "where does it seem to me it should go" question, not on what I understand the root proposal (and/or disagreement) to be, which, to explicitly ensure I follow, is about whether our specification should specifically recommend implementers include a place to specify a base URI to use globally when encountering a schema which doesn't indicate one and needs it. My brief understanding of @jdesrosiers' objection to this is I think that Jason you prefer we instead recommend implementations take a retrieval URI per-schema, and are less bothered by suggesting all implementations also support a global default base URI to use when seeing a schema with no retrieval URI and which uses relative references.

So if that's all a correct basic understanding of the proposal (or even if it isn't), I think the ping to me was just to see if I agree about where tests would go for such a thing?

In the current test suite layout where everything goes in optional/, the answer is easy, it's "there". It doesn't matter that the spec doesn't have specific language for recommending that implementations do this since we have just one bucket for "optional", in some sense simple. Jason I'm glad/surprised you said:

You can still have tests in the test suite for these features to help understand what implementations support, but they would be optional. Implementations that don't support those features don't need to pass those tests to be in compliance.

I thought you disagreed we should have such tests, but maybe I misunderstood your opinion before. I agree certainly with this line though, if we want implementations to support this and it isn't required, then yeah, there.

If we're talking post json-schema-org/JSON-Schema-Test-Suite#590 where now we separate things by whether the spec really recommends them or says nothing (which isn't merged yet because I only had feedback from Greg who +1'ed, Henry who sounded -0ish, and Jason who was +0ish, so I'd like to see some other folks who use the suite comment first):

In the current language which doesn't address this yeah I'd think this should go in additional not should, since it doesn't appear anywhere (and since it seemed after the brief back and forth with Austin? that everyone's in agreement that RFC3986 doesn't cover this recommendation either?) -- which is what you're trying to address, yes Henry? You feel strongly that implementations should support this, and are trying to suggest language to make the tests go into should instead because we explicitly recommend doing this.

If I got all that right, to me it seems there's a core disagreement about whether we want this recommendation or not (and instead want one recommending implementations take retrieval URIs, and just light guidance on whether implementations support this too). I think my understanding of Jason's objection is he just doesn't think this is necessary, and prefers implementations instead focus on APIs that ensure every schema has a retrieval URI. His implementation, if I'm not mistaken, indeed doesn't support specifying a global base URI, it requires specifying retrieval URIs alongside every schema, so this change would mean hyperjump going against the SHOULD basically.

Hopefully some of that is correct / helpful?

@handrews
Contributor Author

handrews commented Oct 12, 2022

Thanks, @Julian , that is helpful.

is about whether our specification should specifically recommend implementers include a place to specify a base URI to use globally when encountering a schema which doesn't indicate one and needs it. My brief understanding of @jdesrosiers' objection to this is I think that Jason you prefer we instead recommend implementations take a retrieval URI per-schema, and are less bothered by suggesting all implementations also support a global default base URI to use when seeing a schema with no retrieval URI and which uses relative references.

This is incorrect. This issue has nothing to do with a global base URI (meaning an RFC 3986 §5.1.4 Default Base URI for JSON Schema implementations). I don't understand why this keeps becoming the topic of discussion, but perhaps it's due to the mis-use of the term "default base IRI" in §9.1.1 ¶1? I have filed draft PR #1324 to show what I consider to be the correct wording there, which is the starting point for this proposal.

My brief understanding of @jdesrosiers' objection to this is I think that Jason you prefer we instead recommend implementations take a retrieval URI per-schema, and are less bothered by suggesting all implementations also support a global default base URI to use when seeing a schema with no retrieval URI and which uses relative references.

§9.1 Loading a Schema, including §9.1.1 Initial Base IRI is about the process of loading a single schema (whether on its own prior to use, or as part of an instance evaluation). So this proposal is already a per-schema proposal.

I think my understanding of Jason's objection is he just doesn't think this is necessary, and prefers implementations instead focus on APIs that ensure every schema has a retrieval URI. His implementation, if I'm not mistaken, indeed doesn't support specifying a global base URI, it requires specifying retrieval URIs alongside every schema, so this change would mean hyperjump going against the SHOULD basically.

Currently, §9.1.1 duplicates a lot of RFC 3986 (which per #1322 we agree should be fixed), implying that relative references in a schema that does not have an absolute $id in its root can only be resolved either by accepting some sort of external base IRI (which could be determined by any of §5.1.2, 5.1.3, or 5.1.4 - JSON Schema has no way to tell) or by considering the JSON Schema implementation itself to be the "application" supplying a default base IRI in accordance with §5.1.4.

All this proposal does is strengthen the implied ability to take a base IRI of some sort to a SHOULD. It doesn't change or strengthen anything about the §5.1.4 "default base IRI" option. It does not say anything about how the ability to take a base IRI is to be accomplished (quoting from the initial comment):

the exact mechanism is not something we should specify.

According to Hyperjump JSC's documentation regarding schema identification:

An external identifier is an identifier that is specified outside of the schema. In JSC, an external identifier can be either the URL a schema is retrieved with, or the identifier specified when using Schema.add to load a schema.
...
Internal identifiers ($ids) are resolved against the external identifier of the schema (if one exists) and the resulting URI is used to identify the schema. All identifiers must be absolute URIs. External identifiers are required to be absolute URIs and internal identifiers must resolve to absolute URIs.

Therefore, Hyperjump JSC is already in compliance with this proposal. Which is why I do not understand the vehement objections. There is zero impact on Hyperjump (or, for that matter, python-jsonschema AFAIK). This is why I keep asking who would be harmed by this SHOULD, and how? I cannot figure out the downside.


As for the upside:

In the current language which doesn't address this yeah I'd think this should go in additional not should, since it doesn't appear anywhere

Right, that's what I would expect, and if we were to merge test suite PR 590 and add a test case for this, I'd expect it to go into additional. And that's the problem.

As @jdesrosiers has made clear by stating that he does not run the optional tests, and as can be observed by looking over how a variety of implementations do and do not use the test suite, a requirement that can only be tested in additional will not be consistently tested even if it is implemented.

When I asked about reference resolution tooling requirements in the AsyncAPI discussion (which, of course, incorporates JSON Schema referencing), the response was:

My main priority is that we should never end up in a case where a document works in one tool and not in another where both say they are spec compliant because you all of the sudden have a feature in one of the tools that the other is not aware of.

This is an interoperability concern and fits the recommendations of RFC 2119 §6 regarding the use of MUST and SHOULD.

The only way to guarantee consistent behavior as requested is with a MUST or SHOULD normative requirement. If it's a SHOULD, there ought to be guidance on when it is safe to disregard. Such a requirement could then be reflected in either the required test suite or the (future) "should" suite. At least, if one ignores that this is not necessarily testable through the validation output (although some aspects are, and are already covered).

@handrews
Contributor Author

My preference is to close this in favor of the referencing discussion. It can be re-filed if needed later. I will close this if there are no objections in the next few days.


6 participants