-
Notifications
You must be signed in to change notification settings - Fork 476
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added more supporting methods for URN paths. #201
Conversation
Use case: I work on some software that needs to accept URNs that may be improperly formatted and encode them properly. This patch introduces a new method, `recodeURNPath`, that will break the URN's path into segments and make sure that each part of it is encoded properly. Risks: URI.js treats "URN" as a catch-all category for everything that doesn't look like an HTTP URL. Many of these schemes may or may not follow the URN syntactical rules.
I think this is the single best PR I've been sent on any of my projects so far. Thank you. I agree with your assessments, thank you for the explanation. I have indeed considered the distinction between URLs and URNs to be syntactic. If |
// for usage in a URN. RFC2141 also calls out "-", ".", and "_" as acceptable characters, but | ||
// these aren't encoded by encodeURIComponent, so we don't have to call them out here. Also | ||
// note that the colon character is not featured in the encoding map; this is because URI.js | ||
// gives the colons in URNs semantic meaning as the delimiters of path segements, and so it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If :
is neither the URN path delimiter, nor the "industry default path delimiter", we need to talk about what is.
So before I address any specific comments, I think it might be better to have the discussion about how much more would need to change—just so we don't do any work that gets obsoleted soon. Personally, I think the distinction URI.js makes right now between URIs and URLs is fine, so long as it is explained in the docs and perhaps given a different name. There are some URI schemes out there that aren't technically URNs in the full RFC 2141 sense, but they are used to name things rather than locate them (and so they would fall under the "historical usage" talked about in RFC 3986). More importantly, many of these look a whole lot like URNs: see the DOI scheme (e.g., So, in practice, I think it's perfectly valid to have URI.js distinguish between two kinds of URIs: those that look like HTTP URLs ( The change I might suggest instead is to change the documentation so that the distinction is made between URN-ish and URL-ish URIs. As for the URN syntax and the use of the colon: According to the RFC, URNs have the following syntax: I just went and skimmed the RFCs of the first twenty formal URN namespaces registered here and it appears that every single one of them uses the colon as something like a path separator. So I think for all practical purposes URI.js is totally valid in giving the colon character special significance as a path separator. Again, there's no way that URI.js can accommodate all possible URI syntaxes (and all possible URN namespace-specific syntaxes), so I think practicality can win the day here. So, in conclusion, I think that after cleaning this current PR up, the only thing that would really need to change is the documentation, to make it clearer what URI.js really means by "URN". If you agree with that assessment, I'll go ahead and start cleaning up this PR. |
well. You should be able to work with every scheme. But that doesn't mean that stuff like
I do, for better or worse. Do you also want take a shot at the docs (may be a second PR if you wish)? |
I'll take a shot at the docs, but it may or may not be this PR. Let me figure out how hard it would be to do that. |
My local changes created a bug that caused an exception to be thrown during the `normalize` call of these methods, which then caused later tests to fail because the wrong decoding/encoding function was being used.
Alright, I've responded to the comments and made the changes to the docs. Let me know if you'd like other things changed. |
great stuff! What resources besides RFC 2141, the list of official URNs, RFC 2368 (other RFC for popular schemes?) do you think should go into the Readme.md? |
Thanks! I'll think on that last question. I'll address the new comments (as well as compile a list of resources) over the weekend. |
1. Used more meaningful tags in the documentation. 2. method names with "URN" in them have been camel-cased to use "Urn" instead. 3. A list of resources and a changelog has been added to the README.
Responded to all above comments. Do you want the above work to be rebased/squashed into a single commit? |
@@ -243,6 +246,7 @@ URI.js is published under the [MIT license](http://www.opensource.org/licenses/m | |||
|
|||
### master (will become 1.15.0) | |||
|
|||
* URNs are now normalized based on the syntax given by [RFC 2141](https://www.ietf.org/rfc/rfc2141.txt) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
proper format and content would be:
- fixed
.pathname()
to properly en/decode URN paths - (Issue #201, mlefoster) - fixing URI normalization to properly handle URN paths based on RFC 2141 syntax - (Issue #201, mlefoster)
- fixed
.normalize()
and.normalizePath()
to properly normalize URN paths - added
URI.encodeUrnPathSegment()
- added
URI.decodeUrnPathSegment()
- added
URI.decodeUrnPath()
- added
URI.recodeUrnPath()
- fixed
nice!
I can't say I have an opinion on that. But sure, you can squash the commits if you want to. |
README file updated. I won't bother with squashing, since there are some comments on commits. (Rebasing changes commit hashes, which means the comments become hard to locate.) |
Added more supporting methods for URN paths.
thank you! |
released v1.15.0, thank you for your support! |
Thanks for including it! |
Hey @mlefoster , follow-up question: @Munter has created a repository identifying various official iana and unoffical schemes. By itself I don't see how it could be helpful to URI.js in providing better URN support. The question is what data would we want to add to scheme in order to accomplish anything significant? Or is there nothing left to be improved/simplified? |
I can see that you have an ad-hoc scheme to default port mapping in URI.js. That part might be nice to add to the schemes data object, if there are official resources I can scrape for them |
For my own part, I mainly use URI.js as a syntactical helper (in particular, normalizing and sanitizing bad inputs, particularly with respect to percent-encoding). From that perspective, the two things I can think of would be 1) a better mapping of schemes to default ports for those URIs that have ports and 2) maybe a mapping of schemes to their general syntax: whether the scheme supports a query string, what the path separator is, etc. The former is something URI.js already supports; it's just a matter of filling it out more if appropriate. As for 2), the question is whether that's practical. Are there URI schemes that fall outside the general HTTP-like and URN-like dichotomy? And if there are, are those schemes common enough that making those changes would be worth it? Are there enough handwritten/poorly-formatted URIs of those schemes floating around that people have a need to normalize them? |
I don't think normalizing is the main feature to go for. Could a proper scheme mapping (whatever that would look like) support semantics of a particular scheme? Think about |
First, I apologize if this should really be an issue instead of a PR, so that the solution could be properly planned in advance.
I work on a product is required to accept URIs which may or may not be properly encoded and reformat them as properly-encoded URIs. This is a hard problem to solve on our own, and so URI.js has been a godsend. Unfortunately, the most common type of URI we handle is a URN; and URI.js does not currently have anything like
recodePath
for URNs. This patch implements arecodeURNPath
method for URNs and calls it while normalizing a URN path.This PR might carry a risk with it, unfortunately. There is some equivocation in the documentation about the difference between URNs and URIs, which is reflected in the overall structure of URI.js:
Technically,
URN
refers to a very specific thing: Uniform Resource Names as defined by RFC 2141 (the syntactic requirements of which are implemented in this patch). Though there is some degree of freedom to the structure of URNs in a specific namespace, they cannot break the rules laid out in 2141 (such as which characters are valid in a URN). In addition, this means thatmailto:
andspotify:
URIs are not actually URNs at all (and in fact, the RFC definingmailto:
scheme, RFC 2368, calls it a URL).I think URI.js is drawing its inspiration from this paragraph in RFC 3986 (which is mentioned in the quoted documentation above):
The problem here is that even this is still a purely semantic distinction, while the distinction between URNs and URIs as implemented in URI.js is syntactic: URLs are anything that looks like
protocol://user:password@hostname/path/with/slashes?query=string
and URNs are anything that looks likescheme:path:with:colons?query=string
.All this is to say: introducing the syntactic requirements to normalize real URNs as defined in RFC 2141 may or may not actually apply to the things URI.js treats as URNs.
That said, in practice I think the risk isn't very significant. For one, users of URN-ish URIs already could not rely on URI.js for proper normalization of paths, and so I see no harm in URI.js applying its best effort. If a consumer needs a particular kind of syntax for their custom URI scheme, they always have had to implement that themselves, and so nothing changes here.