Merge pull request #201 from RusticiSoftware/master

Added more supporting methods for URN paths.
medialize · Mar 31, 2015 · 55d5a98
2 parents 3e44dcf + 158011d

Showing 6 changed files with 173 additions and 35 deletions.
diff --git a/README.md b/README.md
@@ -158,6 +158,9 @@ Documents specifying how URLs work:
 * [RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax](http://tools.ietf.org/html/rfc3986)
 * [RFC 3987 - Internationalized Resource Identifiers (IRI)](http://tools.ietf.org/html/rfc3987)
 * [RFC 2732 - Format for Literal IPv6 Addresses in URL's](http://tools.ietf.org/html/rfc2732)
+* [RFC 2368 - The `mailto:` URL Scheme](https://www.ietf.org/rfc/rfc2368.txt)
+* [RFC 2141 - URN Syntax](https://www.ietf.org/rfc/rfc2141.txt)
+* [IANA URN Namespace Registry](http://www.iana.org/assignments/urn-namespaces/urn-namespaces.xhtml)
 * [Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA)](http://tools.ietf.org/html/rfc3492)
 * [application/x-www-form-urlencoded](http://www.w3.org/TR/REC-html40/interact/forms.html#form-content-type) (Query String Parameters) and [application/x-www-form-urlencoded encoding algorithm](http://www.whatwg.org/specs/web-apps/current-work/multipage/association-of-controls-and-forms.html#application/x-www-form-urlencoded-encoding-algorithm)
 * [What every web developer must know about URL encoding](http://blog.lunatech.com/2009/02/03/what-every-web-developer-must-know-about-url-encoding)
@@ -243,6 +246,13 @@ URI.js is published under the [MIT license](http://www.opensource.org/licenses/m
 
 ### master (will become 1.15.0)
 
+* fixed [`.pathname()`](http://medialize.github.io/URI.js/docs.html#accessors-pathname) to properly en/decode URN paths - ([Issue #201](https://github.com/medialize/URI.js/pull/201), [mlefoster](https://github.com/mlefoster))
+* fixing URI normalization to properly handle URN paths based on [RFC 2141](https://www.ietf.org/rfc/rfc2141.txt) syntax - ([Issue #201](https://github.com/medialize/URI.js/pull/201), [mlefoster](https://github.com/mlefoster))
+  * fixed [`.normalize()`](http://medialize.github.io/URI.js/docs.html#normalize) and [`.normalizePath()`](http://medialize.github.io/URI.js/docs.html#normalize-path) to properly normalize URN paths
+  * added `URI.encodeUrnPathSegment()`
+  * added `URI.decodeUrnPathSegment()`
+  * added `URI.decodeUrnPath()`
+  * added `URI.recodeUrnPath()`
 * fixing `URI(undefined)` to throw TypeError - ([Issue #189](https://github.com/medialize/URI.js/issues/189)) - tiny backward-compatibility-break
 
 ### 1.14.2 (February 25th 2015) ###

diff --git a/about-uris.html b/about-uris.html
@@ -50,9 +50,39 @@ <h2>Understanding URIs</h2>
     </p>
 
     <p>
-        URLs are used to address the individual resources of your website. 
-        URNs are usually used for hooking into other applications, as <code>mailto:</code>, <code>magnet:</code> or <code>spotify:</code> suggest. 
-        While RFC 3986 defines the structure of an URL in depth, URNs are not. The structure (and meaning) of URNs are up to their distinct specifications.
+        URNs <em>name</em> a resource.
+        They (are supposed to) designate a globally unique, permanent identifier for that resource.
+        For example, the URN <code>urn:isbn:0201896834</code> uniquely identifies Volume 1 of Donald Knuth's <em>The Art of Computer Programming</em>.
+        Even if that book goes out of print, that URN will continue to identify that particular book in that particular printing.
+        While the term &quot;URN&quot; <em>technically</em> refers to a specific URI scheme laid out by <a href="http://tools.ietf.org/html/rfc2141">RFC 2141</a>,
+        the previously-mentioned RFC 3986 indicates that in common usage &quot;URN&quot; refers to any kind of URI that identifies a resource.
+    </p>
+
+    <p>
+        URLs <em>locate</em> a resource.
+        They designate a protocol to use when looking up the resource and provide an &quot;address&quot; for finding the resource within that scheme.
+        For example, the URL <code><a href="http://tools.ietf.org/html/rfc3986">http://tools.ietf.org/html/rfc3986</a></code> tells the consumer (most likely a web browser)
+        to use the HTTP protocol to access whatever site is found at the <code>/html/rfc3986</code> path of <code>tools.ietf.org</code>.
+        URLs are not permanent; it is possible that in the future the IETF will move to a different domain or even that some other organization will acquire the rights to <code>tools.ietf.org</code>.
+        It is also possible that multiple URLs may locate the same resource;
+        for example, an admin at the IETF might be able to access the document found at the example URL via the <code>ftp://</code> protocol.
+    </p>
+
+    <h2>URLs and URNs in URI.js</h2>
+
+    <p>
+        The distinction between URLs and URNs is one of <strong>semantics</strong>.
+        In principle, it is impossible to tell, on a purely syntactical level, whether a given URI is a URN or a URL without knowing more about its scheme.
+        Practically speaking, however, URIs that look like HTTP URLs (scheme is followed by a colon and two slashes, URI has an authority component, and paths are delimited by slashes) tend to be URLs,
+        and URIs that look like RFC 2141 URNs (scheme is followed by a colon, no authority component, and paths are delimited by colons) tend to be URNs (in the broad sense of &quot;URIs that name&quot;).
+    </p>
+
+    <p>
+        So, for the purposes of URI.js, the distinction between URLs and URNs is treated as one of <strong>syntax</strong>.
+        The main functional differences between the two are that (1) URNs will not have an authority element and
+        (2) when breaking the path of the URI into segments, the colon will be used as the delimiter rather than the slash.
+        The most surprising result of this is that <code>mailto:</code> URLs will be considered by URI.js to be URNs rather than URLs.
+        That said, the functional differences will not adversely impact the handling of those URLs.
     </p>
 
     <h2 id="components">Components of an URI</h2>
@@ -108,7 +138,7 @@ <h3 id="components-urn">Components of an <abbr title="Uniform Resource Name">URN
 </span>  <a href="docs.html#accessors-protocol">scheme</a>       <a href="docs.html#accessors-pathname">path</a> &amp; <a href="docs.html#accessors-segment">segment</a>         <a href="docs.html#accessors-search">query</a>   <a href="docs.html#accessors-hash">fragment</a>
     </pre>
 
-    <p>While <a href="http://tools.ietf.org/html/rfc3986">RFC 3986</a> does not define URNs having a query or fragment component, URI.js enables these accessors for convenience.</p>
+    <p>While <a href="http://tools.ietf.org/html/rfc2141">RFC 2141</a> does not define URNs having a query or fragment component, URI.js enables these accessors for convenience.</p>
 
     <h2 id="problems">URLs - Man Made Problems</h2>
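
The syntactic rule described in the new "URLs and URNs in URI.js" section can be illustrated with a small standalone sketch. It is a simplification under stated assumptions, not the library's parser: `looksLikeUrn` and `splitSegments` are hypothetical helper names introduced here for illustration only.

```javascript
// Simplified sketch; `looksLikeUrn` and `splitSegments` are hypothetical
// helpers, not functions exported by URI.js.
function looksLikeUrn(uri) {
  // A scheme followed by a colon that is NOT followed by "//" (no authority).
  return /^[a-z][a-z0-9.+-]*:(?!\/\/)/i.test(String(uri));
}

function splitSegments(path, isUrn) {
  // URN-style paths are delimited by colons, URL-style paths by slashes.
  return String(path).split(isUrn ? ':' : '/');
}

looksLikeUrn('urn:isbn:0201896834');                // true
looksLikeUrn('mailto:hello@example.org');           // true (treated as URN-like)
looksLikeUrn('http://tools.ietf.org/html/rfc3986'); // false
splitSegments('isbn:0201896834', true);             // ['isbn', '0201896834']
splitSegments('/html/rfc3986', false);              // ['', 'html', 'rfc3986']
```

Note that `mailto:` falls on the URN side of this heuristic, which matches the surprise called out in the text.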
 

diff --git a/docs.html b/docs.html
@@ -579,7 +579,7 @@ <h3 id="is">is()</h3>
     <dl>
         <dt><code>relative</code></dt><dd><code>true</code> if URL doesn't have a hostname</dd>
         <dt><code>absolute</code></dt><dd><code>true</code> if URL has a hostname</dd>
-        <dt><code>urn</code></dt><dd><code>true</code> if URI is a URN</dd>
+        <dt><code>urn</code></dt><dd><code>true</code> if URI looks like a URN</dd>
         <dt><code>url</code></dt><dd><code>true</code> if URI is a URL</dd>
         <dt><code>domain</code>, <code>name</code></dt><dd><code>true</code> if hostname is not an IP</dd>
         <dt><code>sld</code></dt><dd><code>true</code> if hostname is a second level domain (i.e. "example.co.uk")</dd>

diff --git a/index.html b/index.html
@@ -146,10 +146,10 @@ <h2>Examples</h2>
 // required src/URI.fragmentURI.js to be loaded</pre>
 
     <p>How do you like parsing URNs?</p>
-    <pre class="prettyprint lang-js">var uri = URI("mailto:hello@example.org?subject=hello");
-uri.protocol() == "mailto";
-uri.path() == "hello@example.org";
-uri.query() == "subject=hello";</pre>
+    <pre class="prettyprint lang-js">var uri = URI("urn:uuid:c5542ab6-3d96-403e-8e6b-b8bb52f48d9a?query=string");
+uri.protocol() == "urn";
+uri.path() == "uuid:c5542ab6-3d96-403e-8e6b-b8bb52f48d9a";
+uri.query() == "query=string";</pre>
 
     <p>How do you like URI Templating?</p>
     <pre class="prettyprint lang-js">URI.expand("/foo/{dir}/{file}", {

diff --git a/src/URI.js b/src/URI.js
@@ -324,6 +324,42 @@
           '%3D': '='
         }
       }
+    },
+    urnpath: {
+      // The characters under `encode` are the characters called out by RFC 2141 as being acceptable
+      // for usage in a URN. RFC2141 also calls out "-", ".", and "_" as acceptable characters, but
+      // these aren't encoded by encodeURIComponent, so we don't have to call them out here. Also
+      // note that the colon character is not featured in the encoding map; this is because URI.js
+      // gives the colons in URNs semantic meaning as the delimiters of path segments, and so it
+      // should not appear unencoded in a segment itself.
+      // See also the note above about RFC3986 and capitalized hex digits.
+      encode: {
+        expression: /%(21|24|27|28|29|2A|2B|2C|3B|3D|40)/ig,
+        map: {
+          '%21': '!',
+          '%24': '$',
+          '%27': '\'',
+          '%28': '(',
+          '%29': ')',
+          '%2A': '*',
+          '%2B': '+',
+          '%2C': ',',
+          '%3B': ';',
+          '%3D': '=',
+          '%40': '@'
+        }
+      },
+      // These characters are the characters called out by RFC2141 as "reserved" characters that
+      // should never appear in a URN, plus the colon character (see note above).
+      decode: {
+        expression: /[\/\?#:]/g,
+        map: {
+          '/': '%2F',
+          '?': '%3F',
+          '#': '%23',
+          ':': '%3A'
+        }
+      }
     }
   };
   URI.encodeQuery = function(string, escapeQuerySpace) {
@@ -350,22 +386,6 @@
       return string;
     }
   };
-  URI.recodePath = function(string) {
-    var segments = (string + '').split('/');
-    for (var i = 0, length = segments.length; i < length; i++) {
-      segments[i] = URI.encodePathSegment(URI.decode(segments[i]));
-    }
-
-    return segments.join('/');
-  };
-  URI.decodePath = function(string) {
-    var segments = (string + '').split('/');
-    for (var i = 0, length = segments.length; i < length; i++) {
-      segments[i] = URI.decodePathSegment(segments[i]);
-    }
-
-    return segments.join('/');
-  };
   // generate encode/decode path functions
   var _parts = {'encode':'encode', 'decode':'decode'};
   var _part;
@@ -387,8 +407,40 @@
 
   for (_part in _parts) {
     URI[_part + 'PathSegment'] = generateAccessor('pathname', _parts[_part]);
+    URI[_part + 'UrnPathSegment'] = generateAccessor('urnpath', _parts[_part]);
   }
 
+  var generateSegmentedPathFunction = function(_sep, _codingFuncName, _innerCodingFuncName) {
+    return function(string) {
+      // Why pass in names of functions, rather than the function objects themselves? The
+      // definitions of some functions (but in particular, URI.decode) will occasionally change due
+      // to URI.js having ISO8859 and Unicode modes. Passing in the name and getting it will ensure
+      // that the functions we use here are "fresh".
+      var actualCodingFunc;
+      if (!_innerCodingFuncName) {
+        actualCodingFunc = URI[_codingFuncName];
+      } else {
+        actualCodingFunc = function(string) {
+          return URI[_codingFuncName](URI[_innerCodingFuncName](string));
+        };
+      }
+
+      var segments = (string + '').split(_sep);
+
+      for (var i = 0, length = segments.length; i < length; i++) {
+        segments[i] = actualCodingFunc(segments[i]);
+      }
+
+      return segments.join(_sep);
+    };
+  };
+
+  // This takes place outside the above loop because we don't want, e.g., encodeUrnPath functions.
+  URI.decodePath = generateSegmentedPathFunction('/', 'decodePathSegment');
+  URI.decodeUrnPath = generateSegmentedPathFunction(':', 'decodeUrnPathSegment');
+  URI.recodePath = generateSegmentedPathFunction('/', 'encodePathSegment', 'decode');
+  URI.recodeUrnPath = generateSegmentedPathFunction(':', 'encodeUrnPathSegment', 'decode');
+
   URI.encodeReserved = generateAccessor('reserved', 'encode');
 
   URI.parse = function(string, parts) {
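
The comment in `generateSegmentedPathFunction` explains why function names are passed instead of function objects. The difference can be seen in a standalone sketch; `Registry`, `bindByName` and `bindByReference` are hypothetical names used only for this illustration and do not exist in URI.js.

```javascript
// Hypothetical stand-in for the URI namespace, whose coding functions
// may be swapped out at runtime.
var Registry = {
  decodeSegment: function(s) { return decodeURIComponent(s); }
};

// Looks the function up by name on every call, so later swaps are honored.
function bindByName(sep, funcName) {
  return function(string) {
    return String(string).split(sep).map(function(segment) {
      return Registry[funcName](segment);
    }).join(sep);
  };
}

// Captures the function object once, so later swaps are NOT seen.
function bindByReference(sep, func) {
  return function(string) {
    return String(string).split(sep).map(function(segment) {
      return func(segment);
    }).join(sep);
  };
}

var byName = bindByName(':', 'decodeSegment');
var byReference = bindByReference(':', Registry.decodeSegment);

// Swap the implementation, as happens when the coding mode changes.
Registry.decodeSegment = function(s) { return s.toUpperCase(); };

byName('a%20b:c');      // 'A%20B:C' (uses the swapped-in function)
byReference('a%20b:c'); // 'a b:c'   (still uses the original function)
```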
@@ -946,9 +998,13 @@
   p.pathname = function(v, build) {
     if (v === undefined || v === true) {
       var res = this._parts.path || (this._parts.hostname ? '/' : '');
-      return v ? URI.decodePath(res) : res;
+      return v ? (this._parts.urn ? URI.decodeUrnPath : URI.decodePath)(res) : res;
     } else {
-      this._parts.path = v ? URI.recodePath(v) : '/';
+      if (this._parts.urn) {
+        this._parts.path = v ? URI.recodeUrnPath(v) : '';
+      } else {
+        this._parts.path = v ? URI.recodePath(v) : '/';
+      }
       this.build(!build);
       return this;
     }
@@ -1624,6 +1680,7 @@
     if (this._parts.urn) {
       return this
         .normalizeProtocol(false)
+        .normalizePath(false)
         .normalizeQuery(false)
         .normalizeFragment(false)
         .build();
@@ -1670,16 +1727,22 @@
     return this;
   };
   p.normalizePath = function(build) {
+    var _path = this._parts.path;
+    if (!_path) {
+      return this;
+    }
+
     if (this._parts.urn) {
+      this._parts.path = URI.recodeUrnPath(this._parts.path);
+      this.build(!build);
       return this;
     }
 
-    if (!this._parts.path || this._parts.path === '/') {
+    if (this._parts.path === '/') {
       return this;
     }
 
     var _was_relative;
-    var _path = this._parts.path;
     var _leadingParents = '';
     var _parent, _pos;
 
@@ -1763,9 +1826,12 @@
 
     URI.encode = escape;
     URI.decode = decodeURIComponent;
-    this.normalize();
-    URI.encode = e;
-    URI.decode = d;
+    try {
+      this.normalize();
+    } finally {
+      URI.encode = e;
+      URI.decode = d;
+    }
     return this;
   };
 
@@ -1776,9 +1842,12 @@
 
     URI.encode = strictEncodeURIComponent;
     URI.decode = unescape;
-    this.normalize();
-    URI.encode = e;
-    URI.decode = d;
+    try {
+      this.normalize();
+    } finally {
+      URI.encode = e;
+      URI.decode = d;
+    }
     return this;
   };
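
The last two hunks wrap `this.normalize()` in `try`/`finally` so that the globally swapped encode/decode functions are restored even when normalization throws. A minimal sketch of that pattern follows, assuming a hypothetical `Coders` object in place of the `URI` namespace and a hypothetical `withTemporaryCoders` helper.

```javascript
// Hypothetical stand-in for the global URI.encode / URI.decode pair.
var Coders = { encode: encodeURIComponent, decode: decodeURIComponent };

function withTemporaryCoders(encode, decode, work) {
  var e = Coders.encode;
  var d = Coders.decode;
  Coders.encode = encode;
  Coders.decode = decode;
  try {
    return work();
  } finally {
    // Runs on both normal return and exception, so the globals never leak.
    Coders.encode = e;
    Coders.decode = d;
  }
}

var threw = false;
try {
  withTemporaryCoders(escape, decodeURIComponent, function() {
    throw new Error('normalize failed');
  });
} catch (err) {
  threw = true;
}
// Coders.encode is encodeURIComponent again, despite the exception.
```

Without the `finally`, an exception inside the callback would leave the temporary functions installed for every later caller.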
 

diff --git a/test/test.js b/test/test.js
@@ -236,6 +236,20 @@
     equal(u.pathname(), '/', 'empty absolute path');
     equal(u.toString(), '/', 'empty absolute path to string');
   });
+  test('URN paths', function() {
+    var u = new URI('urn:uuid:6e8bc430-9c3a-11d9-9669-0800200c9a66?foo=bar');
+    u.pathname('uuid:de305d54-75b4-431b-adb2-eb6b9e546013');
+    equal(u.pathname(), 'uuid:de305d54-75b4-431b-adb2-eb6b9e546013');
+    equal(u + '', 'urn:uuid:de305d54-75b4-431b-adb2-eb6b9e546013?foo=bar');
+
+    u.pathname('');
+    equal(u.pathname(), '', 'changing pathname ""');
+    equal(u+'', 'urn:?foo=bar', 'changing url ""');
+
+    u.pathname('music:classical:Béla Bártok%3a Concerto for Orchestra');
+    equal(u.pathname(), 'music:classical:B%C3%A9la%20B%C3%A1rtok%3A%20Concerto%20for%20Orchestra', 'path encoding');
+    equal(u.pathname(true), 'music:classical:Béla Bártok%3A Concerto for Orchestra', 'path decoded');
+  });
   test('query', function() {
     var u = new URI('http://example.org/foo.html');
     u.query('foo=bar=foo');
@@ -1050,6 +1064,20 @@
     u = URI('/../../../../../www/common/js/app/../../../../www_test/common/js/app/views/view-test.html');
     u.normalize();
     equal(u.path(), '/www_test/common/js/app/views/view-test.html', 'parent absolute');
+
+    // URNs
+    u = URI('urn:people:authors:poets:Shel Silverstein');
+    u.normalize();
+    equal(u.path(), 'people:authors:poets:Shel%20Silverstein');
+
+    u = URI('urn:people:authors:philosophers:Søren Kierkegaard');
+    u.normalize();
+    equal(u.path(), 'people:authors:philosophers:S%C3%B8ren%20Kierkegaard');
+
+    // URNs path separator preserved
+    u = URI('urn:games:cards:Magic%3A the Gathering');
+    u.normalize();
+    equal(u.path(), 'games:cards:Magic%3A%20the%20Gathering');
   });
   test('normalizeQuery', function() {
     var u = new URI('http://example.org/foobar.html?');
@@ -1559,6 +1587,7 @@
 
     equal(URI.decodeQuery('%%20'), '%%20', 'malformed URI component returned');
     equal(URI.decodePathSegment('%%20'), '%%20', 'malformed URI component returned');
+    equal(URI.decodeUrnPathSegment('%%20'), '%%20', 'malformed URN component returned');
   });
   test('encodeQuery', function() {
     var escapeQuerySpace = URI.escapeQuerySpace;
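
The URN expectations in the tests above can be reproduced with a standalone approximation of the new helpers. This is a sketch that assumes the default (Unicode) coding mode. It reuses the character sets from the `urnpath` entry in `src/URI.js` but does not call the library, and the local functions only mirror the names of the `URI.*` methods.

```javascript
// Characters that encodeURIComponent escapes but the urnpath map leaves literal.
var URN_KEEP_LITERAL = /%(21|24|27|28|29|2A|2B|2C|3B|3D|40)/ig;
// Reserved characters that must stay escaped inside a decoded segment.
var URN_RESERVED = { '/': '%2F', '?': '%3F', '#': '%23', ':': '%3A' };

function strictEncode(string) {
  // encodeURIComponent leaves !'()* alone; escape them as well.
  return encodeURIComponent(string).replace(/[!'()*]/g, function(c) {
    return '%' + c.charCodeAt(0).toString(16).toUpperCase();
  });
}

function encodeUrnPathSegment(string) {
  return strictEncode(string + '').replace(URN_KEEP_LITERAL, function(match) {
    return decodeURIComponent(match);
  });
}

function decodeUrnPathSegment(string) {
  try {
    return decodeURIComponent(string + '').replace(/[\/?#:]/g, function(c) {
      return URN_RESERVED[c];
    });
  } catch (e) {
    return string; // malformed input is returned unchanged
  }
}

function mapSegments(string, fn) {
  return (string + '').split(':').map(function(segment) {
    return fn(segment);
  }).join(':');
}

function decodeUrnPath(string) {
  return mapSegments(string, decodeUrnPathSegment);
}

function recodeUrnPath(string) {
  return mapSegments(string, function(segment) {
    return encodeUrnPathSegment(decodeURIComponent(segment));
  });
}

recodeUrnPath('games:cards:Magic%3A the Gathering');
// 'games:cards:Magic%3A%20the%20Gathering'
```

The escaped colon survives the round trip because `%3A` is absent from the keep-literal list, so a colon inside a segment is never confused with a segment delimiter.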