encoding issue with gb2312 data #87

axtens · 2013-05-21T04:43:07Z

Context: Microsoft JScript on Windows Server 2008 R2 64bit

var url = "http://www.google.com.hk/search?q=pennytel%20downloads&sa=%20%CB%D1%20%CB%F7%20&forid=1&prog=aff&ie=GB2312&oe=GB2312&safe=active&source=sdo_sb_html&hl=zh-CN";
var x = new URI(url);
var rMap = x.search(true);

When the .search is executed I get

Microsoft JScript runtime error: The URI to be decoded is not a valid encoding

The break occurs here

d.decodeQuery = function (a) {
    return d.decode((a + "").replace(/\+/g, "%20"))
};

and it's probably complaining about the "sa=%20%CB%D1%20%CB%F7%20". What's amiss here? Is it fixable? Is it an encoding issue or something else?


rodneyrehm · 2013-05-21T08:12:39Z

I can reproduce the issue in Firefox 21 on Mac. The problem is the sequence %CB%D1: it can't be decoded by decodeURIComponent().

decodeURIComponent() expects UTF-8 escape sequences and throws if it can't resolve the input. With unescape(), the sequence resolves to ËÑ, which would properly be percent-encoded (as UTF-8) as %C3%8B%C3%91.

Can you check what characters this sequence should resolve to? Can you make sure that the data is UTF-8?
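
A minimal sketch of the behaviour described above, using only built-in functions (the variable names are illustrative and not part of URI.js). It shows the strict UTF-8 decoder rejecting the sequence while unescape() maps each byte to a single Latin-1 character:

```javascript
// %CB%D1 is not a valid UTF-8 byte sequence: 0xCB starts a two-byte
// sequence, but 0xD1 is not a continuation byte.
var raw = "%CB%D1";

var strictResult;
try {
  strictResult = decodeURIComponent(raw);
} catch (e) {
  // decodeURIComponent() throws a URIError on malformed input
  strictResult = e.name;
}

// unescape() turns each %XX into the character U+00XX
var latin1 = unescape(raw);

// Re-encoding those two characters as UTF-8 percent-escapes
var reencoded = encodeURIComponent(latin1);

console.log(strictResult); // "URIError"
console.log(latin1);       // "ËÑ"
console.log(reencoded);    // "%C3%8B%C3%91"
```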

axtens · 2013-05-22T08:41:55Z

As far as I can tell, given the ie and oe variables (&ie=GB2312&oe=GB2312), the characters are GB2312-encoded Chinese characters. If I store ËÑË÷ in a text file and, using BabelPad, read them in as GB2312, I get 脣脩脣梅. That expressed as UTF-8 is, in hex, E8 84 A3 E8 84 A9 E8 84 A3 E6 A2 85.

Now, how to deal with this is tricky because the original url has come into our website via Google Hong Kong so we have no way of controlling how the data is encoded. Do I change URI.js to use unescape? At the moment, I run every url through unescape() anyway so that URI.js doesn't crash on the weird ones.
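
One way to check what the sa value contains is to skip unescape() and hand the raw bytes to a GB2312 decoder. The sketch below assumes an environment that provides TextDecoder with the "gb2312" label (modern browsers, or Node.js built with full ICU); classic JScript does not have it. percentToBytes is a hypothetical helper, not part of URI.js, and it assumes any unescaped characters are plain ASCII. Read this way, the bytes should come out as 搜索 ("search") with spaces around each character.

```javascript
// Hypothetical helper: convert a percent-encoded string to raw byte values.
// Assumes characters that are not part of a %XX escape are plain ASCII.
function percentToBytes(input) {
  var bytes = [];
  for (var i = 0; i < input.length; i++) {
    var hex = input.substr(i + 1, 2);
    if (input.charAt(i) === "%" && /^[0-9A-Fa-f]{2}$/.test(hex)) {
      bytes.push(parseInt(hex, 16));
      i += 2; // skip the two hex digits just consumed
    } else {
      bytes.push(input.charCodeAt(i) & 0xff);
    }
  }
  return bytes;
}

// The sa parameter from the URL in the original report
var saBytes = percentToBytes("%20%CB%D1%20%CB%F7%20");

// Interpret the bytes as GB2312 instead of Latin-1 or UTF-8
var saText = new TextDecoder("gb2312").decode(new Uint8Array(saBytes));

console.log(JSON.stringify(saText));
```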

rodneyrehm · 2013-05-22T08:52:00Z

Well, URI.js supports a UTF-8 mode and an ISO 8859 mode. You could easily wrap things:

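URI.prototype.getQueryParameters = function() {
  // work on a copy that contains only the query string
  var uri = URI(this.search());
  try {
    // default mode: escapes are decoded as UTF-8
    return uri.search(true);
  } catch(e) {
    // fallback: treat the escapes as ISO 8859 and convert them to Unicode
    return uri.unicode().search(true);
  }
};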

yielding: URI('?a=%CB%D1').getQueryParameters() returns { a: "ËÑ" }

I'm not sure if I'd want this to happen automatically, internally, without the implementor even noticing…

rodneyrehm · 2013-05-27T13:05:15Z

See #92 as well

axtens · 2013-06-28T05:28:59Z

This issue's popped up again and I'm trying to figure out how to get around it.
The URL in this case is
var url = "http://www.google.com.hk/search?q=pennytel downloads&sa= %CB%D1 %CB%F7 &forid=1&prog=aff&ie=GB2312&oe=GB2312&safe=active&source=sdo_sb_html&hl=zh-CN";
and the code that breaks (with the same error and error location as above) is

var uri = new URI(url); 
//...
var uQuery = uri.clone().setQuery("");

It's the setQuery that's failing. How do I set my query to nothing without using setQuery()?

axtens · 2013-06-28T05:33:54Z

Ok, simple answer: var uQuery = uri.clone().search("");


rodneyrehm · 2013-08-03T16:57:03Z

I've fixed this in master; it will be included in the next release. Thank you for your help!

QueryString data that cannot be decoded will now simply be returned undecoded - that way any decodable data can still be of use.
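
The behaviour described here can be sketched roughly as follows. This is an illustration only, not the actual URI.js source: decodeQueryLeniently is a hypothetical name, and it uses decodeURIComponent() directly rather than the library's configurable decoder.

```javascript
// Hypothetical helper: decode a query-string component, but hand back the
// original text untouched when it is not valid UTF-8 percent-encoding.
function decodeQueryLeniently(value) {
  var text = String(value);
  try {
    // "+" stands for a space in query strings
    return decodeURIComponent(text.replace(/\+/g, "%20"));
  } catch (e) {
    // malformed input (for example GB2312 bytes): give up and return it as-is
    return text;
  }
}

console.log(decodeQueryLeniently("pennytel+downloads")); // "pennytel downloads"
console.log(decodeQueryLeniently("%CB%D1"));             // "%CB%D1"
```

With this approach the parameters that do decode cleanly (q, ie, oe and so on) remain usable, and the caller can decide separately what to do with the undecoded sa value.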

rodneyrehm added a commit that referenced this issue Aug 3, 2013

fixing crashing of URI.decodeQuery() on malformed input - closes #87, closes #92

fd8ee89

rodneyrehm closed this as completed Aug 3, 2013
