Incorrect result for Buffer#toString #6075

Closed
kuldeepaggarwal opened this issue Apr 6, 2016 · 6 comments · May be fixed by adamlaska/node#38
Labels
buffer Issues and PRs related to the buffer subsystem. invalid Issues and PRs that are invalid.

Comments

@kuldeepaggarwal

  • Version: v5.10.1
  • Platform: Darwin KD 14.5.0 Darwin Kernel Version 14.5.0: Tue Sep 1 21:23:09 PDT 2015; root:xnu-2782.50.1~1/RELEASE_X86_64 x86_64
  • Subsystem: Buffer.js

Hello Team,

I am using an older version (0.10.40) in one of my projects and am facing an issue when grouping data of type Buffer. The issue is also present in the latest version.

Issue

this.process.version // => 'v5.10.1'
var a = new Buffer([217, 132, 45, 138, 77, 111, 17, 228, 138, 121, 0, 80, 86, 59, 49, 192])
var b = new Buffer([217, 132, 45, 180, 77, 111, 17, 228, 138, 121, 0, 80, 86, 59, 49, 192])
a.toString() // => 'ل-�Mo\u0011��y\u0000PV;1�'
b.toString() // => 'ل-�Mo\u0011��y\u0000PV;1�'
a.toString('hex') // => 'd9842d8a4d6f11e48a790050563b31c0'
b.toString('hex') // => 'd9842db44d6f11e48a790050563b31c0'

If you look carefully, the hex values of the two variables differ, but their UTF-8 string values are exactly the same, which is what is causing the problem for us.

Actual Use Case

users = [
  {
    uuid: new Buffer([217, 132, 45, 138, 77, 111, 17, 228, 138, 121, 0, 80, 86, 59, 49, 192]),
  },
  {
    uuid: new Buffer([217, 132, 45, 180, 77, 111, 17, 228, 138, 121, 0, 80, 86, 59, 49, 192]),
  }
]

posts = [
  { title: 'First Post', user_uuid: new Buffer([217, 132, 45, 138, 77, 111, 17, 228, 138, 121, 0, 80, 86, 59, 49, 192]) },
  { title: 'Second Post', user_uuid: new Buffer([217, 132, 45, 180, 77, 111, 17, 228, 138, 121, 0, 80, 86, 59, 49, 192]) }
]

_.groupBy(posts, function(post) {
  return post.user_uuid; // written in other library, like: Bookshelf
})

Expected Result

We should have 2 keys, as the user_uuid values of the posts are different.

Actual Result

All the posts are grouped under the same key, because #toString() returns the same value for both Buffer objects.

Fix

Buffer#toString should default to hex encoding for its output.

Please let me know if I am on the wrong path, have misunderstood something, or whether this should be fixed in the other libraries instead. If you think I am on the right path and it should be fixed here, I can raise a PR for it.

@bnoordhuis
Member

The issue is that both input buffers contain invalid byte sequences that get substituted with the replacement character, U+FFFD. That's why the UTF-8 strings are the same: the replacements are in the same locations. The hexadecimal representations still differ because they show the raw bytes.

I'll close, Buffer#toString() is working as expected and documented in this case.
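Since the default decoding stays UTF-8, one way to get distinct grouping keys is to convert each buffer to hex explicitly before grouping. The following is only a minimal sketch: `groupByHex` is a hypothetical helper written for illustration, not part of Node.js, lodash, or Bookshelf, and it uses `Buffer.from` rather than the `new Buffer` constructor shown in the report.

```javascript
// Hypothetical helper (not a Node.js or lodash API): groups items by the
// hex form of a Buffer-valued property, so different bytes give different keys.
function groupByHex(items, prop) {
  const groups = {};
  for (const item of items) {
    const key = item[prop].toString('hex'); // lossless, unlike the default 'utf8'
    (groups[key] = groups[key] || []).push(item);
  }
  return groups;
}

// Same data as in the report above.
const posts = [
  { title: 'First Post', user_uuid: Buffer.from([217, 132, 45, 138, 77, 111, 17, 228, 138, 121, 0, 80, 86, 59, 49, 192]) },
  { title: 'Second Post', user_uuid: Buffer.from([217, 132, 45, 180, 77, 111, 17, 228, 138, 121, 0, 80, 86, 59, 49, 192]) }
];

const grouped = groupByHex(posts, 'user_uuid');
console.log(Object.keys(grouped)); // two distinct hex keys, one per user
```

With lodash, the same idea would presumably amount to returning `post.user_uuid.toString('hex')` from the `_.groupBy` callback.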

@bnoordhuis bnoordhuis added invalid Issues and PRs that are invalid. buffer Issues and PRs related to the buffer subsystem. labels Apr 6, 2016
@kuldeepaggarwal
Author

@bnoordhuis I don't understand where the input buffers contain invalid characters. Can you please point out where the input is wrong?

@bnoordhuis
Member

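For example, in the sequence [217, 132, 45, 138, 77...], 138 is not a valid starting point for a UTF-8 character sequence because the first byte of a sequence always has the two most significant bits set (EDIT: except for single-byte characters, of course.) 138 & 192 is 128 when it should be 192 (because 192 == 128 + 64.) A result of 128 marks a continuation byte, which is only valid after a lead byte, and here it follows 45, a complete single-byte character.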
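The mask check described above can be sketched as follows. The helper names are hypothetical and exist only for this illustration. Note that the lead-byte test is a necessary condition, not a sufficient one: some bytes with both top bits set (192 itself, for instance) can never start a valid sequence.

```javascript
// Hypothetical helpers for illustration; 0xC0 === 192 and 0x80 === 128.

// 10xxxxxx: a continuation byte, valid only after a lead byte.
function isContinuationByte(byte) {
  return (byte & 0xC0) === 0x80;
}

// 11xxxxxx: has the bit pattern of a multi-byte lead byte.
// This checks the top two bits only; it does not prove the byte is a legal lead.
function hasLeadBytePattern(byte) {
  return (byte & 0xC0) === 0xC0;
}

console.log(138 & 192);               // 128
console.log(isContinuationByte(138)); // true: cannot start a character
console.log(isContinuationByte(180)); // true: same problem in the second buffer
console.log(hasLeadBytePattern(217)); // true: 217, 132 form a two-byte sequence
console.log(isContinuationByte(45));  // false: 45 is a single-byte character
```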

@kuldeepaggarwal
Author

I apologize, but I still don't understand the concept. Can you please point me to a reference where I can read about Buffer in detail, so that I can understand what you meant?

I might look dumb here.

@bnoordhuis
Member

The no-argument version of buf.toString() interprets the bytes in the buffer as UTF-8 and turns that into a string. One or more bytes make up a character; https://en.wikipedia.org/wiki/UTF-8#Description explains what those byte sequences look like. Not all sequences are valid; those are replaced with a U+FFFD character.
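As a small illustration of that behavior, the sketch below decodes the two buffers from the report and shows that the resulting strings are equal and that the conversion is lossy. It assumes a Node.js version that provides `Buffer.from`; the variable names are chosen for this example only.

```javascript
const a = Buffer.from([217, 132, 45, 138, 77, 111, 17, 228, 138, 121, 0, 80, 86, 59, 49, 192]);
const b = Buffer.from([217, 132, 45, 180, 77, 111, 17, 228, 138, 121, 0, 80, 86, 59, 49, 192]);

const textA = a.toString(); // same as a.toString('utf8')
const textB = b.toString();

console.log(textA === textB);          // true: 138 and 180 both decode to U+FFFD
console.log(textA.includes('\uFFFD')); // true: invalid bytes were replaced

// The original bytes cannot be recovered from the decoded string,
// because U+FFFD is re-encoded as its own three bytes (EF BF BD).
console.log(Buffer.from(textA, 'utf8').equals(a)); // false

// Hex maps every byte to two characters, so it keeps the buffers distinct.
console.log(a.toString('hex') === b.toString('hex')); // false
```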

Hope that clears it up.

@kuldeepaggarwal
Author

Thanks a lot @bnoordhuis 💙 💛 💚 💜
