feat(ustring): ustring hash collision protection #4350

lgritz · 2024-07-21T18:12:27Z

The gist is that the ustring::strhash(str) function is modified to strip out the MSB from Strutil::strhash. The rep entry is filed in the ustring table based on this hash. So effectively, the computed hash is 63 bits, not 64.

But rep->hashed field consists of the lower 63 bits being the computed hash, and the MSB indicates whether this is the 2nd (or more) entry in the table that had the same 63 bit hash.

ustring::hash() then is modified as follows: If the MSB is 0, the computed hash is the hash. If the MSB is 1, though, we DON'T use that hash, and instead we use the pointer to the unique characters, but with the MSB set (that's an invalid address by itself). Note that the computed hashes never have MSB set, and the char*+MSB always have MSB set, so therefore ustring::hash() will never have the same value for two different ustrings.

But -- please note! -- that ustring::strhash(str) and ustring(str).hash() will only match (and also be the same value on every execution) if the ustring is the first to receive that hash, which should be approximately always. Probably always, in practice.

But in the very improbable case of a hash collision, one of them (the second to be turned into a ustring) will be using the alternate hash based on the character address, which is both not the same as ustring::strhash(chars), nor is it expected to be the same constant on every program execution.

The gist is that the ustring::strhash(str) function is modified to strip out the MSB from Strutil::strhash. The rep entry is filed in the ustring table based on this hash. So effectively, the computed hash is 63 bits, not 64. But rep->hashed field consists of the lower 63 bits being the computed hash, and the MSB indicates whether this is the 2nd (or more) entry in the table that had the same 63 bit hash. ustring::hash() then is modified as follows: If the MSB is 0, the computed hash is the hash. If the MSB is 1, though, we DON'T use that hash, and instead we use the pointer to the unique characters, but with the MSB set (that's an invalid address by itself). Note that the computed hashes never have MSB set, and the char*+MSB always have MSB set, so therefore ustring::hash() will never have the same value for two different ustrings. But -- please note! -- that ustring::strhash(str) and ustring(str).hash() will only match (and also be the same value on every execution) if the ustring is the first to receive that hash, which should be approximately always. Probably always, in practice. But in the very improbable case of a hash collision, one of them (the second to be turned into a ustring) will be using the alternate hash based on the character address, which is both not the same as ustring::strhash(chars), nor is it expected to be the same constant on every program execution. Signed-off-by: Larry Gritz <lg@larrygritz.com>

lgritz · 2024-08-07T04:56:27Z

Comments @cmstein @sfriedmapixar @etheory ?

lgritz · 2024-08-07T04:56:57Z

Maybe @chellmuth too?

sfriedmapixar

I implemented this locally with the MSB polarity flipped from what you've presented. I did this it's easier to see (in a debugger) pointer-based perfect hashes are pointers, it doesn't require bit manip to deref them, and so and if it happens, you can just dive into those in a debugger without having to mask off the high bit. Semantically I think it's also cleaner to actually use that pointer value as the hashing value into the map, and store it in the tablerep. Really, just think of it as a way to create a perfect hash out of an imperfect one -- with the caveat that it assumes you're on a system with user-space pointers that can never have the MSB set. You could do the same thing with the LSB instead since the tablerep always comes out of a pool, by guaranteeing the alignment of those allocations wouldn't have the low bit set for pointers and that would work everywhere. If you just think of the collision resolution as part of the "hashing" function, then it makes sense to store that hash in the tablerep for it in the "hashed" member.

sfriedmapixar · 2024-08-07T16:09:09Z

src/include/OpenImageIO/ustring.h

+    static OIIO_HOSTDEVICE constexpr hash_t strhash(string_view str)
+    {
+        return Strutil::strhash(str) & hash_mask;
+    }


In our implementation, we do an OR instead of an and, so that "pointer hashes" look like pointers in a debugger, and it saves clearing the bit when we want to use it like a pointer, and "hash-hashes" are always negative when looking at them in a debugger.

Interesting. I think I was trying to preserve the "0 means empty string" property without needing extra conditionals, but maybe it's easier than I thought. I will noodle with this and see if I can make it work just as simply with the reversed polarity, because I do see your point about how it's nice for the pointers to be the unchanged values.

sfriedmapixar · 2024-08-07T16:10:26Z

src/include/OpenImageIO/ustring.h

+        hash_t h = rep()->hashed;
+        return OIIO_LIKELY((h & duplicate_bit) == 0)
+                   ? h
+                   : hash_t(m_chars) | duplicate_bit;


With a "pointer hash", you can actually just store that "hash" the "hashed" member on TableRep creation, then you don't need to do any of this hashing or checking at all, you just return the value like before.

sfriedmapixar · 2024-08-07T16:11:29Z

src/include/OpenImageIO/ustring.h

@@ -757,10 +776,29 @@ class OIIO_UTIL_API ustring {
        TableRep(string_view strref, hash_t hash);
        ~TableRep();
        const char* c_str() const noexcept { return (const char*)(this + 1); }
+        constexpr bool unique_hash() const
+        {
+            return (hashed & duplicate_bit) == 0;


If you flip the polarity of the indicator bit, this needs to flip.

sfriedmapixar · 2024-08-07T16:12:27Z

src/libutil/ustring.cpp

@@ -71,6 +105,10 @@ template<unsigned BASE_CAPACITY, unsigned POOL_SIZE> struct TableRepMap {

    const char* lookup(string_view str, uint64_t hash)
    {
+        if (OIIO_UNLIKELY(hash & ustring::duplicate_bit)) {
+            // duplicate bit is set -- the hash is related to the chars!
+            return OIIO::bitcast<const char*>(hash & ustring::hash_mask);


If you flip the polarity of the indicator bit, the extra & is uneeded.

sfriedmapixar · 2024-08-07T16:15:01Z

src/libutil/ustring.cpp

@@ -98,6 +135,10 @@ template<unsigned BASE_CAPACITY, unsigned POOL_SIZE> struct TableRepMap {
    // the hash.
    const char* lookup(uint64_t hash)
    {
+        if (OIIO_UNLIKELY(hash & ustring::duplicate_bit)) {
+            // duplicate bit is set -- the hash is related to the chars!
+            return OIIO::bitcast<const char*>(hash & ustring::hash_mask);


If you flip the polarity of the indicator bit, the extra & is uneeded.

sfriedmapixar · 2024-08-07T16:15:46Z

src/libutil/ustring.cpp

            ++dist;
            pos = (pos + dist) & mask;  // quadratic probing
        }
    }

    const char* insert(string_view str, uint64_t hash)
    {
+        OIIO_ASSERT((hash & ustring::duplicate_bit) == 0);  // can't happen?


If you flip the polarity of the indicator bit, flip this assert.

sfriedmapixar · 2024-08-07T16:30:06Z

src/libutil/ustring.cpp

+        // If we encountered another ustring with the same hash (if one
+        // exists, it would have hashed to the same address so we would have
+        // seen it), set the duplicate bit in the rep's hashed field.
+        if (duplicate_hash) {


If you change the polarity of the indicator bit, you can pass "duplicate_hash" into make_rep, and then it can just store the pointer from the pool alloc in the "hashed" member, and all you have to do here is inc the collisions counter (or the inc and dbug print could be moved into make_rep as well. -- if you do that, you need to do another pobe for a valid "pos" here based on the pointer bits rather than the original hash bits.

sfriedmapixar · 2024-08-07T16:30:26Z

src/libutil/ustring.cpp

+        // ensure low bit clear
+        OIIO_DASSERT((size_t(rep->c_str()) & 1) == 0);
+        // ensure low bit clear
+        OIIO_DASSERT((size_t(rep->c_str()) & ustring::duplicate_bit) == 0);


If you change the polarity of the indicator bit, you need to flip this assert.

sfriedmapixar · 2024-08-07T16:32:48Z

src/libutil/ustring.cpp

-    hash_t hash = Strutil::strhash64(strref);
-    // This line, if uncommented, lets you force lots of hash collisions:
-    // hash &= ~hash_t(0xffffff);
+    hash_t hash = ustring::strhash(strref);



You can use a similar "force a lot of collisions" to to locally test that the collision code is all correct, now that it's much more complicated with the ptr vs hash perfect hashing.

sfriedmapixar · 2024-08-07T16:34:06Z

src/libutil/ustring.cpp

-            collisions->emplace_back(ustring::from_unique(c.first));
-    return all_hash_collisions.size();
+    if (collisions) {
+        // Disabled for now


You don't need to keep a collection of all hash collisions if you put the pointer into the "hashed" member on collision...then you can just iterate through all the strings in the system, and print out those who's hash is actually a pointer.

etheory · 2024-08-08T05:52:14Z

I'm OK with any of these options as long as collisions are handled.
I do agree with @sfriedmapixar about the simplicity of debugging if the top bit is 1 for a hash and 0 for a string though.

Also remember that on every current system, the bottom few bits are zero due to alignment, and the top few are zero due to memory allocation guarantees: https://stackoverflow.com/questions/46509152/why-in-x86-64-the-virtual-address-are-4-bits-shorter-than-physical-48-bits-vs which means you can often rely on the top 12 to 16bits also being zero. Keep that in mind when you do any of these tricks. We use these bits in glimpse for storage of data, so it's quite useful.

lgritz · 2024-08-08T06:09:20Z

Something that has been nagging at my mind... I know that the MSB will always be 0 for a real address on x86, but do we feel confident that this will also be true for ARM, RISC V, or others that may come along?

I also considered using the LSB for this, also safe because the allocations are always aligned, the characters will never start on an odd address, so if we "|1" the hashes to set the low bit, then that's also a simple way to distinguish pointers from hashes and will be true of any future architecture.

Thoughts?

lgritz · 2024-08-08T06:12:09Z

The reason I really like empty string to be all 0 bits (and thus thought that must mean that it's pointers, not hashes, with the extra bit set) is so we can memset to 0 a region and know that we've made valid (empty) ustrings or ustringhashes.

ThiagoIze · 2024-08-08T22:41:45Z

I also considered using the LSB for this, also safe because the allocations are always aligned, the characters will never start on an odd address, so if we "|1" the hashes to set the low bit, then that's also a simple way to distinguish pointers from hashes and will be true of any future architecture.

Setting the LSB hash will reduce the hash quality in a risky way. Suppose for instance someone did: my_array[hash%my_array.length()] where the length is 8 (it could be any size, but this is simple to show). The hash%8 will result in just the three lowest signed bits being used. But note that your proposal has the LSB at 1, so now we just have 2 bits in play and so only half the array will be referenceable.

lgritz · 2024-08-08T23:10:42Z

Right, gotcha. That's a good reason to stick with MSB.

It's ok, we'll have asserts that no real address has the high bit set. We'll have faith that we'll catch it right away if anybody ports to a hardware platform that might use that part of the address space.

lgritz · 2024-08-09T22:03:29Z

OK, in my tree (not yet pushed here) I recoded it all to make the MSB 1 for "original" hashes, and 0 for rehashes (which are replaced by the char ptr as the return value for ustring.hash()). I believe it's all working, but there is a tradeoff involved that I'm not entirely happy with. Let me review the issues:

In the table (invisible to the user), each entry has a hashed field, and the chars themselves. The MSB of the hashed field reveals if this is the first entry with that hash (used to be 0, now is 1 in my new draft), or if it is a subsequently encountered string that had the same 31 bit hash. The public hash() function returns rep->hashed if it's a first string entry having that hash, or the char address if it's a duplicate hash.

A property we've always had that I think is very important to retain is that both ustring and ustringhash are represented by a true 0 value for empty strings. In other words, it's special -- the ptr is null and the hash should be 0. This is what lets us zero out a block of memory and know that it's initializing any ustring or ustringhash therein to be valid, and represent an empty string.

Now then, the way I did it first in this PR trivially preserved that property, because MSB was 1 only for duplicate hashes, so the empty string is canonical and hashes to 0. The alternate way, where it would "want" to be a MSB 1 for canonical hashes, requires special handling of the empty string case in a variety of places. I think this is a little confusing and error-prone, and certainly has a runtime cost for the extra testing and branching.

@ssteinbach and others pointed out that it's nice for the pointers to have the MSB=0 rather than =1, because then you can just cast it to a pointer to get to the characters, especially helpful when debugging.

But here's where I want to push back a little: Is it, really? The only case where you'd get a hash that's a pointer is for a 2nd string that had the same hash. This should be exceptionally rare, so it's not something that would come up... almost never... while in the debugger looking at a ustringhash. Usually you'd just see the hash, so there's not a direct way to get to the chars, so it doesn't matter if MSB is 0 or 1 for that case. (And if we set MSB on the pointer, that just means to get to the chars, you'd need to mask it... awkward, but how often does that really come up? You could also just CALL -- yes, in the debugger -- ustringhash.c_str() to get to the chars.)

So here's the crux of it:

MSB=1 for duplicate hash:

PRO: no special handling of empty string case to make the hash 0
CON: "pointer substituted for duplicate hash" needs to have MSB masked off to get to the chars, which is painful in the debugger but almost never happens.

MSB=0 for duplicate hash:

PRO: duplicate hashes trivially castble to pointers, easy in the debugger, but this almost never happens.
CON: extra operations and potentially a branch to ensure ustringhash("") == 0, in several places that actually execute frequently.

Thoughts? Are people willing to pay (slightly) in runtime perf to have this property of trivial cast of duplicate hashes to pointers in the debuger, that you'll almost never encounter in real life?

etheory · 2024-08-10T05:41:40Z

You know what, add a static function to do the cast in the debugger and you solve all the issues. Should be straight forward enough. Then you can just call the correct function in the debugger directly. Just document it clearly and that should resolve the issue no?

lgritz · 2024-08-10T14:46:57Z

You know what, add a static function to do the cast in the debugger and you solve all the issues.

Isn't that just the existing ustringhash.c_str()?

etheory · 2024-08-12T01:42:57Z

You know what, add a static function to do the cast in the debugger and you solve all the issues.

Isn't that just the existing ustringhash.c_str()?

If you only have the pointer for some reason, you cannot call the function on it (though I guess you could cast it....).
My point is that if you provide a free function to convert the bits for display/conversion, then you can just call it at any time in a debugger in any situation. Whether you have the object reference or not. If you can just cast the address and call the function, then sure, it's fine as is.

lgritz · 2024-08-12T05:16:33Z

If you have the pointer, you already have the pointer.

If you have a hash value h as a ustringhash, you get to the chars with h.c_str().

If you have a hash value v as a uint64, you get to the chars with ustringhash(v).c_str().

lgritz · 2024-08-15T17:53:24Z

Does my last answer assuage concerns, or do we need an additional utility function?

etheory · 2024-08-17T14:59:45Z

Yep all good thanks @lgritz - sorry for the slow turn around, I was on leave.

lgritz · 2024-08-25T20:02:17Z

@sfriedmapixar you had asked about the stab I took at swapping the bit to the other way as you originally suggested. If you're curious, it's here: https://github.com/lgritz/OpenImageIO/tree/lg-ustringhash2 (just one additional commit on top of what I have here in this PR)

I like this PR here better, as it avoids some extra logic (both potentially perf impacting, but maybe more importantly, error prone and easy to forget) that is necessary to special case empty string handing.

Let me know if you feel really strongly about doing it the other way. If you're ok with this, though, I'd like to merge it.

sfriedmapixar suggested changes Aug 7, 2024

View reviewed changes

Merge branch 'master' into lg-ustringhash

3fa0a3b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(ustring): ustring hash collision protection #4350

feat(ustring): ustring hash collision protection #4350

lgritz commented Jul 21, 2024

lgritz commented Aug 7, 2024

lgritz commented Aug 7, 2024

sfriedmapixar left a comment

sfriedmapixar Aug 7, 2024

lgritz Aug 8, 2024

sfriedmapixar Aug 7, 2024

sfriedmapixar Aug 7, 2024

sfriedmapixar Aug 7, 2024

sfriedmapixar Aug 7, 2024

sfriedmapixar Aug 7, 2024

sfriedmapixar Aug 7, 2024

sfriedmapixar Aug 7, 2024

sfriedmapixar Aug 7, 2024

sfriedmapixar Aug 7, 2024

etheory commented Aug 8, 2024

lgritz commented Aug 8, 2024

lgritz commented Aug 8, 2024

ThiagoIze commented Aug 8, 2024

lgritz commented Aug 8, 2024

lgritz commented Aug 9, 2024

etheory commented Aug 10, 2024

lgritz commented Aug 10, 2024

etheory commented Aug 12, 2024

lgritz commented Aug 12, 2024

lgritz commented Aug 15, 2024

etheory commented Aug 17, 2024

lgritz commented Aug 25, 2024

feat(ustring): ustring hash collision protection #4350

Are you sure you want to change the base?

feat(ustring): ustring hash collision protection #4350

Conversation

lgritz commented Jul 21, 2024

lgritz commented Aug 7, 2024

lgritz commented Aug 7, 2024

sfriedmapixar left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

etheory commented Aug 8, 2024

lgritz commented Aug 8, 2024

lgritz commented Aug 8, 2024

ThiagoIze commented Aug 8, 2024

lgritz commented Aug 8, 2024

lgritz commented Aug 9, 2024

etheory commented Aug 10, 2024

lgritz commented Aug 10, 2024

etheory commented Aug 12, 2024

lgritz commented Aug 12, 2024

lgritz commented Aug 15, 2024

etheory commented Aug 17, 2024

lgritz commented Aug 25, 2024