
Extremely slow and inefficient hash function for openArray[byte] for 64-bit systems #23678

Closed
arnetheduck opened this issue Jun 4, 2024 · 15 comments

Comments

@arnetheduck
Contributor

arnetheduck commented Jun 4, 2024

Description

The murmurhash implementation uses 32-bit limbs (even on 64-bit systems) and inserts range checking inside the loop, causing it to perform extremely poorly due to branching and the inability to optimize the loop, for example by unrolling it.

In addition, it returns a 32-bit value for a 64-bit hash, effectively biasing the hash function to half of the number space, which has the potential to double collisions depending on how it's used.

Nim Version

77c0409

Current Output

No response

Expected Output

No response

Possible Solution

No response

Additional Information

No response

@arnetheduck
Contributor Author

arnetheduck commented Jun 4, 2024

Here's what a profiler looks like in an application that tries to use a Table with a 20-byte key of uniform random distribution to do efficient lookups:

[profiler screenshot]

@ringabout
Member

ringabout commented Jun 5, 2024

ref #12022
ref #11581

I copied an implementation of MurmurHash3_x64_128 from @timotheecour
https://github.com/timotheecour/vitanim/blob/master/murmur/murmur.nim

In https://github.com/PeterScott/murmur3

MurmurHash3_x64_128 is the best of the lot, if you're using a 64-bit machine. Its throughput is 250% higher than MurmurHash3_x86_32, but it has roughly the same latency. It has a 128-bit output.

@arnetheduck Would you mind checking the performance of MurmurHash3_x64_128 on your project using the modified implementation?

@c-blake
Contributor

c-blake commented Jun 5, 2024

The 32-bit output is even more of an issue than the speed, IMO, as it is surprising. There was originally some concern about having the string hash values from the C/C++ backends match the hash values from the JavaScript backend. The minimum version of NodeJS/js/ECMAScript/whatever, and in particular the BigInt support, may have evolved since the last time this came up, but I imagine matching still matters?

In timing string hash functions for table lookup, per-hash time is often neglected (as shows up even in @ringabout's quote of PeterScott just above!), but keys in a Table are often short. This is probably not exhaustive, but you want to at least 1) stay in L1 to avoid confusing CPU/system memory bandwidth interaction effects (more sensitive to broader workloads) with hashing perf., 2) be sure to measure both per-byte and per-hash costs, 3) do statistical error analysis so that when comparing any two hashers you have error bars, and 4) measure on multiple generations of backend CPUs, not just whatever your laptop/desktop happens to be, because results vary a lot.

One way to do all of the above is to loop over substrings of each length of the same string (one that fits well within L1), in several rounds, taking the minimum times across rounds to filter out system noise and dynamic CPU frequency scaling effects [1], and then doing a linear regression on a t = a + b*n formula to get a ± σ_a, b ± σ_b, which covers 1), 2), and 3) [2], and then doing so over several machines to address 4). 2..3 CPUs is really just a minimum diversity of deployments. You might also check multiple backend C compilers, build modes like PGO, etc.

The most basic reason to use time rather than throughput is just that "time is additive": "kicks of noise" (visually like the Brownian motion of pollen being hit by molecules under a microscope, as seen by Brown) are added in, whereas throughput numbers/"reciprocal space" add fuel to an already confusing noise fire. Another reason the a + b*n form is a nice summary for hashers in particular is that it lets you easily calculate "break-evens". E.g., if hash1 has a higher a but a smaller b than hash2 (a common situation), how many bytes n does a string need to be for hash1 to be faster? Setting t1 = a1 + n*b1 equal to t2 = a2 + n*b2 yields n = (a1 - a2)/(b2 - b1), and you can even use uncertainty propagation to get an approximate range.
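
To make that break-even formula concrete, here is a tiny sketch with made-up coefficients (not the measured numbers below):

# Sketch only: t = a + b*n, with a = per-hash cost and b = per-byte cost in cycles.
proc breakEven(a1, b1, a2, b2: float): float =
  ## String length n at which hash1 (bigger a, smaller b) starts beating hash2.
  (a1 - a2) / (b2 - b1)

# Hypothetical numbers: hash1 = 20 cycles/call + 0.05 cycles/byte,
# hash2 = 5 cycles/call + 0.9 cycles/byte => break-even near 18 bytes.
echo breakEven(20.0, 0.05, 5.0, 0.9)   # ~17.6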

In terms of concrete string hasher recommendations: for a long while now I have been patching my Nim installs to when-conditionally #include a C impl of an early variant of Wang Yi's hash, have just dealt with the const Table not matching the run-time Table (another issue is that besides JS we need a NimVM impl!), and its perf has been consistently near the top of the pack. My Nim port over at cligen/strUt.hashWY needs attention, though (the equivalent of the C version's "big switch at entry"), and there is a "lose 1..2 bits per round if you hash a hash" effect some people worry about.

A better recommendation might be Google's Farm Hash, which does quite well with C backends of the past 5..8-ish years and has been stress-tested over more diverse deployments at Google. It has decent specializations for 16-, 32-, and 64-byte strings. So, a 64-bit-output Farm hash would probably be a good default and is my primary recommendation, but as noted, whatever you do needs JavaScript and NimVM attention as well as Nim with pointers.

As just one deployment example/datapoint (non-PGO gcc, C impls), I can offer these numbers on an i7-6700k, with times in clock cycles assuming 4.7 GHz:

hasher       TimePerByte                  TimePerHash
mem-dlcl  : 0.01092 +- 0.00021 *$2 + 31.726  +- 0.059  *$3
mem-fold  : 0.03162 +- 0.00012 *$2 + 6.036   +- 0.012  *$3
mem-gx    : 0.04471 +- 0.00011 *$2 + 14.672  +- 0.011  *$3
mem-farm  : 0.16570 +- 0.00014 *$2 + -1.8574 +- 0.0039 *$3
mem-cb4   : 0.17094 +- 0.00017 *$2 + 4.5186  +- 0.0095 *$3
mem-cb2   : 0.17737 +- 0.00021 *$2 + 4.946   +- 0.012  *$3
mem-WY    : 0.20418 +- 0.00036 *$2 + 22.792  +- 0.025  *$3
mem-wy    : 0.20431 +- 0.00059 *$2 + 11.031  +- 0.028  *$3
mem-xx    : 0.23417 +- 0.00025 *$2 + 36.827  +- 0.041  *$3
mem-nim2  : 0.4461  +- 0.0012  *$2 + 13.848  +- 0.092  *$3
mem-rj    : 0.83094 +- 0.00025 *$2 + 17.465  +- 0.045  *$3
mem-mmr   : 0.86504 +- 0.00034 *$2 + -0.985  +- 0.019  *$3
mem-ph    : 0.89815 +- 0.00035 *$2 + 11.662  +- 0.016  *$3
mem-sip   : 0.97731 +- 0.00017 *$2 + 23.887  +- 0.019  *$3
mem-bz1   : 1.38698 +- 0.00029 *$2 + 0.327   +- 0.039  *$3
mem-cb3   : 1.39364 +- 0.00037 *$2 + 0.663   +- 0.042  *$3
mem-cb1   : 1.40357 +- 0.00097 *$2 + 4.155   +- 0.052  *$3
mem-gnu   : 2.08673 +- 0.00042 *$2 + 4.112   +- 0.025  *$3
mem-glib  : 2.08804 +- 0.00031 *$2 + 3.368   +- 0.038  *$3
mem-djb2  : 2.08806 +- 0.00019 *$2 + 3.352   +- 0.030  *$3
mem-fnv0a : 2.78310 +- 0.00017 *$2 + -0.137  +- 0.023  *$3
mem-fnv0  : 2.78315 +- 0.00021 *$2 + -0.193  +- 0.042  *$3
mem-fnv1a : 2.78325 +- 0.00024 *$2 + 1.239   +- 0.027  *$3
mem-fnv1  : 2.78330 +- 0.00023 *$2 + 1.258   +- 0.023  *$3
mem-sdbm  : 2.7855  +- 0.0017  *$2 + 2.029   +- 0.097  *$3
mem-elf   : 3.47865 +- 0.00022 *$2 + 2.701   +- 0.019  *$3
mem-nim   : 3.4805  +- 0.0021  *$2 + 3.341   +- 0.068  *$3
mem-pjw   : 4.28175 +- 0.00042 *$2 + 3.945   +- 0.039  *$3
mem-ELF   : 4.95375 +- 0.00022 *$2 + 3.268   +- 0.037  *$3
mem-crc   : 5.5681  +- 0.0010  *$2 + 3.52    +- 0.17   *$3

but the fastest of these either have poor distribution (fold is just 8-byte words xor'd on top of each other) or use architecture-specific speed-ups like Intel SIMD (Dan Lemire's CL) or gx (SIMD & AES when available, etc.). gcc used to be worse at compiling farm, though - WY was much faster back in 2017 and is now slower.

EDIT: I emailed @ringabout some C farm hash code to evaluate for porting to Nim/JS/NimVM, but it should be noted that performance may vary a lot across Nim ports of C code unless you are shipping assembly. That also applies to just taking PeterScott's analysis of his C impl. Case in point: "hash_mmr" above is some old variant of Murmur.

Footnotes

  1. There are times when you don't want to assess the perfectly happy path of min times but rather the whole messy distribution, but the problem there is that the mess depends on freakin' everything going on in the CPU/systems under measurement and so is almost the definition of irreproducible. Also, often when you first hash a string you've just copied it from somewhere or are hashing many. So, hot-every-cache times are not wildly irrelevant.

  2. For 3), you can probably use fitl with, to be concrete, something like BATCH_EMUL=1 ./unimpl|awk '{print $3,$2,$1*$2}'|fitl -s0 -c,=,n,b -b100 1 2 3; ./unimpl|awk '{print $3,$2,$1*$2}'|fitl -s0 -c,=,n,b -b100 1 2 3, which I use to measure the improvement of a system call batching example.

ringabout removed their assignment Jun 6, 2024
@demotomohiro
Contributor

It seems Nim's runtime checks don't affect the speed of murmurhash that much.
I wrote benchmark code and compared the speed of a release build and a danger build.
It hashes 100000000-, 11-, 8- and 4-byte data.
https://gist.github.com/demotomohiro/62530b5eb31ca3d38efe541d038c9675

nim c -r -d:release benchstrhash.nim

Looong string hash bench
Time: 85960μ sec
11 bytes
Time: 216825μ sec
8 bytes
Time: 181579μ sec
4 bytes
Time: 154506μ sec

nim c -r -d:danger benchstrhash.nim

Looong string hash bench
Time: 84530μ sec
11 bytes
Time: 174127μ sec
8 bytes
Time: 140961μ sec
4 bytes
Time: 120784μ sec

Benchmarked with latest devel Nim and GCC 13.2.1.

I changed the murmurHash code so that the backend C compiler can remove some of Nim's runtime checks:
demotomohiro@71ac58b
I compiled with nim c -r -d:release --passC:"-S -masm=intel" benchstrhash.nim and read the assembly code.
There was no instruction for a bound check inside the main loop.

nim c -r -d:release benchstrhash.nim

Looong string hash bench
Time: 84454μ sec
11 bytes
Time: 176174μ sec
8 bytes
Time: 147779μ sec
4 bytes
Time: 140995μ sec

nim c -r -d:danger benchstrhash.nim

Looong string hash bench
Time: 84446μ sec
11 bytes
Time: 169371μ sec
8 bytes
Time: 134352μ sec
4 bytes
Time: 114318μ sec

It slightly improved the performance of murmurhash for small inputs.
But it is tested only with my GCC. I'm not sure it is worth merging into devel Nim.
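
For readers curious what such a change can look like in general, here is an illustrative sketch (not the linked commit itself): reading fixed-size limbs through an index whose upper bound is computed once lets the backend C compiler prove every access is in range, so bound checks can be hoisted or removed from the hot loop.

# Sketch only; a MurmurHash3_x86_32-style limb loop, finalization/tail omitted.
proc mix32(h, k: uint32): uint32 {.inline.} =
  var kk = k * 0xcc9e2d51'u32
  kk = (kk shl 15) or (kk shr 17)
  kk = kk * 0x1b873593'u32
  var hh = h xor kk
  hh = (hh shl 13) or (hh shr 19)
  hh * 5'u32 + 0xe6546b64'u32

proc hashLimbs(x: openArray[byte]): uint32 =
  let n4 = x.len - (x.len mod 4)   # highest index touched is n4 - 1 < x.len
  var i = 0
  while i < n4:                    # optimizer can prove i + 3 < x.len here
    let k = uint32(x[i]) or (uint32(x[i+1]) shl 8) or
            (uint32(x[i+2]) shl 16) or (uint32(x[i+3]) shl 24)
    result = mix32(result, k)
    inc i, 4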

@demotomohiro
Contributor

This table shows hash sizes and the probability of random collisions:
https://en.wikipedia.org/wiki/Birthday_attack#Mathematics

When a hash is used for a Table, it works correctly even if there is a hash collision.
Hash collisions can make a Table slower; how much slowdown they cause depends on the frequency of collisions.

Table uses the lowest n bits of a hash when the length of the internal storage is 2^n.
So using a 64-bit hash doesn't affect the performance of Table when the length of the internal storage is shorter than 2^32.
But when the length is longer than 2^32, a 32-bit hash doesn't map to the entire internal storage and causes many hash collisions.
If you know the table size will never exceed 2^31 and you define Hash as int32, a 32-bit hash can make the internal storage of Table a bit smaller than a 64-bit hash would, without increasing the hash collision frequency.
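
A minimal sketch of that indexing idea (not Table's actual source, and assuming a 64-bit system):

proc slot(h: uint64; storageLen: int): int =
  assert (storageLen and (storageLen - 1)) == 0   # power-of-two storage
  int(h and uint64(storageLen - 1))               # "lowest n bits"

let h32only = 0x00000000DEADBEEF'u64   # a hash confined to its low 32 bits
echo slot(h32only, 1 shl 16)   # fine: the low bits can select any of 2^16 slots
echo slot(h32only, 1 shl 33)   # bit 32 is always 0, so half of the slots are unreachable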

@c-blake
Contributor

c-blake commented Jun 15, 2024

Thanks for the contribution, @demotomohiro. I don't think anyone disputes that a limited output range of Hash limits the scalability of tables using it. You can define your own hash(x: string), though. So, it's more a question of "how necessary" this scalability is for a stdlib default. That defense being made, I 100% agree that it is surprising (as I said initially), and in a bad way.

A maybe-interesting alternative application area for a hash(string) is unique-count estimation, where you consider hash(x)/Hash.high as a sort of deterministic PRNG with x as the "seed" and can estimate cardinality from how close you sample to 1, conceptually using nothing more than very basic probability. This idea, however, requires the output range of hash(x) to be at least well documented (which it is not), or else knowledge of internals, which goes against the culture of hash functions being black boxes.
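
A tiny sketch of that estimator, assuming a hash whose output really is uniform over the full 64-bit range (exactly the documentation gap mentioned) and using the noisiest possible version, a single maximum:

import std/[hashes, math]

proc estUnique(xs: openArray[string]): float =
  ## Max-of-uniforms estimator: with n distinct keys the largest normalized
  ## hash is about n/(n+1), so n is about m/(1 - m). One maximum is very noisy;
  ## real estimators (KMV, HyperLogLog) keep k order statistics instead.
  var m = 0.0
  for x in xs:
    m = max(m, float(cast[uint64](hash(x))) / float(high(uint64)))
  if m >= 1.0: Inf else: m / (1.0 - m)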

Anyway, more on-point, I have ported that Farm hash to pure Nim that works in the NimVM, the JavaScript backend, and the C/C++ backends, producing the same values as a C variant that passed all the SMHasher distribution tests. It will never be as fast as things leveraging CPU-specific features, but it uses 64-byte limbs and special-cases strings shorter than that to be loop-less, so it should be a worthwhile improvement.

I can only benchmark on bare metal on about 3 or 4 CPU arches with gcc-13 & clang-17, and I might ask @demotomohiro & others to run a measurement program on Looong & maybe ARM/some other arches before we really decide to go that way or settle other aspects of the implementation, such as whether to when nimvm|js certain things that today's C compilers seem poor at mapping to efficient code themselves (at least as seen through the lens of current Nim C/C++ codegen on x86_64).

@c-blake
Contributor

c-blake commented Jun 15, 2024

So, with the below Nim program (and cligen installed, compiled with -d:ubench and PGO under both clang-17 and gcc-13 on A)lder-L)ake & S)ky-L)ake), I get the attached graph (in gnuplot you just roughly plot 'output' using 4:1).

It's just 2 CPUs and (on purpose) almost a "maximally micro" benchmark, with the required caveat that these days CPU resources themselves are a complex network of caches & queues, so "costs in context" can be very different from isolated costs. You can clearly see the step functions from the loop/overall structure of the hash function and can also ballpark the cost as 1/4 to 3/4 CPU cycles/byte, which at ~5 GHz is roughly 20 down to 7 GB/s. At least for me on Nim-devel it gives the same hash values under JS and under const/NimVM evaluation as under C, and, as mentioned, I cross-checked it against hash values from a C impl that passed SMHasher tests for pseudo-randomness. (I don't really like the SMHasher speed evaluations for a few reasons we needn't go into, but the distribution analysis is decent.)

As mentioned, one thing that is kind of nice about this one is that it doesn't rely on any CPU-specific features for its speed (which of course throttles it in some other senses). What I would recommend is that @demotomohiro runs it on his Looong, someone else runs it on an RPi ARM, someone else runs it on Apple Silicon ARM, and yet a third/fourth person runs it on AWS or Google Cloud ARM (or maybe one lucky person generous with their time and curiosity has access to all of those with both gcc & clang?).

type B = char # uint8 byte ..?
const k0 = 0xc3a5c85c97cb3127u64 # Primes on (2^63, 2^64) for various uses
const k1 = 0xb492b66fbe98f273u64 
const k2 = 0x9ae16a3b2f90404fu64 

proc load4e(s: openArray[B], o=0): uint32 {.inline.} =
  ((uint32(s[o + 3]) shl 24) or (uint32(s[o + 2]) shl 16) or
   (uint32(s[o + 1]) shl  8) or  uint32(s[o + 0]))

proc load8e(s: openArray[B], o=0): uint64 {.inline.} =
  ((uint64(s[o + 7]) shl 56) or (uint64(s[o + 6]) shl 48) or
   (uint64(s[o + 5]) shl 40) or (uint64(s[o + 4]) shl 32) or
   (uint64(s[o + 3]) shl 24) or (uint64(s[o + 2]) shl 16) or
   (uint64(s[o + 1]) shl  8) or  uint64(s[o + 0]))

proc load4(s: openArray[B], o=0): uint32 {.inline.} =
  when nimvm: load4e(s, o)
  else:
    when defined(js): load4e(s, o)
    else: copyMem result.addr, s[o].addr, result.sizeof

proc load8(s: openArray[B], o=0): uint64 {.inline.} =
  when nimvm: load8e(s, o)
  else:
    when defined(js): load8e(s, o)
    else: copyMem result.addr, s[o].addr, result.sizeof

proc lenU(s: openArray[B]): uint64 {.inline.} = s.len.uint64

proc shiftMix(v: uint64): uint64 {.inline.} = v xor (v shr 47)

proc rotR(v: uint64; bits: cint): uint64 {.inline.} =
  (v shr bits) or (v shl (64 - bits))

proc len16(u: uint64; v: uint64; mul: uint64): uint64 {.inline.} =
  var a = (u xor v)*mul
  a = a xor (a shr 47)
  var b = (v xor a)*mul
  b = b xor (b shr 47)
  b*mul

proc len0_16(s: openArray[B]): uint64 {.inline.} =
  if s.len >= 8:
    let mul = k2 + 2*s.lenU
    let a   = load8(s) + k2
    let b   = load8(s, s.len - 8)
    let c   = rotR(b, 37)*mul + a
    let d   = (rotR(a, 25) + b)*mul
    len16 c, d, mul
  elif s.len >= 4:
    let mul = k2 + 2*s.lenU
    let a   = load4(s).uint64
    len16 s.lenU + (a shl 3), load4(s, s.len - 4), mul
  elif s.len > 0:
    let a = uint32(s[0])
    let b = uint32(s[s.len shr 1])
    let c = uint32(s[s.len - 1])
    let y = a      + (b shl 8)
    let z = s.lenU + (c shl 2)
    shiftMix(y*k2 xor z*k0)*k2
  else: k2      # s.len == 0

proc len17_32(s: openArray[B]): uint64 {.inline.} =
  let mul = k2 + 2*s.lenU
  let a = load8(s)*k1
  let b = load8(s, 8)
  let c = load8(s, s.len - 8)*mul
  let d = load8(s, s.len - 16)*k2
  len16 rotR(a + b, 43) + rotR(c, 30) + d, a + rotR(b + k2, 18) + c, mul

proc len33_64(s: openArray[B]): uint64 {.inline.} =
  let mul = k2 + 2*s.lenU
  let a = load8(s)*k2
  let b = load8(s, 8)
  let c = load8(s, s.len - 8)*mul
  let d = load8(s, s.len - 16)*k2
  let y = rotR(a + b, 43) + rotR(c, 30) + d
  let z = len16(y, a + rotR(b + k2, 18) + c, mul)
  let e = load8(s, 16)*mul
  let f = load8(s, 24)
  let g = (y + load8(s, s.len - 32))*mul
  let h = (z + load8(s, s.len - 24))*mul
  len16 rotR(e + f, 43) + rotR(g, 30) + h, e + rotR(f + a, 18) + g, mul

type Pair = tuple[first, second: uint64]

proc weakLen32withSeeds2(w, x, y, z, a, b: uint64): Pair {.inline.} =
  var a = a + w
  var b = rotR(b + a + z, 21)
  let c = a
  a += x
  a += y
  b += rotR(a, 44)
  result[0] = a + z
  result[1] = b + c

proc weakLen32withSeeds(s: openArray[B]; o: int; a,b: uint64): Pair {.inline.} =
  weakLen32withSeeds2 load8(s, o     ), load8(s, o + 8),
                      load8(s, o + 16), load8(s, o + 24), a, b

proc hashFarm*(s: openArray[B]): uint64 {.inline.} =
  if s.len <= 16: return len0_16(s)
  if s.len <= 32: return len17_32(s)
  if s.len <= 64: return len33_64(s)
  const seed = 81u64 # not const to use input `h`
  var
    o = 0         # s[] ptr arith -> variable origin variable `o`
    x = seed
    y = seed*k1 + 113
    z = shiftMix(y*k2 + 113)*k2
    v, w: Pair
  x = x*k2 + load8(s)
  let eos    = ((s.len - 1) div 64)*64
  let last64 = eos + ((s.len - 1) and 63) - 63
  while true:
    x = rotR(x + y + v[0] + load8(s, o+8), 37)*k1
    y = rotR(y + v[1] + load8(s, o+48), 42)*k1
    x = x xor w[1]
    y += v[0] + load8(s, o+40)
    z = rotR(z + w[0], 33)*k1
    v = weakLen32withSeeds(s, o+0 , v[1]*k1, x + w[0])
    w = weakLen32withSeeds(s, o+32, z + w[1], y + load8(s, o+16))
    swap z, x
    inc o, 64
    if o == eos: break
  let mul = k1 + ((z and 0xff) shl 1)
  o = last64
  w[0] += (s.lenU - 1) and 63
  v[0] += w[0]
  w[0] += v[0]
  x = rotR(x + y + v[0] + load8(s, o+8), 37)*mul
  y = rotR(y + v[1] + load8(s, o+48), 42)*mul
  x = x xor w[1]*9
  y += v[0]*9 + load8(s, o+40)
  z = rotR(z + w[0], 33)*mul
  v = weakLen32withSeeds(s, o+0 , v[1]*mul, x + w[0])
  w = weakLen32withSeeds(s, o+32, z + w[1], y + load8(s, o+16))
  swap z, x
  len16 len16(v[0],w[0],mul) + shiftMix(y)*k0 + z, len16(v[1],w[1],mul) + x, mul

type Hash = int64
proc hash*(x: openArray[B]): Hash {.inline.} = x.hashFarm.Hash

when isMainModule:
  when defined testValues:
    import std/os
    for s in commandLineParams(): echo s.hash,"\t",s.len,"\t",s
  elif defined ubench:
    import std/[times, monotimes, strformat]
    proc hashTm(reps=100, mins=30, offs=0..0, lens=1..128, ghz=4.6) =
      ## `farm | fitl -cn -cb -b999 1 2 3` -> nsec = perByte*bytes + perHash
      ## time formula with bootstrapped error bars on the two coefficients.
      ## CHECK `reps` ALTERS `dt` *TO ENSURE COMPILER DIDN'T DO LOOP->MULTIPLY*.
      var s = ""                  # Hashes fast relative to mem xfer=>stay in L1
      for i in 0 .. (offs.b + lens.b): s.add chr(ord('A') + i mod 64)
      for n in lens:              # Can batch work => study many lens
        for o in offs:            # Perf can vary across align => study|avg over
          let p = cast[ptr UncheckedArray[char]](s[o].addr)
          var h {.volatile.}: uint64
          var dt = int64.high     # min(mins trials) hides CPU ramp up delay..
          for trial in 1..mins:   #..or interrupts/competing load/etc.
            let t0 = getMonoTime()
            for r in 1..reps:     # Hash fast relative to tm qry => Loop
              h += hashFarm(toOpenArray[char](p, 0, n - 1))
            dt = min(dt, (getMonoTime() - t0).inNanoseconds)
          echo &"{dt.float*ghz/reps.float:.1f} {reps*n} {reps} {n} {o} {h}"
    import cligen; dispatch hashTm, help={
      "reps": "Repeat Loops for dt", "mins": "num Loops to min(dt) over",
      "offs": "offsets to do", "lens": "string lens", "ghz": "cvt ns -> cycles"}
  elif B is char:
    const x = hash("1234"); echo x # 882600748797058222
    let   y = hash("1234"); echo y # 882600748797058222

[graph: farm7clGccAlderSky - hashFarm time vs. string length for clang/gcc on Alder Lake & Sky Lake]

For longer strings the step function more or less continues at 64B boundaries.

@arnetheduck
Contributor Author

arnetheduck commented Jun 18, 2024

We've moved on since to a different hash function and setup entirely so testing this one would be quite involved.

That said, anything is better than the current one, especially if it fixes the 32-bit bug ;) @c-blake looks like he's already on top of measuring performance over a more diverse input range, though - the most important thing is to establish a minimally correct baseline, which means that as long as we're getting a full 64-bit range out of it, it's already going to be a massive improvement and should be merged asap.

Once that's done, one can investigate more modern options such as https://github.com/ogxd/gxhash/blob/main/article/article.pdf

If you know table size never exceeds 2^31 and you define Hash as int32, 32 bits hash can make internal storage size of Table a bit smaller than 64 bits hash without increasing hash collision frequency.

The specifics of Table in particular are somewhat beside the point (it's only one potential consumer of Hash and not a very good one) - i.e. Hash is a 64-bit type, so hash should return that full range. Another way to fix things would be to make Hash 32-bit, which is a common way of saving some space (since a lot of hash tables store the hash to avoid extra key conversions).

On this note, it's also quite unfortunate that it's an int, which comes with range checking - a uint would make a lot more sense.

@c-blake
Contributor

c-blake commented Jun 18, 2024

I think we all agree that the biggest problem is the counterintuitive restriction of Hash to 32 bits. This addresses that as well as giving a speed boost at the same time, as per the discussion in this issue.

The loop-less structure for <64 B keys (by far the most "meta-common" key sets in my own experience) alone should make this always faster or at least about the same. For large keys, doing whole 64 B/512 b L1 cache lines at a time should also help. I can post a 3-way comparison program here that people can run on their favorite backends/CPUs with PGO or not, etc., if there is interest.

I should also say, I had taken the @arnetheduck "20-byte key of uniform random distribution" as "merely an example", but IF the keys are already close to pseudorandom bits then you can also cast the first|last 4|8 bytes to a Hash and call it a day with an essentially zero-cost hash function, leveraging the fact that something else in the system has already created a hash. This is actually a great example use case of defining one's own proc hash.
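
For example, a sketch of that idea with a hypothetical 20-byte digest type (names here are illustrative, not from the stdlib):

import std/hashes

type Digest20 = array[20, byte]   # hypothetical, already-pseudorandom key

proc hash(k: Digest20): Hash {.inline.} =
  # The key is already uniform random bits, so just reinterpret the first
  # 8 bytes as the Hash; essentially zero cost and portable to the VM/JS too.
  var h: uint64
  for i in 0 ..< 8:
    h = h or (uint64(k[i]) shl (8 * i))
  cast[Hash](h)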

Also, the "mem-gx" in my original table refers to that same gxhash that this references. Note (again) that one needs to be careful with very fast per-byte hashes that the per-hash time does not clobber your performance for small keys. My measurement is just 1 cpu/backend and in C to boot and may be off, but according to Measuremancer (14.672 +- 0.011)/(0.04471 +- 0.00011) = 328.16 +- 0.84. So, unless your strings are > 328 bytes long you may well be better off with a tiny per-Hash function.

@c-blake
Contributor

c-blake commented Jun 18, 2024

Well, for the performance curious, since this issue did start with a performance complaint: take a Nim built with the referenced PR and this little cligen utility program:

import std/[times, monotimes, strformat], std/hashes {.all.}
type Fun = enum orig, murmur, farm
proc hashTm(reps=100, mins=50, offs=0..0, lens=1..128, ghz=4.6, funs: seq[Fun])=
  ## `hashTm | fitl -cn -cb -b999 1 2 3` -> nsec = perByte*bytes + perHash
  ## time formula with bootstrapped error bars on the two coefficients.
  ## CHECK `reps` ALTERS `dt` *TO ENSURE COMPILER DIDN'T DO LOOP->MULTIPLY*.
  let funs = if funs.len > 0: funs else: @[orig, murmur, farm]
  var s = ""                    # Hashes fast relative to mem xfer=>stay in L1
  for i in 0 .. (offs.b + lens.b): s.add chr(ord('A') + i mod 64)
  for fun in funs:
    for n in lens:              # Can batch work => study many lens
      for o in offs:            # Perf can vary across aligns => study|avg over
        let p = cast[ptr UncheckedArray[byte]](s[o].addr)
        var h {.volatile.}: uint64
        var dt = int64.high     # min(mins trials) hides CPU ramp up delay..
        for trial in 1..mins:   #..or interrupts/competing load/etc.
          template doTm(expr) =
            let t0 = getMonoTime()
            for r in 1..reps: h += expr # Hash fast relative to tm qry => Loop
            dt = min(dt, (getMonoTime() - t0).inNanoseconds)
          case fun
          of orig  : doTm cast[uint64](hashData(p, n))
          of murmur: doTm cast[uint64](murmurHash(toOpenArray[byte](p, 0, n - 1)))
          of farm  : doTm hashFarm(toOpenArray[byte](p, 0, n - 1))
        echo &"{dt.float*ghz/reps.float:.1f} {reps*n} {reps} {n} {o} {h}"

import cligen; dispatch hashTm, help={
  "reps": "Repeat Loops for dt", "mins": "num Loops to min(dt) over",
  "offs": "offsets to do", "lens": "string lens", "ghz": "cvt ns -> cycles",
  "funs": "hash funs to time"}

This little driver script, run-bench.sh, compiles it 4 different ways and saves the data:

#!/bin/sh
b="chrt 99 taskset 0xC env -i CLIGEN=/n PATH=$PATH"
cpuTag="$1"
xtra="$2"
nim c --cc:gcc -d:danger  tmHash && 0 $b ./tmHash $xtra > ${cpuTag}g
nim c --cc:clang -d:danger tmHash && 0 $b ./tmHash $xtra > ${cpuTag}c
# These are two PGO driver scripts I have around; Roll your own.
nim-pgo  tmHash './tmHash>/n' --mm:arc && 0 $b ./tmHash $xtra > ${cpuTag}gPG
nim-pgoc tmHash './tmHash>/n' --mm:arc && 0 $b ./tmHash $xtra > ${cpuTag}cPG

so that after ./run-bench A -g5.2 for a 5.2 GHz Alder Lake you can use this gnuplot script generator (plot.sh, say) to run ./plot.sh A Alder to make some plots:

#!/bin/sh
cpuT0=$1
cpuT1=$2
cat > doit.gpi <<EOF
set term png size 1920,1080 font "Helvetica,10"
set output "/t/3NimHashes${cpuT1}.png"
set yrange [0:156]
set ytics 10
set xtics 16
set grid
set key bottom right
plot "${cpuT0}c" u 4:1 t "${cpuT1} clang", \
     "${cpuT0}g" u 4:1 t "${cpuT1} gcc", \
     "${cpuT0}cPG" u 4:1 t "${cpuT1} clangPGO", \
     "${cpuT0}gPG" u 4:1 t "${cpuT1} gccPGO"
EOF
gnuplot doit.gpi

which I also did for a SkyLake with a run-bench.sh S and plot.sh S Sky. Attached are both plots. Just 2 cpus, of course.

[graph: 3NimHashesAlder - hashData vs. murmur vs. farm on Alder Lake, 4 compilation modes]

[graph: 3NimHashesSky - the same comparison on Sky Lake]

You can see all 3 Nim string hashes present after the PR as clearly separated clusters of 4 compilation modes (2 compilers * (pgo, not)). The improved performance of @narimiran 's murmur over the original hashData is quite clear, as is the improvement of Farm hash over murmur. You can also see that a straight-line regression is a less faithful summary of the Farm performance due to its large-stride step-function character. (These are points, not lines, because I was too lazy to split the data files and coordinate colors, but the perf separation turned out to be vivid enough for this to work out.)

Anyway, I realize that it's only 2 CPUs, but I tried to "show my work" enough that anyone interested could do their own analysis on their own CPUs of interest. I do think that just reading the functions is an ok "first order" explanation of the performance differences (byte-at-a-time for the slowest, word-at-a-time for the middle, cache-line-at-a-time for the fastest). Perf tracking one's intuitions is always a good cross-check.

@demotomohiro
Contributor

@c-blake
I modified your code and added MurmurHash3_x64_128 to compare the performance of 4 hash functions.
I used @ringabout 's MurmurHash3_x64_128.nim.

It outputs each result to a separate file and measures times with 1~256-byte inputs.
tmHash.nim:

import std/[times, monotimes, strformat], std/hashes {.all.}
import MurmurHash3_x64_128

proc hashTm(reps=100, mins=50, offs=0..0, lens=1..256, ghz=4.6)=
  ## `hashTm | fitl -cn -cb -b999 1 2 3` -> nsec = perByte*bytes + perHash
  ## time formula with bootstrapped error bars on the two coefficients.
  ## CHECK `reps` ALTERS `dt` *TO ENSURE COMPILER DIDN'T DO LOOP->MULTIPLY*.
  var s = ""                    # Hashes fast relative to mem xfer=>stay in L1
  for i in 0 .. (offs.b + lens.b): s.add chr(ord('A') + i mod 64)
  for n in lens:              # Can batch work => study many lens
    for o in offs:            # Perf can vary across aligns => study|avg over
      let p = cast[ptr UncheckedArray[byte]](s[o].addr)
      var h {.volatile.}: uint64
      var dt = int64.high     # min(mins trials) hides CPU ramp up delay..
      for trial in 1..mins:   #..or interrupts/competing load/etc.
        let t0 = getMonoTime()
        for r in 1..reps:
          h += (when defined(orig):
                  cast[uint64](hashData(p, n))
                elif defined(murmur):
                  cast[uint64](murmurHash(toOpenArray[byte](p, 0, n - 1)))
                elif defined(farm):
                  hashFarm(toOpenArray[byte](p, 0, n - 1))
                elif defined(murmur3x64128):
                  cast[uint64](MurmurHash3_x64_128.hash3(toOpenArray[byte](p, 0, n - 1)))
                else:
                  {.error: "No define".}
                  0)
        dt = min(dt, (getMonoTime() - t0).inNanoseconds)
      echo &"{dt.float*ghz/reps.float:.1f} {reps*n} {reps} {n} {o} {h}"

import cligen; dispatch hashTm, help={
  "reps": "Repeat Loops for dt", "mins": "num Loops to min(dt) over",
  "offs": "offsets to do", "lens": "string lens", "ghz": "cvt ns -> cycles"}

run-bench.sh:

#!/bin/sh
b="taskset 0xC env -i CLIGEN=/n PATH=$PATH"
cpuTag="$1"
xtra="$2"
nim c --cc:gcc -d:danger -d:orig tmHash && $b ./tmHash $xtra > orig${cpuTag}
nim c --cc:gcc -d:danger -d:murmur tmHash && $b ./tmHash $xtra > murmur${cpuTag}
nim c --cc:gcc -d:danger -d:farm tmHash && $b ./tmHash $xtra > farm${cpuTag}
nim c --cc:gcc -d:danger -d:murmur3x64128 tmHash && $b ./tmHash $xtra > murmur3x64128${cpuTag}
#nim c --cc:clang -d:danger tmHash && 0 $b ./tmHash $xtra > ${cpuTag}c
# These are two PGO driver scripts I have around; Roll your own.
#nim-pgo  tmHash './tmHash>/n' --mm:arc && 0 $b ./tmHash $xtra > ${cpuTag}gPG
#nim-pgoc tmHash './tmHash>/n' --mm:arc && 0 $b ./tmHash $xtra > ${cpuTag}cPG

I extended the yrange so that I can see the Raspberry Pi 3 results.
plot.sh:

#!/bin/sh
cpuT0=$1
cpuT1=$2
cat > doit.gpi <<EOF
set term png size 1920,1080
set output "3NimHashes${cpuT1}.png"
set yrange [0:500]
set ytics 50
set xtics 16
set grid
set key bottom right
plot "orig${cpuT0}" u 4:1 t "${cpuT1} gcc-orig", \
     "murmur${cpuT0}" u 4:1 t "${cpuT1} gcc-murmur", \
     "farm${cpuT0}" u 4:1 t "${cpuT1} gcc-farm", \
     "murmur3x64128${cpuT0}" u 4:1 t "${cpuT1} gcc-murmur3x64128"
EOF
gnuplot doit.gpi

I ran it on my Ivy Bridge CPU, using the latest devel Nim:

./run-bench.sh Ivy -g2.1
sh plot.sh Ivy ivy

[graph: 3NimHashesivy - the four hash functions on Ivy Bridge]

On a Raspberry Pi 3 (compiled with the latest successful devel-branch build from Nim Nightly):

./run-bench.sh RP3 -g1.2
sh plot.sh RP3 rp3

[graph: 3NimHashesrp3 - the four hash functions on Raspberry Pi 3]

Farm hash is the fastest on both CPUs, and it seems MurmurHash3_x64_128 has performance problems in the loop that hashes the remaining bytes.
MurmurHash3_x64_128 is faster than the stdlib murmurHash when the input size is larger than 32 bytes and a multiple of 16, but slows down when (input size mod 16) is larger.

I have a question about your benchmark code (tmHash.nim):

let t0 = getMonoTime()
for r in 1..reps: h += expr # Hash fast relative to tm qry => Loop
dt = min(dt, (getMonoTime() - t0).inNanoseconds)

In this innermost loop, the hash function is repeatedly called with the same input.
A hash function is pure and always returns the same result for the same input.
So I think Nim or the backend C compiler could do an optimization that moves the hash function call outside of the innermost loop.
https://en.wikipedia.org/wiki/Loop-invariant_code_motion
GCC supports loop-invariant motion:
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-fmove-loop-invariants
But it seems such an optimization is not done.
I thought the volatile pragma on the h variable prevented that optimization, but I measured almost the same time after removing the volatile pragma.
Is this just because GCC failed to detect that the hash function called inside the loop is a pure function?

@c-blake
Contributor

c-blake commented Jun 22, 2024

First, thank you for all your hard work, for showing your work, and for the plots that help confirm my recommendation of Farm hash! (Showing some improvement even for 16-byte strings and much more for 32 B, as it should intuitively, but it is best to confirm intuition with measurement when not too hard!) You had mentioned using a Looong CPU, right? It might be nice to see that, too, or at least hear the confirmation in words. I also confirmed the basic recommendations/stylized results on an AMD Zen2.

My PR adding that under nimPreviewFarmHash was already merged, but I guess there is some (unexpected by me) openness to fast-tracking this to be the default for Nim 2.2. So, maybe the whens need to be reversed to make the change opt-out rather than opt-in. With either approach, though, I expect we could close this issue.

Second, Re: your question:

Is this just because GCC failed to detect that the hash function called inside the loop is a pure function?

In short, yes. A bit more detail: the volatile is intended to prevent the optimization (and seemed to be needed with clang for me), but code complexity alone makes it too hard for gcc to see the purity. I do have that comment (CHECK reps ALTERS dt) because I think that even with volatile a "sufficiently smart" compiler might still be able to convert the h += into a multiply, defeating the purpose of the loop, which is to enlarge the time period being measured. (It would be more bulletproof to automate that check, of course, but I haven't (yet!) seen the need.)
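
For what it's worth, a rough sketch of what automating that check could look like (timedReps is a hypothetical helper, not part of the program above):

import std/[hashes, monotimes, times]

proc timedReps(reps: int): int64 =
  # Time `reps` volatile-accumulated hash calls on a fixed key.
  let s = "some key material for the sanity check"
  var h {.volatile.}: uint64
  let t0 = getMonoTime()
  for r in 1 .. reps:
    h += cast[uint64](hash(s))
  (getMonoTime() - t0).inNanoseconds

let d1 = timedReps(1_000_000)
let d2 = timedReps(2_000_000)
# If the inner loop survived optimization, doubling reps should roughly double dt.
doAssert d2.float > 1.5 * d1.float, "hash loop appears to have been collapsed"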

@demotomohiro
Contributor

@c-blake, thank you for your reply, the great benchmark code, and for adding a fast hash function to Nim's stdlib!

The Ivy Bridge CPU is the same CPU I used to run the benchmark code in my first post on this issue.
The benchmark code in my first post runs the Murmur hash from the hashes module.
The following message means it ran the hash function with a very long string (100,000,000 bytes) and printed the measured time.

Looong string hash bench
Time: 85960μ sec

I just put extra 'o's in 'Long' because the input string is large.
I should have written a correct and clear message.

Is this just because GCC failed to detect that the hash function called inside the loop is a pure function?

In short, yes. A bit more detail: the volatile is intended to prevent the optimization (and seemed to be needed with clang for me), but code complexity alone makes it too hard for gcc to see the purity. I do have that comment (CHECK reps ALTERS dt) because I think that even with volatile a "sufficiently smart" compiler might still be able to convert the h += into a multiply, defeating the purpose of the loop, which is to enlarge the time period being measured. (It would be more bulletproof to automate that check, of course, but I haven't (yet!) seen the need.)

Changing reps and seeing whether dt changes is an easy way to check whether the compiler did some optimization that keeps me from correctly measuring what I want to measure.
If gcc/clang becomes smart enough to find that the hash function is pure and converts the h += into a multiply, I think changing the input string inside the loop would prevent such an optimization.
For example:

for r in 1..reps:
  h += expr
  s[o] = r.char

But it adds small noise to the measured time.

@c-blake
Contributor

c-blake commented Jun 24, 2024

Sorry. I guess the CPU is only a 2-'o' Loong anyway (often with a "-son" suffix), and I didn't read your code carefully enough. 🤦 Re: s[o] = r.char - an ok plan. I didn't want to create cache-write traffic until compilers make it needed.

@narimiran
Member

Fixed with #23793

This issue was closed.