Discover faster implementation for `findIndexR`. #468

lehins · 2023-09-18T18:42:13Z

Currently it is only in the benchmarks, but we should change the actual implementation as well. This commit also adds a benchmark for findIndexR that has to iterate the whole vector

Benchmark algorithms: RUNNING...
All
  findIndexR:               OK (0.18s)
    134  μs ±  11 μs
  findIndexR_naïve:         OK (0.48s)
    72.6 ms ± 4.4 ms
  findIndexR_manual:        OK (0.23s)
    110  μs ± 5.5 μs
  findIndexR (none):        OK (0.15s)
    4.92 ms ± 416 μs
  findIndexR_naïve (none):  OK (0.54s)
    77.1 ms ± 6.7 ms
  findIndexR_manual (none): OK (0.13s)
    4.05 ms ± 358 μs

This is the version from master:

Benchmark algorithms: RUNNING...
All                 
  findIndexR:        OK (0.18s)
    136  μs ±  11 μs
  findIndexR_naïve:  OK (0.47s)
    69.8 ms ± 6.5 ms
  findIndexR_manual: OK (0.23s)
    221  μs ±  11 μs

Currently it is only in the benchmarks, but we should change the actual implementation as well. This commit also adds a benchmark for `findIndexR` that has to iterate the whole vector

Shimuuar · 2023-09-18T19:24:09Z

Note that implementation differ in rather pathological case: VG.findIndexR (const False) $ V.replicate 10 (undefined :: Int). Current implementation is as strict in element as predicate so expression above will return Nothing while string version will return ⊥.

Also benchmark is a bit pathological. It disallows inlining of predicate so GHC has to box value. I wonder how things will change if GHC get to see that predicate is strict and it could pass unboxed double.

lehins · 2023-09-18T20:15:43Z

Current implementation is as strict in element as predicate so expression above will return Nothing while string version will return ⊥.

This benchmark is for unboxed vectors which can't store bottom. This means we could use this implementation for Unboxed, Primitive and Storable, but not for Boxed, which is already slower anyways. Am I wrong about this?

None of the combinations of INLINE and INLINEABLE made a difference on performance, eg this benchmark:

, bench "findIndexR_inlined" $ nf findIndexR_inline ( \ !x -> x < indexFindThreshold, as)

findIndexR_inline :: (Double -> Bool, Vector Double) -> Maybe Int
{-# INLINE findIndexR_inline #-}
findIndexR_inline (pred, v) = go $ V.length v - 1
  where go i | i < 0                              = Nothing
             | inline (pred (V.unsafeIndex v i))  = Just i
             | otherwise                          = go $ i-1

has the same performance as the one without any inlining on master.

Shimuuar · 2023-09-19T19:50:06Z

It doesn't change anything because nf prevents inlining anyway. Below are benchmarks result on my machine (commit 5935afa):

  findIndexR:               OK (0.26s)
    187  μs ±  11 μs
  findIndexR_manual:        OK (0.19s)
    169  μs ±  12 μs
  findIndexR (none):        OK (0.21s)
    6.79 ms ± 419 μs
  findIndexR_manual (none): OK (0.20s)
    6.16 ms ± 429 μs

However. If I allow partial application of predicate like:

findIndexR :: (Double -> Bool) -> Vector Double -> Maybe Int
{-# INLINE findIndexR #-}
findIndexR_manual :: (Double -> Bool) -> Vector Double -> Maybe Int
{-# INLINE findIndexR_manual #-}

    , bench "findIndexR" $ nf (findIndexR (<indexFindThreshold)) as

Results change drastically:

  findIndexR:               OK (0.18s)
    53.4 μs ± 5.3 μs
  findIndexR_manual:        OK (0.23s)
    209  μs ±  11 μs
  findIndexR (none):        OK (0.20s)
    1.53 ms ±  86 μs
  findIndexR_manual (none): OK (0.12s)
    7.65 ms ± 698 μs

I'm not sure why hand rolled loop performed so much worse. I didn't look into core. But it seems that performance gains of handrolled loop are highly situational

lehins · 2023-09-19T20:46:55Z

Alright, than I think it makes sense to abandon this idea of optimizing findIndexR, especially since even the gains that I observed were tiny.

@Shimuuar Thank you for investigating it a bit further!

Discover faster implementation for findIndexR.

5935afa

Currently it is only in the benchmarks, but we should change the actual implementation as well. This commit also adds a benchmark for `findIndexR` that has to iterate the whole vector

lehins mentioned this pull request Sep 18, 2023

Is there an efficient way to break/span/findIndex from the right end? #172

Closed

lehins closed this Sep 19, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Discover faster implementation for `findIndexR`. #468

Discover faster implementation for `findIndexR`. #468

lehins commented Sep 18, 2023

Shimuuar commented Sep 18, 2023

lehins commented Sep 18, 2023

Shimuuar commented Sep 19, 2023

lehins commented Sep 19, 2023

Discover faster implementation for findIndexR. #468

Discover faster implementation for findIndexR. #468

Conversation

lehins commented Sep 18, 2023

Shimuuar commented Sep 18, 2023

lehins commented Sep 18, 2023

Shimuuar commented Sep 19, 2023

lehins commented Sep 19, 2023

Discover faster implementation for `findIndexR`. #468

Discover faster implementation for `findIndexR`. #468