Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discover faster implementation for findIndexR. #468

Conversation

lehins
Copy link
Contributor

@lehins lehins commented Sep 18, 2023

Currently it is only in the benchmarks, but we should change the actual implementation as well. This commit also adds a benchmark for findIndexR that has to iterate the whole vector

Benchmark algorithms: RUNNING...
All
  findIndexR:               OK (0.18s)
    134  μs ±  11 μs
  findIndexR_naïve:         OK (0.48s)
    72.6 ms ± 4.4 ms
  findIndexR_manual:        OK (0.23s)
    110  μs ± 5.5 μs
  findIndexR (none):        OK (0.15s)
    4.92 ms ± 416 μs
  findIndexR_naïve (none):  OK (0.54s)
    77.1 ms ± 6.7 ms
  findIndexR_manual (none): OK (0.13s)
    4.05 ms ± 358 μs

This is the version from master:

Benchmark algorithms: RUNNING...
All                 
  findIndexR:        OK (0.18s)
    136  μs ±  11 μs
  findIndexR_naïve:  OK (0.47s)
    69.8 ms ± 6.5 ms
  findIndexR_manual: OK (0.23s)
    221  μs ±  11 μs

Currently it is only in the benchmarks, but we should change the actual
implementation as well. This commit also adds a benchmark for
`findIndexR` that has to iterate the whole vector
@Shimuuar
Copy link
Contributor

Note that implementation differ in rather pathological case: VG.findIndexR (const False) $ V.replicate 10 (undefined :: Int). Current implementation is as strict in element as predicate so expression above will return Nothing while string version will return ⊥.

Also benchmark is a bit pathological. It disallows inlining of predicate so GHC has to box value. I wonder how things will change if GHC get to see that predicate is strict and it could pass unboxed double.

@lehins
Copy link
Contributor Author

lehins commented Sep 18, 2023

Current implementation is as strict in element as predicate so expression above will return Nothing while string version will return ⊥.

This benchmark is for unboxed vectors which can't store bottom. This means we could use this implementation for Unboxed, Primitive and Storable, but not for Boxed, which is already slower anyways. Am I wrong about this?

None of the combinations of INLINE and INLINEABLE made a difference on performance, eg this benchmark:

, bench "findIndexR_inlined" $ nf findIndexR_inline ( \ !x -> x < indexFindThreshold, as)
findIndexR_inline :: (Double -> Bool, Vector Double) -> Maybe Int
{-# INLINE findIndexR_inline #-}
findIndexR_inline (pred, v) = go $ V.length v - 1
  where go i | i < 0                              = Nothing
             | inline (pred (V.unsafeIndex v i))  = Just i
             | otherwise                          = go $ i-1

has the same performance as the one without any inlining on master.

@Shimuuar
Copy link
Contributor

It doesn't change anything because nf prevents inlining anyway. Below are benchmarks result on my machine (commit 5935afa):

  findIndexR:               OK (0.26s)
    187  μs ±  11 μs
  findIndexR_manual:        OK (0.19s)
    169  μs ±  12 μs
  findIndexR (none):        OK (0.21s)
    6.79 ms ± 419 μs
  findIndexR_manual (none): OK (0.20s)
    6.16 ms ± 429 μs

However. If I allow partial application of predicate like:

findIndexR :: (Double -> Bool) -> Vector Double -> Maybe Int
{-# INLINE findIndexR #-}
findIndexR_manual :: (Double -> Bool) -> Vector Double -> Maybe Int
{-# INLINE findIndexR_manual #-}
    , bench "findIndexR" $ nf (findIndexR (<indexFindThreshold)) as

Results change drastically:

  findIndexR:               OK (0.18s)
    53.4 μs ± 5.3 μs
  findIndexR_manual:        OK (0.23s)
    209  μs ±  11 μs
  findIndexR (none):        OK (0.20s)
    1.53 ms ±  86 μs
  findIndexR_manual (none): OK (0.12s)
    7.65 ms ± 698 μs

I'm not sure why hand rolled loop performed so much worse. I didn't look into core. But it seems that performance gains of handrolled loop are highly situational

@lehins
Copy link
Contributor Author

lehins commented Sep 19, 2023

Alright, than I think it makes sense to abandon this idea of optimizing findIndexR, especially since even the gains that I observed were tiny.

@Shimuuar Thank you for investigating it a bit further!

@lehins lehins closed this Sep 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants