Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove inline from cython classes #49873

Merged
merged 9 commits into from
Dec 6, 2022

Conversation

WillAyd
Copy link
Member

@WillAyd WillAyd commented Nov 23, 2022

Part of #33926 adding -Winline to the compiler options for me generates a 14 MB file with 157,000 lines of warnings about how functions aren't getting inlined.

I doubt that any of our Cython inline functions actually get inlined, but figured these are easiest to start with. The cython documentation itself states that class-level cdef functions have almost no chance of being inlined

https://cython.readthedocs.io/en/stable/src/userguide/pyrex_differences.html?highlight=inline#cdef-inline

@WillAyd
Copy link
Member Author

WillAyd commented Nov 23, 2022

FWIW gcc has an -finline_functions flag that would be enabled by default in our release builds. This would allow the compiler to decide for us what could be inline'd rather than having us try and say:

https://gcc.gnu.org/onlinedocs/gcc-12.1.0/gcc/Optimize-Options.html

But right now I think that gets clobbered by the amount of manually specified inlines we do. There's a mixed bag of errors but a lot are param inline-unit-growth limit reached which means the inlining would grow the compilation unit size too much

@WillAyd
Copy link
Member Author

WillAyd commented Nov 23, 2022

Similar function for CPython discourages its use:

https://docs.python.org/3/c-api/intro.html#c.Py_ALWAYS_INLINE

Marking blindly a static inline function with Py_ALWAYS_INLINE can result in worse performances 
(due to increased code size for example). The compiler is usually smarter than the developer 
for the cost/benefit analysis.

@mroeschke mroeschke added Refactor Internal refactoring of code Build Library building on various platforms labels Nov 23, 2022
Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So there no more warnings in these spots where you removed inline?

Also do % timits show no performance change in the relevant code paths?

@WillAyd
Copy link
Member Author

WillAyd commented Nov 29, 2022

Wasn't sure the best way to do a timeit so went a little more extreme, removed as many inline statements as possible (couldn't remove them from util.pxd) and ran the full test suite overnight. Results are below - I think any changes are spurious.

       before           after         ratio
     [82d271fe]       [9d9cca88]
     <main>           <remove-class-level-inline>
+     1.52±0.01ms         3.17±1ms     2.08  gil.ParallelRolling.time_rolling('mean')
+         130±3μs          194±1μs     1.49  index_cached_properties.IndexCache.time_is_monotonic_increasing('CategoricalIndex')
+         133±6μs        194±0.8μs     1.46  index_cached_properties.IndexCache.time_is_monotonic_decreasing('CategoricalIndex')
+       203±0.6μs          269±1μs     1.33  categoricals.Indexing.time_intersection
+     1.67±0.07ms       2.08±0.5ms     1.25  index_cached_properties.IndexCache.time_is_unique('MultiIndex')
+     12.5±0.04ms         15.6±4ms     1.24  rolling.Rank.time_rank('Series', 10, 'int', True, True, 'average')
+     1.41±0.01ms      1.73±0.01ms     1.23  timeseries.Lookup.time_lookup_and_cleanup
+      98.1±0.3μs        119±0.5μs     1.21  index_cached_properties.IndexCache.time_is_monotonic_decreasing('PeriodIndex')
+      98.2±0.4μs        119±0.5μs     1.21  index_cached_properties.IndexCache.time_is_monotonic_increasing('PeriodIndex')
+       100±0.1μs        121±0.1μs     1.21  index_cached_properties.IndexCache.time_is_monotonic_decreasing('DatetimeIndex')
+      99.3±0.7μs        120±0.4μs     1.21  index_cached_properties.IndexCache.time_is_monotonic_decreasing('TimedeltaIndex')
+        1.46±0ms       1.76±0.3ms     1.21  index_cached_properties.IndexCache.time_is_unique('Float64Index')
+     100.0±0.2μs        120±0.3μs     1.20  index_cached_properties.IndexCache.time_is_monotonic_increasing('DatetimeIndex')
+      99.5±0.5μs        120±0.4μs     1.20  index_cached_properties.IndexCache.time_is_monotonic_increasing('TimedeltaIndex')
+         949±5μs         1.11±0ms     1.17  rolling.ForwardWindowMethods.time_rolling('Series', 10, 'int', 'sum')
+        1.65±0ms      1.91±0.01ms     1.16  algorithms.Hashing.time_series_categorical
+        1.02±0ms      1.17±0.02ms     1.15  rolling.ForwardWindowMethods.time_rolling('Series', 10, 'float', 'mean')
+         983±4μs         1.13±0ms     1.15  rolling.ForwardWindowMethods.time_rolling('Series', 10, 'int', 'mean')
+         926±4μs         1.06±0ms     1.15  rolling.ForwardWindowMethods.time_rolling('Series', 1000, 'float', 'sum')
+         929±2μs         1.07±0ms     1.15  rolling.ForwardWindowMethods.time_rolling('Series', 10, 'float', 'sum')
+         965±2μs      1.11±0.01ms     1.15  rolling.ForwardWindowMethods.time_rolling('Series', 1000, 'int', 'sum')
+        1.01±0ms      1.15±0.01ms     1.14  rolling.ForwardWindowMethods.time_rolling('DataFrame', 10, 'float', 'sum')
+     1.04±0.01ms      1.19±0.01ms     1.14  rolling.ForwardWindowMethods.time_rolling('DataFrame', 10, 'int', 'sum')
+        1000±4μs         1.14±0ms     1.14  rolling.Methods.time_method('Series', ('rolling', {'window': 1000}), 'float', 'sum')
+         993±2μs      1.13±0.01ms     1.14  rolling.ForwardWindowMethods.time_rolling('Series', 1000, 'int', 'mean')
+        1.01±0ms         1.15±0ms     1.14  rolling.ForwardWindowMethods.time_rolling('DataFrame', 1000, 'float', 'sum')
+     1.05±0.01ms      1.20±0.01ms     1.14  rolling.ForwardWindowMethods.time_rolling('DataFrame', 1000, 'int', 'sum')
+        1.09±0ms      1.24±0.01ms     1.14  rolling.ForwardWindowMethods.time_rolling('DataFrame', 10, 'float', 'mean')
+     1.08±0.01ms      1.22±0.01ms     1.13  rolling.ForwardWindowMethods.time_rolling('DataFrame', 10, 'int', 'mean')
+        1.08±0ms      1.22±0.01ms     1.13  rolling.ForwardWindowMethods.time_rolling('DataFrame', 1000, 'int', 'mean')
+        1.04±0ms         1.17±0ms     1.13  rolling.Methods.time_method('Series', ('rolling', {'window': 10}), 'int', 'sum')
+         791±6μs          892±6μs     1.13  rolling.Methods.time_method('Series', ('expanding', {}), 'float', 'sum')
+        1.01±0ms         1.14±0ms     1.13  rolling.Methods.time_method('Series', ('rolling', {'window': 10}), 'float', 'sum')
+     1.03±0.01ms         1.16±0ms     1.13  rolling.ForwardWindowMethods.time_rolling('Series', 1000, 'float', 'mean')
+         823±5μs          926±5μs     1.13  rolling.Methods.time_method('Series', ('expanding', {}), 'int', 'sum')
+         837±3μs          940±3μs     1.12  rolling.Methods.time_method('Series', ('expanding', {}), 'int', 'mean')
+     1.04±0.01ms         1.17±0ms     1.12  rolling.Methods.time_method('Series', ('rolling', {'window': 1000}), 'int', 'sum')
+        1.11±0ms         1.24±0ms     1.12  rolling.ForwardWindowMethods.time_rolling('DataFrame', 1000, 'float', 'mean')
+     1.13±0.01ms      1.27±0.01ms     1.12  rolling.Methods.time_method('DataFrame', ('rolling', {'window': 10}), 'int', 'sum')
+        1.09±0ms         1.22±0ms     1.12  rolling.Methods.time_method('DataFrame', ('rolling', {'window': 10}), 'float', 'sum')
+         803±3μs          899±3μs     1.12  rolling.Methods.time_method('Series', ('expanding', {}), 'float', 'mean')
+        1.21±0ms      1.36±0.01ms     1.12  rolling.Methods.time_method('DataFrame', ('rolling', {'window': 1000}), 'int', 'mean')
+        1.09±0ms      1.22±0.01ms     1.12  rolling.Methods.time_method('DataFrame', ('rolling', {'window': 1000}), 'float', 'sum')
+        965±20μs      1.08±0.06ms     1.12  period.PeriodIndexConstructor.time_from_pydatetime('D', False)
+        1.32±0ms      1.48±0.01ms     1.12  rolling.VariableWindowMethods.time_method('Series', '50s', 'int', 'mean')
+        1.13±0ms         1.26±0ms     1.12  rolling.Methods.time_method('Series', ('rolling', {'window': 1000}), 'int', 'mean')
+     1.13±0.01ms      1.26±0.01ms     1.12  rolling.Methods.time_method('DataFrame', ('rolling', {'window': 1000}), 'int', 'sum')
+        1.10±0ms         1.23±0ms     1.12  rolling.Methods.time_method('Series', ('rolling', {'window': 1000}), 'float', 'mean')
+     1.11±0.01ms      1.24±0.01ms     1.12  rolling.Methods.time_method('Series', ('rolling', {'window': 10}), 'float', 'mean')
+         882±3μs          985±8μs     1.12  rolling.Methods.time_method('DataFrame', ('expanding', {}), 'float', 'sum')
+     1.18±0.01ms      1.32±0.01ms     1.12  rolling.Methods.time_method('DataFrame', ('rolling', {'window': 1000}), 'float', 'mean')
+     1.14±0.01ms         1.27±0ms     1.12  rolling.Methods.time_method('Series', ('rolling', {'window': 10}), 'int', 'mean')
+        1.19±0ms         1.33±0ms     1.11  rolling.VariableWindowMethods.time_method('Series', '1h', 'float', 'sum')
+        854±30ns        951±100ns     1.11  index_cached_properties.IndexCache.time_values('DatetimeIndex')
+        1.18±0ms         1.31±0ms     1.11  rolling.Methods.time_method('DataFrame', ('rolling', {'window': 10}), 'float', 'mean')
+        1.22±0ms      1.36±0.01ms     1.11  rolling.Methods.time_method('DataFrame', ('rolling', {'window': 10}), 'int', 'mean')
+     1.24±0.02ms      1.38±0.02ms     1.11  rolling.Methods.time_method('Series', ('rolling', {'window': 10}), 'float', 'count')
+        1.38±0ms         1.53±0ms     1.11  rolling.VariableWindowMethods.time_method('DataFrame', '1h', 'float', 'mean')
+        1.19±0ms         1.32±0ms     1.11  rolling.VariableWindowMethods.time_method('Series', '1d', 'float', 'sum')
+        1.67±0ms      1.85±0.01ms     1.11  rolling.ForwardWindowMethods.time_rolling('Series', 10, 'int', 'kurt')
+     1.34±0.02ms      1.48±0.01ms     1.11  rolling.Methods.time_method('DataFrame', ('rolling', {'window': 1000}), 'float', 'count')
+        1.24±0ms      1.37±0.01ms     1.11  rolling.VariableWindowMethods.time_method('Series', '50s', 'int', 'sum')
+        1.32±0ms      1.47±0.01ms     1.11  rolling.VariableWindowMethods.time_method('Series', '1h', 'int', 'mean')
+     1.23±0.01ms         1.36±0ms     1.11  rolling.VariableWindowMethods.time_method('Series', '1d', 'int', 'sum')
+        1.42±0ms         1.57±0ms     1.11  rolling.VariableWindowMethods.time_method('DataFrame', '1h', 'int', 'mean')
+         558±1μs          618±1μs     1.11  reindex.ReindexMethod.time_reindex_method('pad', <function period_range>)
+        1.30±0ms         1.43±0ms     1.11  rolling.VariableWindowMethods.time_method('DataFrame', '50s', 'float', 'sum')
+         961±4μs         1.06±0ms     1.11  rolling.Methods.time_method('Series', ('expanding', {}), 'float', 'std')
+         553±2μs          612±1μs     1.11  reindex.ReindexMethod.time_reindex_method('pad', <function date_range>)
+     1.24±0.01ms      1.37±0.02ms     1.11  rolling.Methods.time_method('Series', ('rolling', {'window': 1000}), 'float', 'count')
+        1.29±0ms      1.43±0.01ms     1.10  rolling.VariableWindowMethods.time_method('Series', '50s', 'float', 'mean')
+         926±8μs         1.02±0ms     1.10  rolling.Methods.time_method('DataFrame', ('expanding', {}), 'int', 'mean')
+        1.28±0ms         1.41±0ms     1.10  rolling.VariableWindowMethods.time_method('Series', '1d', 'float', 'mean')
+     1.33±0.01ms         1.47±0ms     1.10  rolling.VariableWindowMethods.time_method('DataFrame', '50s', 'int', 'sum')
+         894±1μs          984±2μs     1.10  rolling.Methods.time_method('DataFrame', ('expanding', {}), 'float', 'mean')
+         920±6μs      1.01±0.01ms     1.10  rolling.Methods.time_method('DataFrame', ('expanding', {}), 'int', 'sum')
+        1.31±0ms         1.44±0ms     1.10  rolling.VariableWindowMethods.time_method('Series', '1d', 'int', 'mean')
+     1.25±0.01ms         1.37±0ms     1.10  rolling.VariableWindowMethods.time_method('Series', '1h', 'int', 'sum')
+        1.34±0ms         1.47±0ms     1.10  rolling.VariableWindowMethods.time_method('DataFrame', '1h', 'int', 'sum')
-       181±0.3μs          165±1μs     0.91  index_cached_properties.IndexCache.time_is_monotonic_increasing('UInt64Index')
-       82.4±10ms       74.6±0.5ms     0.91  algos.isin.IsinWithArangeSorted.time_isin(<class 'numpy.float64'>, 1000000)
-     2.24±0.08ms         2.02±0ms     0.90  frame_methods.Fillna.time_frame_fillna(False, 'pad', 'float32')
-      1.45±0.5μs         1.31±0μs     0.90  tslibs.fields.TimeGetStartEndField.time_get_start_end_field(0, 'start', 'month', None, 12)
-        749±30μs         672±30μs     0.90  dtypes.SelectDtypes.time_select_dtype_string_exclude('Int8')
-        653±30μs         585±20μs     0.90  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 5.0, <built-in function ge>)
-         473±1μs         421±10μs     0.89  dtypes.SelectDtypes.time_select_dtype_int_exclude('uint32')
-        454±20μs          401±5μs     0.88  dtypes.SelectDtypes.time_select_dtype_int_exclude('complex64')
-         454±1μs         400±10μs     0.88  dtypes.SelectDtypes.time_select_dtype_int_include(<class 'int'>)
-        11.1±1ms      9.77±0.03ms     0.88  arithmetic.FrameWithFrameWide.time_op_different_blocks(<built-in function gt>, (10000, 1000))
-     3.16±0.02ms       2.78±0.2ms     0.88  frame_methods.Fillna.time_frame_fillna(True, 'pad', 'float64')
-      3.00±0.1ms      2.58±0.01ms     0.86  frame_methods.Fillna.time_frame_fillna(True, 'pad', 'float32')
-      3.03±0.1ms      2.59±0.01ms     0.85  frame_methods.Fillna.time_frame_fillna(True, 'bfill', 'float32')
-      14.7±0.4ms       12.4±0.2ms     0.84  multiindex_object.SetOperations.time_operation('monotonic', 'int', 'intersection', None)
-        768±30μs         640±20μs     0.83  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 2, <built-in function ne>)
-       703±100μs         577±20μs     0.82  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 3.0, <built-in function eq>)
-        762±60μs         615±30μs     0.81  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 2, <built-in function ge>)
-         176±3μs        137±0.2μs     0.78  index_cached_properties.IndexCache.time_is_monotonic_increasing('Float64Index')
-      2.22±0.3μs       1.73±0.1μs     0.78  index_cached_properties.IndexCache.time_shape('CategoricalIndex')
-      1.77±0.3μs       1.37±0.1μs     0.77  index_cached_properties.IndexCache.time_values('CategoricalIndex')
-         178±3μs        137±0.3μs     0.77  index_cached_properties.IndexCache.time_is_monotonic_decreasing('Float64Index')
-       972±200μs         562±20μs     0.58  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 3.0, <built-in function lt>)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

As far as warnings go, this gets us down a few thousand, though there's still 150k+ lines of warnings. I don't know we will ever be able to use Winline since I think Cython uses it a lot.

@jbrockmendel
Copy link
Member

mind double-checking the worst 2ish with timeit?

@jreback
Copy link
Contributor

jreback commented Nov 30, 2022

does this change the build sizes (eg wheels)?

@WillAyd
Copy link
Member Author

WillAyd commented Nov 30, 2022

Wasn't sure how to use timeit for the first benchmark since it measures parallel execution, but a subsequent run of just that yields no changes:

asv continuous -f 1.1 upstream/main HEAD -b gil.ParallelRolling.time_rolling

[100.00%] ··· gil.ParallelRolling.time_rolling                                                                                   ok
[100.00%] ··· ======== =============
               method               
              -------- -------------
               median     43.2±1ms  
                mean    1.53±0.03ms 
                min     2.24±0.01ms 
                max     2.24±0.02ms 
                var     2.00±0.03ms 
                skew    2.75±0.05ms 
                kurt    3.04±0.02ms 
                std     2.16±0.01ms 
              ======== =============


BENCHMARKS NOT SIGNIFICANTLY CHANGED.

Here's the next few entries on this branch:

In [3]: import pandas as pd
   ...: N = 10**5
   ...: idx = pd.CategoricalIndex(range(N), range(N))
   ...: idx._cache = {}
   ...: %timeit idx.is_monotonic_increasing
   ...: %timeit idx.is_monotonic_decreasing
   ...: 
92.5 ns ± 0.651 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
93.7 ns ± 0.81 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

versus main

95.5 ns ± 0.888 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
99.2 ns ± 0.836 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

@WillAyd
Copy link
Member Author

WillAyd commented Nov 30, 2022

does this change the build sizes (eg wheels)?

Looks to be 24kb larger on this branch than main

@@ -1082,13 +1067,13 @@ def roll_min(ndarray[float64_t] values, ndarray[int64_t] start,
return _roll_min_max(values, start, end, minp, is_max=0)


cdef _roll_min_max(ndarray[numeric_t] values,
cdef _roll_min_max(ndarray[float64_t] values,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly we have type specialization built into this module but only ever call these functions with float64_t arrays. -Wunused-function does not warn about unused inline static functions so this went undetected

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, yeah there's an issue to have rolling ops not always cast & return float64 but better to remove this for now

@jbrockmendel
Copy link
Member

thanks for double-checking. this seems benign to me.

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Just a merge conflict

@mroeschke mroeschke added this to the 2.0 milestone Dec 6, 2022
Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. cc @jbrockmendel merge when ready

@jbrockmendel jbrockmendel merged commit d660831 into pandas-dev:main Dec 6, 2022
@jbrockmendel
Copy link
Member

thanks @WillAyd

@WillAyd WillAyd deleted the remove-class-level-inline branch April 12, 2023 20:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Build Library building on various platforms Refactor Internal refactoring of code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants