PERF: speed up CategoricalIndex.get_loc #23235

topper-123 · 2018-10-19T02:21:26Z

closes PERF: df.loc is 100x slower for CategoricalIndex than for normal Index #20395
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

This is the the final puzzle for #20395 and speeds up CategoricalIndex.get_loc for monotonic increasing indexes.

This PR supersedes #21699, so #21699 should be closed.

The problem with the current get_loc implementation is that CategoricalIndex._engine constantly recodes int8 arrays to int64 arrays:

pandas/pandas/core/indexes/category.py

Lines 365 to 369 in dc45fba

    
           @cache_readonly 
        
           def _engine(self): 
        
               # we are going to look things up with the codes themselves 
        
               return self._engine_type(lambda: self.codes.astype('i8'), len(self))

Notice the lambda: self.codes.astype('i8') part, which means that every time _engine.vgetter is called, an int64-array is created. This is expensive and we would ideally want to just use the original array dtype (int8, int16 or whatever) always and avoid this conversion.

A complicating issue is that it is not enough to just avoid the int64 transformation, as array.searchsorted apparantly needs a dtype-compatible input or else it is also very slow:

>>> n = 1_000_000
>>> ci = pd.CategoricalIndex(list('a' * n + 'b' * n + 'c' * n))
>>> %timeit ci.codes.searchsorted(1)  # search for left location of 'b'
7.38 ms   # slow
>>> code = np.int8(1)
>>> %timeit ci.codes.searchsorted(code)
2.57 µs  # fast

Solution

As CategoricalIndex.codes may be int8, int16, etc, the solution must
(1) have an indexing engine for each integer dtype and
(2) have the code for the key be translated into the same dtype as the codes array before calling searchsorted.

This PR does that, essentially.

Performance improvement examples

>>> n = 100_000
>>> ci = pd.CategoricalIndex(list('a' * n + 'b' * n + 'c' * n))
>>> %timeit ci.get_loc('b')
2.05 ms  # master
8.96 µs  # this PR
>>> n = 1_000_000
>>> ci = pd.CategoricalIndex(list('a' * n + 'b' * n + 'c' * n))
>>> %timeit ci.get_loc('b')
18.7 ms  # master
9.09 µs  # this PR

So we go from O(n) performance to O(1) performance.

The indexing_engines.py ASV results:

[  0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[  0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ··· Running indexing_engines.NumericEngineIndexing.time_get_loc                                                        7.18±1μs;...
[100.00%] ··· Running indexing_engines.ObjectEngineIndexing.time_get_loc                                                       7.04±0.4μs;...

pep8speaks · 2018-10-19T02:21:29Z

Hello @topper-123! Thanks for updating the PR.

There are no PEP8 issues in the file asv_bench/benchmarks/indexing_engines.py !
There are no PEP8 issues in the file pandas/core/indexes/category.py !
There are no PEP8 issues in the file pandas/tests/indexing/conftest.py !
There are no PEP8 issues in the file pandas/tests/indexing/test_indexing_engines.py !

Comment last updated on October 20, 2018 at 13:59 Hours UTC

shenker · 2018-10-19T05:16:51Z

Out of curiosity, does this fix improve .get_loc performance for MultiIndexes?

topper-123 · 2018-10-19T06:17:30Z

No, MultiIndex has a different implementation, and if I remember correctly, it always uses either int64 dtype or a pure python implementation. So the issue with recoding from int8 to int64 etc. should not be an issue there. (I have not studied the MultiEngine implementation in any great detail, so closer inspection might prove me wrong.)

codecov · 2018-10-19T18:47:54Z

Codecov Report

Merging #23235 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #23235      +/-   ##
==========================================
+ Coverage   92.22%   92.22%   +<.01%     
==========================================
  Files         169      169              
  Lines       50900    50902       +2     
==========================================
+ Hits        46943    46945       +2     
  Misses       3957     3957

Flag	Coverage Δ
#multiple	`90.65% <100%> (ø)`	⬆️
#single	`42.29% <75%> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/core/indexes/category.py	`97.84% <100%> (+0.01%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 14f25bb...a519cc4. Read the comment docs.

topper-123 · 2018-10-19T19:13:41Z

All tests pass.

@jreback, I've simplified this new version a lot now by adding 'hashtable_name' and 'hashtable_dtype' to the dtype list in index_class_helper.pxi.in. This means that we now avoid adding so many if-statements to the code.

jreback · 2018-10-19T19:17:26Z

lgtm. over to @jbrockmendel

jbrockmendel · 2018-10-19T21:32:16Z

pandas/_libs/algos.pyx

@@ -10,7 +10,8 @@ from libc.math cimport fabs, sqrt
 import numpy as np
 cimport numpy as cnp
 from numpy cimport (ndarray,
-                    NPY_INT64, NPY_UINT64, NPY_INT32, NPY_INT16, NPY_INT8,
+                    NPY_INT64, NPY_INT32, NPY_INT16, NPY_INT8,
+                    NPY_UINT64, NPY_UINT32, NPY_UINT16, NPY_UINT8,


Are all of these used?

NPY_UINT32, NPY_UINT16 and uint23_t and uint16_t are not used. This needs a discussion, see below.

pandas/_libs/index.pyx

jbrockmendel · 2018-10-19T22:44:47Z

asv_bench/benchmarks/indexing_engines.py

@@ -1,19 +1,31 @@
 import numpy as np

-from pandas._libs.index import (Int64Engine, UInt64Engine, Float64Engine,
-                                ObjectEngine)
+from pandas._libs import index as li


not a big deal, but I think elsewhere we import this as libindex

jbrockmendel · 2018-10-19T22:46:18Z

The explanation in the OP is exceptionally clear, thanks @topper-123.

If the possibly-unused imports are resolved, LGTM.

jbrockmendel · 2018-10-19T22:47:09Z

Actually, do we have tests that cover each of the dtypes that codes might take?

topper-123 · 2018-10-20T08:05:52Z

Currently, the algos in algos.pyx do not support uint32 and uint16. To get UInt32Engine and UInt16Engine to work we need to have at least ensure_uint32/_uint16 and is_monotonic_uint32/uint_16, but in that case we should probably just make all the algos have uint32/uint16 variants for consistency.

For exampel currently this fails:

>>> arr = np.array([1,2,3,4], dtype=np.uint32)
>>> engine = pd._libs.index.UInt32Engine(lambda: arr, 4)
>>> engine.is_monotonic_increasing
AttributeError: module 'pandas._libs.algos' has no attribute 'ensure_uint32'

As I understood @jreback, UInt(32|16)Engine is desired, so I'll add the various uint32/uint16 variants to algos.pyx to make the uint indexing engines work, is that correct?

I'll also add a test_indexing_engines.py to capture such errors at the one above.

topper-123 · 2018-10-20T12:06:37Z

Added various stuff to get UInt(32|16)Engine to work, added tests for same and other minor stuff.

topper-123 · 2018-10-20T18:03:47Z

@jbrockmendel, I've made tests for CategoricalIndex._engine being of the correct type.

Also, I've made some tests for engine themselves in test_indexing_engine.py. This last bit is needed because some engine types are not used yet in pandas. (These are UInt(8|16|32)Engine), so we need to check directly that the cython methods work for those engine types.

jreback · 2018-10-23T03:23:02Z

can you merge master and push

jreback · 2018-10-23T12:35:36Z

pandas/tests/indexing/test_indexing_engines.py

+from pandas import compat
+from pandas._libs import algos as libalgos, index as libindex
+
+


is there a way to just combine these into a single set of tests w/o regards to whether this is object / numeric. this is pretty duplicative set of tests

Following the discussion below, I could just remove the tests for ObjectEngine. ObjectEngine is used in lots of other places.

jreback · 2018-10-23T12:36:08Z

pandas/tests/indexing/test_indexing_engines.py

+        # monotonic increasing
+        engine = engine_type(lambda: arr, len(arr))
+        assert engine.is_monotonic_increasing is True
+        assert engine.is_monotonic_decreasing is False


you are adding a ton of tests here which is good. but i suspect a lot of this indexing on the engines is already tested elsewhere, can you remove it where appropriate.

Int(32|16|8)Engine, UInt(32|16|8)Engine and Float(32|16|8)Engine are all new, and of these new ones, only Int(32|16|8)Engine are used on code elsewhere (In CategoricalIndex).

This means that unexpected failures could happen in those unused engines...

I could remove tests for Int(8|16|32|64)Engine and (UInt|Float)64Engine.

not what I mean. I think we have explict test for these engines elsewhere, see if you can find them an dconsolidate them here. You have a nice set of unit tests, but we want to avoid some duplicaton elsehwere

I can't find any tests explicitly for these engines. Lots of tests for e.g. Index.is_monotonic etc. but that's a differerent set of test IMO and we should always have tests for the public APIs.

I've searched for "_engine" and "engine" in the pandas/tests directory, but I'm coming up short. I've found tests for various parser engines, but I can't find any for indexing engines (except one insignificant one in multi/test_contains.py).

If you really think there are tests for indexing engines, could you point me to an example?

jreback · 2018-10-26T01:11:59Z

thanks @topper-123

…_pr2 * repo_org/master: DOC: Add docstring validations for "See Also" section (pandas-dev#23143) TST: Fix test assertion (pandas-dev#23357) BUG: Handle Period in combine (pandas-dev#23350) REF: SparseArray imports (pandas-dev#23329) CI: Migrate some CircleCI jobs to Azure (pandas-dev#22992) DOC: update the is_month_start/is_month_end docstring (pandas-dev#23051) Partialy fix issue pandas-dev#23334 - isort pandas/core/groupby directory (pandas-dev#23341) TST: Add base test for extensionarray setitem pandas-dev#23300 (pandas-dev#23304) API: Add sparse Acessor (pandas-dev#23183) PERF: speed up CategoricalIndex.get_loc (pandas-dev#23235)

…ndas * repo_org/master: (23 commits) DOC: Add docstring validations for "See Also" section (pandas-dev#23143) TST: Fix test assertion (pandas-dev#23357) BUG: Handle Period in combine (pandas-dev#23350) REF: SparseArray imports (pandas-dev#23329) CI: Migrate some CircleCI jobs to Azure (pandas-dev#22992) DOC: update the is_month_start/is_month_end docstring (pandas-dev#23051) Partialy fix issue pandas-dev#23334 - isort pandas/core/groupby directory (pandas-dev#23341) TST: Add base test for extensionarray setitem pandas-dev#23300 (pandas-dev#23304) API: Add sparse Acessor (pandas-dev#23183) PERF: speed up CategoricalIndex.get_loc (pandas-dev#23235) fix and test incorrect case in delta_to_nanoseconds (pandas-dev#23302) BUG: Handle Datetimelike data in DataFrame.combine (pandas-dev#23317) TST: re-enable gbq tests (pandas-dev#23303) Switched references of App veyor to azure pipelines in the contributing CI section (pandas-dev#23311) isort imports-io (pandas-dev#23332) DOC: Added a Multi Index example for the Series.sum method (pandas-dev#23279) REF: Make PeriodArray an ExtensionArray (pandas-dev#22862) DOC: Added Examples for Series max (pandas-dev#23298) API/ENH: tz_localize handling of nonexistent times: rename keyword + add shift option (pandas-dev#22644) BUG: Let MultiIndex.set_levels accept any iterable (pandas-dev#23273) (pandas-dev#23291) ...

…xamples * repo_org/master: (83 commits) DOC: Add docstring validations for "See Also" section (pandas-dev#23143) TST: Fix test assertion (pandas-dev#23357) BUG: Handle Period in combine (pandas-dev#23350) REF: SparseArray imports (pandas-dev#23329) CI: Migrate some CircleCI jobs to Azure (pandas-dev#22992) DOC: update the is_month_start/is_month_end docstring (pandas-dev#23051) Partialy fix issue pandas-dev#23334 - isort pandas/core/groupby directory (pandas-dev#23341) TST: Add base test for extensionarray setitem pandas-dev#23300 (pandas-dev#23304) API: Add sparse Acessor (pandas-dev#23183) PERF: speed up CategoricalIndex.get_loc (pandas-dev#23235) fix and test incorrect case in delta_to_nanoseconds (pandas-dev#23302) BUG: Handle Datetimelike data in DataFrame.combine (pandas-dev#23317) TST: re-enable gbq tests (pandas-dev#23303) Switched references of App veyor to azure pipelines in the contributing CI section (pandas-dev#23311) isort imports-io (pandas-dev#23332) DOC: Added a Multi Index example for the Series.sum method (pandas-dev#23279) REF: Make PeriodArray an ExtensionArray (pandas-dev#22862) DOC: Added Examples for Series max (pandas-dev#23298) API/ENH: tz_localize handling of nonexistent times: rename keyword + add shift option (pandas-dev#22644) BUG: Let MultiIndex.set_levels accept any iterable (pandas-dev#23273) (pandas-dev#23291) ...

topper-123 mentioned this pull request Oct 19, 2018

PERF: CategoricalIndex.get_loc should avoid expensive cast of .codes to int64 #21699

Closed

4 tasks

topper-123 force-pushed the CategoricalIndex.get_loc_II branch from 77f35df to adfcb0c Compare October 19, 2018 02:27

topper-123 force-pushed the CategoricalIndex.get_loc_II branch 2 times, most recently from 505fc8c to f7b3487 Compare October 19, 2018 16:47

jreback added Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance labels Oct 19, 2018

jreback requested a review from jbrockmendel October 19, 2018 19:16

jreback added this to the 0.24.0 milestone Oct 19, 2018

jbrockmendel reviewed Oct 19, 2018

View reviewed changes

pandas/_libs/index.pyx Show resolved Hide resolved

jbrockmendel reviewed Oct 19, 2018

View reviewed changes

topper-123 force-pushed the CategoricalIndex.get_loc_II branch from f3ad606 to 1b55d52 Compare October 20, 2018 10:11

topper-123 force-pushed the CategoricalIndex.get_loc_II branch 5 times, most recently from dea7538 to f91f3c7 Compare October 20, 2018 17:59

topper-123 force-pushed the CategoricalIndex.get_loc_II branch from f91f3c7 to 08cd0b9 Compare October 20, 2018 18:05

topper-123 added 2 commits October 23, 2018 10:32

PERF: speed up CategoricalIndex.get_loc

5e2ccc8

add tests for engines and add is_monotonic_uint(32|16)

a519cc4

topper-123 force-pushed the CategoricalIndex.get_loc_II branch from 08cd0b9 to a519cc4 Compare October 23, 2018 09:57

jreback requested changes Oct 23, 2018

View reviewed changes

jreback approved these changes Oct 26, 2018

View reviewed changes

jreback merged commit f79e174 into pandas-dev:master Oct 26, 2018

topper-123 mentioned this pull request Oct 26, 2018

PERF: df.loc is 100x slower for CategoricalIndex than for normal Index #20395

Closed

topper-123 deleted the CategoricalIndex.get_loc_II branch October 26, 2018 01:50

tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request Nov 19, 2018

PERF: speed up CategoricalIndex.get_loc (pandas-dev#23235)

eb02b72

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

PERF: speed up CategoricalIndex.get_loc (pandas-dev#23235)

aa1fbb5

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

PERF: speed up CategoricalIndex.get_loc (pandas-dev#23235)

69f57d1

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PERF: speed up CategoricalIndex.get_loc #23235

PERF: speed up CategoricalIndex.get_loc #23235

topper-123 commented Oct 19, 2018 •

edited

Loading

pep8speaks commented Oct 19, 2018 •

edited

Loading

shenker commented Oct 19, 2018

topper-123 commented Oct 19, 2018

codecov bot commented Oct 19, 2018 •

edited

Loading

topper-123 commented Oct 19, 2018 •

edited

Loading

jreback commented Oct 19, 2018

jbrockmendel Oct 19, 2018

topper-123 Oct 20, 2018

jbrockmendel Oct 19, 2018

jbrockmendel commented Oct 19, 2018

jbrockmendel commented Oct 19, 2018

topper-123 commented Oct 20, 2018 •

edited

Loading

topper-123 commented Oct 20, 2018 •

edited

Loading

topper-123 commented Oct 20, 2018 •

edited

Loading

jreback commented Oct 23, 2018

jreback Oct 23, 2018

topper-123 Oct 23, 2018 •

edited

Loading

jreback Oct 23, 2018

topper-123 Oct 23, 2018

jreback Oct 23, 2018

topper-123 Oct 23, 2018 •

edited

Loading

jreback commented Oct 26, 2018

	@cache_readonly
	def _engine(self):

	# we are going to look things up with the codes themselves
	return self._engine_type(lambda: self.codes.astype('i8'), len(self))

		from pandas import compat
		from pandas._libs import algos as libalgos, index as libindex

PERF: speed up CategoricalIndex.get_loc #23235

PERF: speed up CategoricalIndex.get_loc #23235

Conversation

topper-123 commented Oct 19, 2018 • edited Loading

Solution

Performance improvement examples

pep8speaks commented Oct 19, 2018 • edited Loading

Comment last updated on October 20, 2018 at 13:59 Hours UTC

shenker commented Oct 19, 2018

topper-123 commented Oct 19, 2018

codecov bot commented Oct 19, 2018 • edited Loading

Codecov Report

topper-123 commented Oct 19, 2018 • edited Loading

jreback commented Oct 19, 2018

jbrockmendel Oct 19, 2018

Choose a reason for hiding this comment

topper-123 Oct 20, 2018

Choose a reason for hiding this comment

jbrockmendel Oct 19, 2018

Choose a reason for hiding this comment

jbrockmendel commented Oct 19, 2018

jbrockmendel commented Oct 19, 2018

topper-123 commented Oct 20, 2018 • edited Loading

topper-123 commented Oct 20, 2018 • edited Loading

topper-123 commented Oct 20, 2018 • edited Loading

jreback commented Oct 23, 2018

jreback Oct 23, 2018

Choose a reason for hiding this comment

topper-123 Oct 23, 2018 • edited Loading

Choose a reason for hiding this comment

jreback Oct 23, 2018

Choose a reason for hiding this comment

topper-123 Oct 23, 2018

Choose a reason for hiding this comment

jreback Oct 23, 2018

Choose a reason for hiding this comment

topper-123 Oct 23, 2018 • edited Loading

Choose a reason for hiding this comment

jreback commented Oct 26, 2018

topper-123 commented Oct 19, 2018 •

edited

Loading

pep8speaks commented Oct 19, 2018 •

edited

Loading

codecov bot commented Oct 19, 2018 •

edited

Loading

topper-123 commented Oct 19, 2018 •

edited

Loading

topper-123 commented Oct 20, 2018 •

edited

Loading

topper-123 commented Oct 20, 2018 •

edited

Loading

topper-123 commented Oct 20, 2018 •

edited

Loading

topper-123 Oct 23, 2018 •

edited

Loading

topper-123 Oct 23, 2018 •

edited

Loading