Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: speed up CategoricalIndex.get_loc #23235

Merged
merged 2 commits into from
Oct 26, 2018

Conversation

topper-123
Copy link
Contributor

@topper-123 topper-123 commented Oct 19, 2018

This is the the final puzzle for #20395 and speeds up CategoricalIndex.get_loc for monotonic increasing indexes.

This PR supersedes #21699, so #21699 should be closed.

The problem with the current get_loc implementation is that CategoricalIndex._engine constantly recodes int8 arrays to int64 arrays:

@cache_readonly
def _engine(self):
# we are going to look things up with the codes themselves
return self._engine_type(lambda: self.codes.astype('i8'), len(self))

Notice the lambda: self.codes.astype('i8') part, which means that every time _engine.vgetter is called, an int64-array is created. This is expensive and we would ideally want to just use the original array dtype (int8, int16 or whatever) always and avoid this conversion.

A complicating issue is that it is not enough to just avoid the int64 transformation, as array.searchsorted apparantly needs a dtype-compatible input or else it is also very slow:

>>> n = 1_000_000
>>> ci = pd.CategoricalIndex(list('a' * n + 'b' * n + 'c' * n))
>>> %timeit ci.codes.searchsorted(1)  # search for left location of 'b'
7.38 ms   # slow
>>> code = np.int8(1)
>>> %timeit ci.codes.searchsorted(code)
2.57 µs  # fast

Solution

As CategoricalIndex.codes may be int8, int16, etc, the solution must
(1) have an indexing engine for each integer dtype and
(2) have the code for the key be translated into the same dtype as the codes array before calling searchsorted.

This PR does that, essentially.

Performance improvement examples

>>> n = 100_000
>>> ci = pd.CategoricalIndex(list('a' * n + 'b' * n + 'c' * n))
>>> %timeit ci.get_loc('b')
2.05 ms  # master
8.96 µs  # this PR
>>> n = 1_000_000
>>> ci = pd.CategoricalIndex(list('a' * n + 'b' * n + 'c' * n))
>>> %timeit ci.get_loc('b')
18.7 ms  # master
9.09 µs  # this PR

So we go from O(n) performance to O(1) performance.

The indexing_engines.py ASV results:

[  0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[  0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ··· Running indexing_engines.NumericEngineIndexing.time_get_loc                                                        7.18±1μs;...
[100.00%] ··· Running indexing_engines.ObjectEngineIndexing.time_get_loc                                                       7.04±0.4μs;...

@pep8speaks
Copy link

pep8speaks commented Oct 19, 2018

Hello @topper-123! Thanks for updating the PR.

Comment last updated on October 20, 2018 at 13:59 Hours UTC

@shenker
Copy link

shenker commented Oct 19, 2018

Out of curiosity, does this fix improve .get_loc performance for MultiIndexes?

@topper-123
Copy link
Contributor Author

No, MultiIndex has a different implementation, and if I remember correctly, it always uses either int64 dtype or a pure python implementation. So the issue with recoding from int8 to int64 etc. should not be an issue there. (I have not studied the MultiEngine implementation in any great detail, so closer inspection might prove me wrong.)

@topper-123 topper-123 force-pushed the CategoricalIndex.get_loc_II branch 2 times, most recently from 505fc8c to f7b3487 Compare October 19, 2018 16:47
@codecov
Copy link

codecov bot commented Oct 19, 2018

Codecov Report

Merging #23235 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #23235      +/-   ##
==========================================
+ Coverage   92.22%   92.22%   +<.01%     
==========================================
  Files         169      169              
  Lines       50900    50902       +2     
==========================================
+ Hits        46943    46945       +2     
  Misses       3957     3957
Flag Coverage Δ
#multiple 90.65% <100%> (ø) ⬆️
#single 42.29% <75%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/indexes/category.py 97.84% <100%> (+0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 14f25bb...a519cc4. Read the comment docs.

@topper-123
Copy link
Contributor Author

topper-123 commented Oct 19, 2018

All tests pass.

@jreback, I've simplified this new version a lot now by adding 'hashtable_name' and 'hashtable_dtype' to the dtype list in index_class_helper.pxi.in. This means that we now avoid adding so many if-statements to the code.

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance labels Oct 19, 2018
@jreback jreback added this to the 0.24.0 milestone Oct 19, 2018
@jreback
Copy link
Contributor

jreback commented Oct 19, 2018

lgtm. over to @jbrockmendel

@@ -10,7 +10,8 @@ from libc.math cimport fabs, sqrt
import numpy as np
cimport numpy as cnp
from numpy cimport (ndarray,
NPY_INT64, NPY_UINT64, NPY_INT32, NPY_INT16, NPY_INT8,
NPY_INT64, NPY_INT32, NPY_INT16, NPY_INT8,
NPY_UINT64, NPY_UINT32, NPY_UINT16, NPY_UINT8,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are all of these used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NPY_UINT32, NPY_UINT16 and uint23_t and uint16_t are not used. This needs a discussion, see below.

@@ -1,19 +1,31 @@
import numpy as np

from pandas._libs.index import (Int64Engine, UInt64Engine, Float64Engine,
ObjectEngine)
from pandas._libs import index as li
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a big deal, but I think elsewhere we import this as libindex

@jbrockmendel
Copy link
Member

The explanation in the OP is exceptionally clear, thanks @topper-123.

If the possibly-unused imports are resolved, LGTM.

@jbrockmendel
Copy link
Member

Actually, do we have tests that cover each of the dtypes that codes might take?

@topper-123
Copy link
Contributor Author

topper-123 commented Oct 20, 2018

Currently, the algos in algos.pyx do not support uint32 and uint16. To get UInt32Engine and UInt16Engine to work we need to have at least ensure_uint32/_uint16 and is_monotonic_uint32/uint_16, but in that case we should probably just make all the algos have uint32/uint16 variants for consistency.

For exampel currently this fails:

>>> arr = np.array([1,2,3,4], dtype=np.uint32)
>>> engine = pd._libs.index.UInt32Engine(lambda: arr, 4)
>>> engine.is_monotonic_increasing
AttributeError: module 'pandas._libs.algos' has no attribute 'ensure_uint32'

As I understood @jreback, UInt(32|16)Engine is desired, so I'll add the various uint32/uint16 variants to algos.pyx to make the uint indexing engines work, is that correct?

I'll also add a test_indexing_engines.py to capture such errors at the one above.

@topper-123
Copy link
Contributor Author

topper-123 commented Oct 20, 2018

Added various stuff to get UInt(32|16)Engine to work, added tests for same and other minor stuff.

@topper-123 topper-123 force-pushed the CategoricalIndex.get_loc_II branch 5 times, most recently from dea7538 to f91f3c7 Compare October 20, 2018 17:59
@topper-123
Copy link
Contributor Author

topper-123 commented Oct 20, 2018

@jbrockmendel, I've made tests for CategoricalIndex._engine being of the correct type.

Also, I've made some tests for engine themselves in test_indexing_engine.py. This last bit is needed because some engine types are not used yet in pandas. (These are UInt(8|16|32)Engine), so we need to check directly that the cython methods work for those engine types.

@jreback
Copy link
Contributor

jreback commented Oct 23, 2018

can you merge master and push

from pandas import compat
from pandas._libs import algos as libalgos, index as libindex


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a way to just combine these into a single set of tests w/o regards to whether this is object / numeric. this is pretty duplicative set of tests

Copy link
Contributor Author

@topper-123 topper-123 Oct 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following the discussion below, I could just remove the tests for ObjectEngine. ObjectEngine is used in lots of other places.

# monotonic increasing
engine = engine_type(lambda: arr, len(arr))
assert engine.is_monotonic_increasing is True
assert engine.is_monotonic_decreasing is False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are adding a ton of tests here which is good. but i suspect a lot of this indexing on the engines is already tested elsewhere, can you remove it where appropriate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Int(32|16|8)Engine, UInt(32|16|8)Engine and Float(32|16|8)Engine are all new, and of these new ones, only Int(32|16|8)Engine are used on code elsewhere (In CategoricalIndex).

This means that unexpected failures could happen in those unused engines...

I could remove tests for Int(8|16|32|64)Engine and (UInt|Float)64Engine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not what I mean. I think we have explict test for these engines elsewhere, see if you can find them an dconsolidate them here. You have a nice set of unit tests, but we want to avoid some duplicaton elsehwere

Copy link
Contributor Author

@topper-123 topper-123 Oct 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't find any tests explicitly for these engines. Lots of tests for e.g. Index.is_monotonic etc. but that's a differerent set of test IMO and we should always have tests for the public APIs.

I've searched for "_engine" and "engine" in the pandas/tests directory, but I'm coming up short. I've found tests for various parser engines, but I can't find any for indexing engines (except one insignificant one in multi/test_contains.py).

If you really think there are tests for indexing engines, could you point me to an example?

@jreback jreback merged commit f79e174 into pandas-dev:master Oct 26, 2018
@jreback
Copy link
Contributor

jreback commented Oct 26, 2018

thanks @topper-123

@topper-123 topper-123 deleted the CategoricalIndex.get_loc_II branch October 26, 2018 01:50
thoo added a commit to thoo/pandas that referenced this pull request Oct 26, 2018
…_pr2

* repo_org/master:
  DOC: Add docstring validations for "See Also" section (pandas-dev#23143)
  TST: Fix test assertion (pandas-dev#23357)
  BUG: Handle Period in combine (pandas-dev#23350)
  REF: SparseArray imports (pandas-dev#23329)
  CI: Migrate some CircleCI jobs to Azure (pandas-dev#22992)
  DOC: update the is_month_start/is_month_end docstring (pandas-dev#23051)
  Partialy fix issue pandas-dev#23334 - isort pandas/core/groupby directory (pandas-dev#23341)
  TST: Add base test for extensionarray setitem pandas-dev#23300 (pandas-dev#23304)
  API: Add sparse Acessor (pandas-dev#23183)
  PERF: speed up CategoricalIndex.get_loc (pandas-dev#23235)
thoo added a commit to thoo/pandas that referenced this pull request Oct 27, 2018
…ndas

* repo_org/master: (23 commits)
  DOC: Add docstring validations for "See Also" section (pandas-dev#23143)
  TST: Fix test assertion (pandas-dev#23357)
  BUG: Handle Period in combine (pandas-dev#23350)
  REF: SparseArray imports (pandas-dev#23329)
  CI: Migrate some CircleCI jobs to Azure (pandas-dev#22992)
  DOC: update the is_month_start/is_month_end docstring (pandas-dev#23051)
  Partialy fix issue pandas-dev#23334 - isort pandas/core/groupby directory (pandas-dev#23341)
  TST: Add base test for extensionarray setitem pandas-dev#23300 (pandas-dev#23304)
  API: Add sparse Acessor (pandas-dev#23183)
  PERF: speed up CategoricalIndex.get_loc (pandas-dev#23235)
  fix and test incorrect case in delta_to_nanoseconds (pandas-dev#23302)
  BUG: Handle Datetimelike data in DataFrame.combine (pandas-dev#23317)
  TST: re-enable gbq tests (pandas-dev#23303)
  Switched references of App veyor to azure pipelines in the contributing CI section (pandas-dev#23311)
  isort imports-io (pandas-dev#23332)
  DOC: Added a Multi Index example for the Series.sum method (pandas-dev#23279)
  REF: Make PeriodArray an ExtensionArray (pandas-dev#22862)
  DOC: Added Examples for Series max (pandas-dev#23298)
  API/ENH: tz_localize handling of nonexistent times: rename keyword + add shift option (pandas-dev#22644)
  BUG: Let MultiIndex.set_levels accept any iterable (pandas-dev#23273) (pandas-dev#23291)
  ...
thoo added a commit to thoo/pandas that referenced this pull request Oct 27, 2018
…xamples

* repo_org/master: (83 commits)
  DOC: Add docstring validations for "See Also" section (pandas-dev#23143)
  TST: Fix test assertion (pandas-dev#23357)
  BUG: Handle Period in combine (pandas-dev#23350)
  REF: SparseArray imports (pandas-dev#23329)
  CI: Migrate some CircleCI jobs to Azure (pandas-dev#22992)
  DOC: update the is_month_start/is_month_end docstring (pandas-dev#23051)
  Partialy fix issue pandas-dev#23334 - isort pandas/core/groupby directory (pandas-dev#23341)
  TST: Add base test for extensionarray setitem pandas-dev#23300 (pandas-dev#23304)
  API: Add sparse Acessor (pandas-dev#23183)
  PERF: speed up CategoricalIndex.get_loc (pandas-dev#23235)
  fix and test incorrect case in delta_to_nanoseconds (pandas-dev#23302)
  BUG: Handle Datetimelike data in DataFrame.combine (pandas-dev#23317)
  TST: re-enable gbq tests (pandas-dev#23303)
  Switched references of App veyor to azure pipelines in the contributing CI section (pandas-dev#23311)
  isort imports-io (pandas-dev#23332)
  DOC: Added a Multi Index example for the Series.sum method (pandas-dev#23279)
  REF: Make PeriodArray an ExtensionArray (pandas-dev#22862)
  DOC: Added Examples for Series max (pandas-dev#23298)
  API/ENH: tz_localize handling of nonexistent times: rename keyword + add shift option (pandas-dev#22644)
  BUG: Let MultiIndex.set_levels accept any iterable (pandas-dev#23273) (pandas-dev#23291)
  ...
tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request Nov 19, 2018
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PERF: df.loc is 100x slower for CategoricalIndex than for normal Index
5 participants