Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: make is_list_like handle non iterable numpy-like arrays correctly #35127

Closed
wants to merge 12 commits into from

Conversation

znicholls
Copy link
Contributor

@znicholls znicholls commented Jul 5, 2020

@pep8speaks
Copy link

pep8speaks commented Jul 5, 2020

Hello @znicholls! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-02-13 08:43:11 UTC

@znicholls znicholls changed the title Non iterable duck arrays is_list_like correctly handles non iterable duck arrays Jul 5, 2020
@znicholls znicholls changed the title is_list_like correctly handles non iterable duck arrays is_list_like correctly handles non iterable numpy-like arrays Jul 5, 2020
@znicholls znicholls changed the title is_list_like correctly handles non iterable numpy-like arrays make is_list_like handle non iterable numpy-like arrays correctly Jul 5, 2020
@znicholls znicholls force-pushed the non-iterable-duck-arrays branch 2 times, most recently from c5d9c17 to ca94b7d Compare July 5, 2020 23:11
@znicholls znicholls marked this pull request as ready for review July 5, 2020 23:11
pandas/_libs/lib.pyx Outdated Show resolved Hide resolved
@jbrockmendel
Copy link
Member

This is likely to have a non-trivial effect on performance. It might be time to implement a separate is_list_like (and possibly is_scalar) with less-strict/EA-customizable behavior (cc @jorisvandenbossche I know this has come up w/r/t geopandas)

Can you identify a subset of is_list_like usages where we need the less-strict behavior?

@znicholls
Copy link
Contributor Author

Can you identify a subset of is_list_like usages where we need the less-strict behavior?

I can check with the other pint-pandas developers.

Can I also ask which benchmarks I should look at to examine whether performance has taken a hit?

@keewis
Copy link
Contributor

keewis commented Jul 6, 2020

I'm not completely sure why, but reverting here for simplicity

yeah, sorry, getattr only checks for AttributeError but pint raises TypeError

@znicholls
Copy link
Contributor Author

yeah, sorry, getattr only checks for AttributeError but pint raises TypeError

I think it would have to be getattr(obj, "ndim", 1) != 0 i.e. you assume everything is iterable unless you have evidence otherwise.

@znicholls
Copy link
Contributor Author

znicholls commented Jul 6, 2020

I think the test failures are unrelated to our change?

# CI / Web and docs failure text (I think this is the relevant bit)
...
WARNING: autodoc: failed to import method 'sparse.from_spmatrix' from module 'DataFrame'; the following exception was raised:
No module named 'DataFrame'
WARNING: autodoc: failed to import method 'sparse.to_coo' from module 'DataFrame'; the following exception was raised:
No module named 'DataFrame'
WARNING: autodoc: failed to import method 'sparse.to_dense' from module 'DataFrame'; the following exception was raised:
No module named 'DataFrame'
WARNING: autodoc: failed to import method 'sparse.from_coo' from module 'Series'; the following exception was raised:
No module named 'Series'
WARNING: autodoc: failed to import method 'sparse.to_coo' from module 'Series'; the following exception was raised:
No module named 'Series'
...

@znicholls znicholls changed the title make is_list_like handle non iterable numpy-like arrays correctly ENH: make is_list_like handle non iterable numpy-like arrays correctly Jul 6, 2020
@jorisvandenbossche
Copy link
Member

This is likely to have a non-trivial effect on performance.

To have an idea about this, I timed the following cases:

a1 = [1, 2, 3]
a2 = np.array([1, 2, 3])
a3 = np.array([1, 2, 3])[0] 
In [2]: %timeit pd.api.types.is_list_like(a1)      
390 ns ± 1.56 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)   # master  
553 ns ± 1.77 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)   # PR

In [3]: %timeit pd.api.types.is_list_like(a2) 
405 ns ± 0.972 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)   # master
408 ns ± 0.745 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)   # PR

In [4]: %timeit pd.api.types.is_list_like(a3)  
395 ns ± 1.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)   # master
411 ns ± 1.55 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)    PR

So for numpy arrays the slowdown is only very limited (I assume because in this case we needed to look up obj.ndim anyway, so then then getattr(obj, "ndim") seems to be similar). For actual lists, the slowdown is more significant, though (but I would think the array cases are most relevant for performance).

@jorisvandenbossche
Copy link
Member

It might be time to implement a separate is_list_like (and possibly is_scalar) with less-strict/EA-customizable behavior

It's not only EAs, though, but eg also for object dtype (see #26333, #27911)

But indeed, a less strict is_list_like check depending on the dtype of the involved data might be worth exploring, although not that easy I think. The first thing would be to identify the cases where this less strict case is actually important. I suppose the problems in pint-pandas mainly occur in indexing + arithmetic/comparison operations ?

@andrewgsavage
Copy link

andrewgsavage commented Jul 6, 2020

These were the tests which started passing when deleting the __iter__ method from the scalar class to make is_list_like return False:

  • TestGroupby::test_groupby_apply_identity
  • TestMethods::test_fillna_copy_frame
  • TestMethods::test_fillna_copy_series
  • TestMissing::test_fillna_scalar
  • TestMissing::test_fillna_series
  • TestMissing::test_fillna_frame
  • TestReshaping::test_unstack[series-index1]
  • TestReshaping::test_unstack[series-index3]
  • TestReshaping::test_unstack[frame-index1]
  • TestReshaping::test_unstack[frame-index3]
  • TestSetitem::test_setitem_mask_broadcast[loc]
  • TestSetitem::test_setitem_sequence_broadcasts[True]
  • TestSetitem::test_setitem_integer_array[True-list]
  • TestSetitem::test_setitem_integer_array[True-integer-array]
  • TestSetitem::test_setitem_integer_array[True-numpy-array]
  • TestSetitem::test_setitem_slice[True]
  • TestSetitem::test_setitem_loc_iloc_slice

@znicholls
Copy link
Contributor Author

znicholls commented Jul 7, 2020

It looks like using numpy.iterable speeds up the list and array cases but not the numpy scalar one (which I think makes sense as you have to do some more thinking on that one). (This comment is pending tests passing...)

import numpy as np
import pandas as pd

a1 = [1, 2, 3]
a2 = np.array([1, 2, 3])
a3 = np.array([1, 2, 3])[0]


%timeit pd.api.types.is_list_like(a1)
289 ns ± 16.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # PR
447 ns ± 7.82 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # master

%timeit pd.api.types.is_list_like(a2)
302 ns ± 10.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # PR
466 ns ± 4.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # master

%timeit pd.api.types.is_list_like(a3)
596 ns ± 2.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # PR
440 ns ± 1.68 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # master

@znicholls
Copy link
Contributor Author

znicholls commented Jul 7, 2020

Can you identify a subset of is_list_like usages where we need the less-strict behavior?

Having dug a bit deeper, we're going to need to back to the drawing board on this. My impression is that in order to be handled by is_list_like, is_scalar and item_from_zerodim correctly, the class has to basically be a numpy array (or sub-class thereof). I'll go back to the pint-pandas devs and we can discuss how we go forward there, including identifying the cases where a less strict is_list_like, is_scalar and item_from_zerodim would solve our problems.

I'm happy to close this in the meantime. However it would be great if we could reopen #22582 to at least add pint-pandas to the docs (buggy as it is).

@hgrecco
Copy link

hgrecco commented Jul 7, 2020

I would be nice to talk to geopandas devs to see if we can join forces. We might be facing the same problem.

@jorisvandenbossche
Copy link
Member

@hgrecco you are alread talking to them ;) (well, at least, I am one of the geopandas devs)
See also my comment at #35131 (comment): although very much related, I think it's not exactly the same. As for GeoPandas, the scalar elements we put in a Series are actually iterable (in contrast to the (duck-typed) numpy scalars, which are iterable in theory (have the dunder method), but raise once you try to iterate).

@jorisvandenbossche
Copy link
Member

It looks like using numpy.iterable speeds up the list and array cases

It seems that np.iterable is actually just a try/except of iter(value):

    try:
        iter(y)
    except TypeError:
        return False
    return True

(if we decide to use this, I would prefer doing the code above explicitly instead of using np.iterable, as I think most people are not familiar with that numpy method, and the code is simple).

But so basically it is a choosing between isinstance(value, Iterable) + special case check to exclude ndim==0 vs try/except of iter(value) (not fully sure if there are other consequences of the choice between both)

@hgrecco
Copy link

hgrecco commented Jul 7, 2020

@jorisvandenbossche thanks for the clarification.

@znicholls Have you tried reinstating getattr(obj, "ndim", 1) != 0 and moving it first? Python's and/or are shortcircuit operators so we might want to put first those checks which are fast and likely to return false.

@znicholls
Copy link
Contributor Author

Reverting the changes as suggested by @hgrecco leads to a slowdown for lists, no change for numpy arrays and a speedup for numpy scalars (essentially the reverse of previously)

import numpy as np
import pandas as pd

a1 = [1, 2, 3]
a2 = np.array([1, 2, 3])
a3 = np.array([1, 2, 3])[0]


%timeit pd.api.types.is_list_like(a1)
640 ns ± 6.82 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # PR
447 ns ± 7.82 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # master

%timeit pd.api.types.is_list_like(a2)
414 ns ± 17.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # PR
466 ns ± 4.96 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # master

%timeit pd.api.types.is_list_like(a3)
385 ns ± 10.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) #PR
440 ns ± 1.68 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # master

@znicholls
Copy link
Contributor Author

We still have the is_scalar and item_from_zerodim issues of course...

@znicholls
Copy link
Contributor Author

@jbrockmendel I went through and identified where different is_list_like and is_scalar would be needed, see master...znicholls:handle-extension-arrays. Is that helpful?

@znicholls znicholls marked this pull request as draft July 8, 2020 03:34
@znicholls
Copy link
Contributor Author

The backstory is #35131. Hopefully that explains why we are looking to make such a change? (@jorisvandenbossche just double checking no progress has been made elsewhere that I've missed?)

A concern about performance impacts was raised in this comment #35127 (comment). After some discussion, we arrived at #35127 (comment).

For completeness, we also did some checks about performance penalties, e.g. #35127 (comment), #35127 (comment) and #35127 (comment). None of our solutions was perfect.

I am not sure what your proposed implementation even looks like.

That makes sense, because I am also unsure. I've only got to step 1 of #35127 (comment). Based on the 'hot spots' identified in master...znicholls:handle-extension-arrays, I am looking for advice about whether we could use less strict (potentially slower) implementations of is_list_like in these places.

If we can use less strict implementations, my only proposed implementation is the changes made in this PR i.e. master...znicholls:non-iterable-duck-arrays. In short, the less strict implementation would check for objects which are numpy-style zero-dimensional arrays with not (hasattr(obj, "ndim") and obj.ndim == 0) rather than not (util.is_array(obj) and obj.ndim == 0) because the latter only works for numpy arrays and doesn't work for numpy-style arrays (such as Pint).

@jreback
Copy link
Contributor

jreback commented Feb 11, 2021

@znicholls what is the status here

@znicholls
Copy link
Contributor Author

znicholls commented Feb 11, 2021

what is the status here

I would greatly appreciate your (or another pandas developer) thoughts on #35127 (comment). I'm not sure I can sum it up any more succinctly than what is in that comment.

@jreback
Copy link
Contributor

jreback commented Feb 11, 2021

If we can use less strict implementations, my only proposed implementation is the changes made in this PR i.e. master...znicholls:non-iterable-duck-arrays. In short, the less strict implementation would check for objects which are numpy-style zero-dimensional arrays with not (hasattr(obj, "ndim") and obj.ndim == 0) rather than not (util.is_array(obj) and obj.ndim == 0) because the latter only works for numpy arrays and doesn't work for numpy-style arrays (such as Pint).

i think this is reasonable. of course asv's will tell us whether its an acceptable perf hit. trying to balance additional complexity (seems minor in this case) with perf vs enhancements.

@znicholls
Copy link
Contributor Author

i think this is reasonable. of course asv's will tell us whether its an acceptable perf hit. trying to balance additional complexity (seems minor in this case) with perf vs enhancements.

Ok great thank you, I will start on that path then.

@znicholls
Copy link
Contributor Author

@jreback I have tried to start small with #39790, if you could take a look that would be great

@mroeschke
Copy link
Member

Thanks for the PR draft, but it appears this draft has stalled in development. Probably best if this change starts in a smaller PR with continuation of #39790 (let us know if you'd like to continue in that PR. Closing.

@burnpanck
Copy link
Contributor

What exactly is it that is missing here? There seems to be quite a number of issues in many downstream projects, all stalling on #35131. I myself would love to see pint-pandas move forward and would be willing to picking up this PR.

If I understand correctly, the main concern with the proposed solution here is the slight performance regression. To me, that then asks for performance re-optimisation rather than introducing two different versions of is_list_like that do slightly different things - the latter is calling for all kinds of subtle bugs. The size of the diff in #39830 (2500 changed lines) with the solution presented here (150 changed lines) also speaks for this approach here.

As for the performance re-optimisation: It seems there is no regression on actual numpy arrays, only on real lists. Under the assumption that real lists are the only relevant non-array types, we might be able to get by with a simple short-cut test for actual lists. Would an iteration on this PR along these lines be appreciated?

@jreback
Copy link
Contributor

jreback commented Nov 25, 2021

@burnpanck like anything in pandas a proper well formed and tested PR would be considered

@znicholls
Copy link
Contributor Author

The size of the diff in #39830 (2500 changed lines) with the solution presented here (150 changed lines) also speaks for this approach here.

I just rebased #39830 so you can see the real diff. It's only 59 lines and only 1 in the actual codebase (the rest are all tests). https://github.com/pandas-dev/pandas/compare/master...znicholls:update-is-list-like?expand=1

Thanks for picking up @burnpanck, as you can tell I had a few goes but always hit a wall at some point

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Internals Related to non-user accessible pandas implementation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Identify numpy-like zero-dimensional arrays as non-iterable