Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Follow-up #20347: incorporate review about _get_series_list #20923

Merged
merged 3 commits into from
May 4, 2018

Conversation

h-vetinari
Copy link
Contributor

@h-vetinari h-vetinari commented May 2, 2018

Closes #20922

recap of #20347 :
Enabling alignment on top of the (v.0.22-)legal list of lists was relatively complex. The allowed argument types implemented in #20347 are as follows:

Type of (element of) "others"        | alone | within list-like
---------------------------------------------------------------------
pd.Series                            |  yes  | yes
pd.Index                             |  yes  | yes
np.ndarray (1-dim)                   |  yes  | yes
DataFrame                            |  yes  | no
np.ndarray (2-dim)                   |  yes  | no
iterators, dict-views, etc.          |  yes  | **
anything else w/ is_list_like = True |  yes  | **

** to be permitted, list-likes (say nxt) within a list-like others must pass (essentially)
all(not is_list_like(x) for x in nxt)
and not be a DF.

Open review points from #20347 from @jreback (and my responses) reproduced in review comments.

PS. Managed to have a brainfart and there's a bug in the RC docs - very sorry. Fixed in this PR.

Copy link
Contributor Author

@h-vetinari h-vetinari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Responding to review of @jreback

# without copy, this could change "others"
# that was passed to str.cat
others = others.copy()
others.index = idx
return ([others[x] for x in others], fu_wrn)
return ([others[x] for x in others], warn)
elif isinstance(others, np.ndarray) and others.ndim == 2:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback :

this is wrong
i don’t think we can align a ndarray at all like this
let’s can ndarray a that are > 1 dim

My response:
The DF-constructor works as expected for a 2-dim ndarray, but I haven't checked if this is tested behaviour. (essentially, df == DataFrame(df.values, columns=df.columns, index=df.index))

I would suggest not to can 2-dim ndarrays, because they are necessary to avoid alignment on the deprecation path for join:

[...] To disable alignment (the behavior before v.0.23) and silence this warning, use .values on any Series/Index/DataFrame in others. [...]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a test for this, basically see what happens when I have a non-default index with the Series (e.g. 2,3, 4) or something and it gets aligned with the 0,1,2 of the ndarray-converted-to-DataFrame. It will 'work' but is wrong.

Copy link
Contributor Author

@h-vetinari h-vetinari May 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a general comment before all the cases for others because it seems this point wasn't clear:

# Generally speaking, all objects without an index inherit the index
# `idx` of the calling Series/Index - i.e. must have matching length.
# Objects with an index (i.e. Series/Index/DataFrame) keep their own
# index, *unless* ignore_index is set to True.

So, the case you mention does not happen, because an ndarray is always automatically aligned with the calling Series (of course, the lengths must match for this to work).

There are several tests with objects with different indexes, both with join is None (so with force-realign; successful but raising a warning), as well as with the different join-types.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i c, ok then

elif isinstance(others, np.ndarray) and others.ndim == 2:
others = DataFrame(others, index=idx)
return ([others[x] for x in others], False)
elif is_list_like(others):
others = list(others) # ensure iterators do not get read twice etc
if all(is_list_like(x) for x in others):
los = []
fu_wrn = False
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback :

can u name this parameter just warn

Done

fu_wrn = fu_wrn or fwn
return (los, fu_wrn)
# test if there is a mix of list-like and non-list-like (e.g. str)
elif (any(is_list_like(x) for x in others)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback

you can make this simpler by just checking for all is not list like (eg strings)
anything else will fail thru to the TypeError

Done

while others:
nxt = others.pop(0) # list-like as per check above
# safety for iterators and other non-persistent list-likes
Copy link
Contributor Author

@h-vetinari h-vetinari May 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback :

this whole section needs some work it’s way too hard to read and follow

I tried to comment in detail what's happening - which part is unclear?

is_legal = ((no_deep and nxt.dtype == object)
or all((isinstance(x, compat.string_types)
or (not is_list_like(x) and isnull(x))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback :

isnull already checks for None
only 1d objects are valid here (or all scalars)
do this check up front

I simplified the check, but I can't put it any further upfront (unless I remove the fast track). Need to convert iterators first, and the checking for fast-track for Series/Index etc. also needs to be done before

@h-vetinari
Copy link
Contributor Author

@TomAugspurger from #20922

Some requests for simplification. Most of the complexity seems to be in the code inferring what kind of regime we're in (one or more alignable objects vs. one or more unalignable objects vs. mix). Being strict here (after deprecating) or requiring additional user input may be worthwhile from a maintenance perspective.

What do you mean by being strict. Could you maybe edit the table from above to reflect what you think should be allowed?

@h-vetinari
Copy link
Contributor Author

h-vetinari commented May 2, 2018

@TomAugspurger
Since I saw you're doing an RC2 due to #20924, would you consider merging this PR (or a separate one) for fixing the embarassing (for me) doc-error?

This PR will (almost certainly) be fully green in ~1h.

@TomAugspurger
Copy link
Contributor

Nope, just doing build stuff right now.

@codecov
Copy link

codecov bot commented May 2, 2018

Codecov Report

Merging #20923 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #20923      +/-   ##
==========================================
- Coverage   91.81%   91.81%   -0.01%     
==========================================
  Files         153      153              
  Lines       49471    49470       -1     
==========================================
- Hits        45422    45421       -1     
  Misses       4049     4049
Flag Coverage Δ
#multiple 90.21% <100%> (-0.01%) ⬇️
#single 41.84% <0%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/strings.py 98.62% <100%> (-0.01%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ce4ab82...3bf38a0. Read the comment docs.

@jreback jreback added Strings String extension data type and string data Clean labels May 3, 2018
# without copy, this could change "others"
# that was passed to str.cat
others = others.copy()
others.index = idx
return ([others[x] for x in others], fu_wrn)
return ([others[x] for x in others], warn)
elif isinstance(others, np.ndarray) and others.ndim == 2:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a test for this, basically see what happens when I have a non-default index with the Series (e.g. 2,3, 4) or something and it gets aligned with the 0,1,2 of the ndarray-converted-to-DataFrame. It will 'work' but is wrong.

@jreback
Copy link
Contributor

jreback commented May 3, 2018

other than my comment above the cleanup looks good.

Copy link
Contributor Author

@h-vetinari h-vetinari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback. I think the point you mention is a misunderstanding, but I pushed another commit to make it clearer what's happening.

# without copy, this could change "others"
# that was passed to str.cat
others = others.copy()
others.index = idx
return ([others[x] for x in others], fu_wrn)
return ([others[x] for x in others], warn)
elif isinstance(others, np.ndarray) and others.ndim == 2:
Copy link
Contributor Author

@h-vetinari h-vetinari May 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a general comment before all the cases for others because it seems this point wasn't clear:

# Generally speaking, all objects without an index inherit the index
# `idx` of the calling Series/Index - i.e. must have matching length.
# Objects with an index (i.e. Series/Index/DataFrame) keep their own
# index, *unless* ignore_index is set to True.

So, the case you mention does not happen, because an ndarray is always automatically aligned with the calling Series (of course, the lengths must match for this to work).

There are several tests with objects with different indexes, both with join is None (so with force-realign; successful but raising a warning), as well as with the different join-types.

Copy link
Contributor Author

@h-vetinari h-vetinari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before we close this, maybe this question from the last PR merits some further discussion. @TomAugspurger @jreback @jorisvandenbossche

@@ -311,7 +311,7 @@ All one-dimensional list-likes can be arbitrarily combined in a list-like contai

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another point of @TomAugspurger in #20347, responding to the line
All one-dimensional list-likes can be arbitrarily combined in a list-like container (including iterators, ``dict``-views, etc.):
(which is just above this line pointer):

@TomAugspurger

question: what happens with a dict? Do we use the keys, or does it transform to a series and get aligned?

Response 1:

I was talking about d.keys() or d.values(). For passing a dict directly, currently the keys would get read (as that's what x in d would return).

Response 2:

I could of course add another safety that maps d to d.values().
However, I tend to think that this could be left to the python-skills of the user as well -- a dictionary is not "list-like" from the POV of normal python usage (whereas its keys and values are, see d = dict(zip(keys, values))).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a dict is treated consistently here how we handle list-likes else where, we effectively call list on it

@@ -311,7 +311,7 @@ All one-dimensional list-likes can be arbitrarily combined in a list-like contai

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a dict is treated consistently here how we handle list-likes else where, we effectively call list on it

# without copy, this could change "others"
# that was passed to str.cat
others = others.copy()
others.index = idx
return ([others[x] for x in others], fu_wrn)
return ([others[x] for x in others], warn)
elif isinstance(others, np.ndarray) and others.ndim == 2:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i c, ok then

@jreback jreback added this to the 0.23.0 milestone May 3, 2018
@h-vetinari
Copy link
Contributor Author

a dict is treated consistently here how we handle list-likes else where, we effectively call list on it

In that case, I think this PR is good to go.

@h-vetinari
Copy link
Contributor Author

@jreback @TomAugspurger @jorisvandenbossche
re: treatment of dicts:
FWIW, concat does a very similar kind of operation to str.cat, and does treat dicts differently (than calling list on them):

d = {'a' : pd.Series([0, 1]), 'b': pd.Series(['c', 'd'])}
pd.concat(d, axis=1)
#    a  b
# 0  0  c
# 1  1  d
pd.concat(d.values(), axis=1)
#    0  1
# 0  0  c
# 1  1  d
pd.concat(list(d), axis=1)
# TypeError

OTOH, I'm honestly fine to leave this PR as it is. Special treatment for dicts should be a separate issue, IMHO.

@jreback jreback merged commit ef019fa into pandas-dev:master May 4, 2018
@jreback
Copy link
Contributor

jreback commented May 4, 2018

thanks @h-vetinari good followup!

as far as dicts go, certainly welcome clarification if needed (PR / issue all ok), or can just wait and see if that usecase is seen in real life

@h-vetinari
Copy link
Contributor Author

@jreback
Glad this PR is done, TBH. Thanks for all the reviews and feedback!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Clean Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants