API/BUG: always try to operate inplace when setting with loc/iloc[foo, bar] #39163

Merged
jreback merged 17 commits into pandas-dev:master from the api-setitem-inplace branch on Mar 5, 2021

Conversation

jbrockmendel
Member

(The PR title should have the caveat "in cases that make it to BlockManager.setitem".)

Discussed briefly on today's call. The main idea, as discussed in #38896 and #33457, is that df[col] = ser should not alter the existing df[col]._values, while df.iloc[:, num] = ser should always try to operate inplace before falling back to casting. This PR focuses on the latter case.

ATM, restricting to cases where we (a) get to Block.setitem, (b) could do the setitem inplace, and (c) are setting all the entries in the array, we have 4 different behaviors. Examples are posted at the bottom of this post.

This PR changes Block.setitem so that in the _can_hold_element case we always operate inplace and always retain the same underlying array.
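For illustration, a minimal sketch (arbitrary column name and values, not from the PR itself) contrasting the two assignments; the iloc half shows the behavior this PR is aiming for:

import numpy as np
import pandas as pd

# df[col] = ser: the column is replaced; the old array is left untouched.
df = pd.DataFrame({"A": np.arange(5, dtype=np.float64)})
old = df["A"].values
df["A"] = np.arange(5, 10)
assert (old == np.arange(5)).all()       # original values unchanged

# df.iloc[:, num] = ser: with this PR, values that fit the existing dtype
# (_can_hold_element) are written into the existing array.
df = pd.DataFrame({"A": np.arange(5, dtype=np.float64)})
old = df["A"].values
df.iloc[:, 0] = np.arange(5, 10)
assert (old == np.arange(5, 10)).all()   # existing array was overwritten in place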


Existing Behavior

  1. If the new_values being set are categorical, we overwrite the existing values and then discard them, and we do not get a view on new_values. (This can only be replicated with Series until "BUG: setting categorical values into object dtype DataFrame" #39136 is fixed.)
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(5), dtype=object)
cat = pd.Categorical(ser)[::-1]

vals = ser.values
ser.iloc[ser.index] = cat

assert (vals == cat).all()            # <-- vals are overwritten
assert ser.dtype == cat.dtype   # <-- vals are discarded
assert not np.shares_memory(ser._values.codes, cat.codes)  # <-- we also don't get a view on the new values
  2. If the new_values are any other EA, we do not overwrite the existing values and do get a view on the new_values.
df = pd.DataFrame({"A": np.arange(5, dtype=np.int32)})
arr = pd.array(df["A"].values) + 1

vals = df.values
df.iloc[df.index, 0] = arr

assert (df.dtypes == arr.dtype).all()      # <-- cast
assert not (vals == df).any(axis=None)    # <-- did not overwrite original
  3. If the new_values are a new non-EA dtype, we overwrite the old values and create a new array, getting a view on neither.
import pandas._testing as tm

df = tm.makeDataFrame()  #  <-- float64
old = df.values
new = np.arange(df.size).astype(np.int16).reshape(df.shape)
df.iloc[:, [0, 1, 2, 3]] = new

assert (old == new).all()
assert not np.shares_memory(df.values, old)
assert not np.shares_memory(df.values, new)
  4. If the new_values have the same dtype as the existing values, we overwrite the existing values and keep the same array.
df = tm.makeDataFrame()  #  <-- float64
old = df.values
new = np.arange(df.size).astype(np.float64).reshape(df.shape)
df.iloc[:, [0, 1, 2, 3]] = new

assert (old == new).all()
assert np.shares_memory(df.values, old)
assert not np.shares_memory(df.values, new)

@jbrockmendel
Member Author

One potential issue here is that we don't have a nice way of doing a not-inplace df.iloc[:, i] = ser (if df.columns is unique we can do df[df.columns[i]] = ser)
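For context, a rough sketch of that label-based workaround (arbitrary data; only valid when df.columns is unique):

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": np.arange(5, dtype=np.float64)})
ser = pd.Series(np.arange(5, dtype=np.int64))
i = 0

# Positional set: with this PR, writes into the existing float64 array.
df.iloc[:, i] = ser

# Not-inplace alternative: look up the label and assign the column,
# which swaps in a new array (and here changes the dtype to int64).
df[df.columns[i]] = ser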

@jreback
Contributor

jreback commented Jan 20, 2021

this lgtm. can you merge master and add a whatsnew sub-section (that we can update later for other issues). this is very subtle and we need to make it clear what is happening.

@jbrockmendel
Member Author

rebased + green

Contributor

@jreback left a comment

lgtm. any additional comments? cc @pandas-dev/pandas-core

@jorisvandenbossche
Member

Only looked at the whatsnew note for now, will try to take a look at the actual code tomorrow.

I find the whatsnew note a bit hard to follow. Currently it focuses on how the change affects a potential view on the underlying data. But that's already quite an advanced use case that I think many users won't follow. Doesn't the change also have an impact on the actual dtypes (something more visible)? E.g. setting ints in a float column now preserves the float dtype (according to your example in the top post)? Starting the whatsnew note with such an example might make it easier to grasp what it's about.
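For example, a small sketch of that dtype aspect (arbitrary values; the comments describe the behavior as outlined in the top post):

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.5, 2.5, 3.5]})   # float64 column

df.iloc[:, 0] = np.array([1, 2, 3])         # set integer values positionally

# With the always-inplace rule the integers are written into the existing
# float64 array, so df["a"] keeps its float64 dtype instead of being cast
# to the dtype of the new values.
print(df.dtypes)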

@jbrockmendel
Member Author

Updated the whatsnew; is that clearer, @jorisvandenbossche?

@jorisvandenbossche
Member

Thanks, yes, I think that's clearer!

@jorisvandenbossche
Member

Looking at the changed tests, I think one potentially problematic aspect is that object dtype now gets preserved when you start from an "empty" (all-NaN, object dtype) DataFrame.
That's mainly a limitation of how we create empty DataFrames, but still a breaking change.
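A sketch of what that looks like from the user side (assuming the new inplace rule also applies to the all-NaN object block):

import pandas as pd

# "Empty" frame: all-NaN, object dtype
df = pd.DataFrame(index=[0, 1], columns=["a"])

df.iloc[:, 0] = 1

# Previously the all-NaN object column was replaced and its dtype inferred
# from the new values (int64); with the always-inplace rule the integers are
# written into the existing object array, so the dtype stays object.
print(df.dtypes)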

@jbrockmendel
Member Author

when you start from an "empty" (all-NaN object dtype) DataFrame.

Just checking, when you say "empty", you don't mean df.size == 0?

What would you want to do here? Could special-case the all-NaN-object case, I guess.

@jreback
Contributor

jreback commented Mar 4, 2021

looks fine, can you rebase and ping on green

@jorisvandenbossche
Member

when you start from an "empty" (all-NaN object dtype) DataFrame.

Just checking, when you say "empty", you don't mean df.size == 0?

What would you want to do here? Could special-case the all-NaN-object case, I guess.

Sorry for the late answer here. But yes, not size=0, but all-NaN (e.g. df = pd.DataFrame(index=.., columns=..)).

That might indeed require special casing all-NaN object.

@jreback jreback merged commit 527c789 into pandas-dev:master Mar 5, 2021
@jreback
Contributor

jreback commented Mar 5, 2021

thanks @jbrockmendel

@jbrockmendel jbrockmendel deleted the api-setitem-inplace branch March 5, 2021 00:15
@jorisvandenbossche
Member

@jbrockmendel can you do the all-NaN object case as a follow-up?

@jreback yes, I know, I just commented and didn't mark it with "request changes", but it would be good if you could read the last comment before merging.

@jreback
Contributor

jreback commented Mar 5, 2021

@jorisvandenbossche i would suggest you request changes

we have a lot of PRs

@jbrockmendel
Member Author

can you do the all-NaN object case as a follow-up?

sure. let's double-check we're on the same page about what the issue is. The example case is pd.DataFrame(index=..., columns=...), which is an all-NA, single-block DataFrame. Should this special treatment cover any other cases, e.g. an all-NA column in a not-all-NA DataFrame?

@jorisvandenbossche
Member

Should this special treatment cover any other cases? e.g. an all-NA column in a not-all-NA DataFrame?

Yes, if the column being set is all-NA object dtype (regardless of whether other columns are all-NA or not), it got inferred to a new dtype:

In [7]: df = pd.DataFrame(index=[0, 1], columns=['a', 'b'])

In [8]: df.loc[:, 'a'] = 1

In [9]: df.loc[:, 'b'] = pd.Timestamp("2012-01-01")

In [10]: df.dtypes
Out[10]: 
a             int64
b    datetime64[ns]
dtype: object

@jbrockmendel
Member Author

I'm OK with this special case. Any objections @jreback @TomAugspurger @phofl before I implement?

Labels
Indexing (Related to indexing on series/frames, not to indexes themselves)
Development

Successfully merging this pull request may close these issues.

API: setitem copy/view behavior ndarray vs Categorical vs other EA