API/BUG: always try to operate inplace when setting with loc/iloc[foo, bar] #39163

Merged
jreback merged 17 commits into pandas-dev:master from the api-setitem-inplace branch on Mar 5, 2021

Conversation

jbrockmendel
Member

(The PR title should have the caveat "in cases that make it to BlockManager.setitem".)

Discussed briefly on today's call. The main idea, as discussed in #38896 and #33457, is that df[col] = ser should not alter the existing df[col]._values, while df.iloc[:, num] = ser should always try to operate inplace before falling back to casting. This PR focuses on the latter case.

ATM, restricting to cases where we (a) get to Block.setitem, (b) could do the setitem inplace, and (c) are setting all the entries in the array, we have 4 different behaviors. Examples are posted at the bottom of this post.

This PR changes Block.setitem so that in the _can_hold_element case we always operate inplace and always retain the same underlying array.
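For illustration, a minimal sketch (arbitrary column name and values, not from the PR itself) contrasting the two assignments; the iloc half shows the behavior this PR is aiming for:

import numpy as np
import pandas as pd

# df[col] = ser: the column is replaced; the old array is left untouched.
df = pd.DataFrame({"A": np.arange(5, dtype=np.float64)})
old = df["A"].values
df["A"] = np.arange(5, 10)
assert (old == np.arange(5)).all()       # original values unchanged

# df.iloc[:, num] = ser: with this PR, values that fit the existing dtype
# (_can_hold_element) are written into the existing array.
df = pd.DataFrame({"A": np.arange(5, dtype=np.float64)})
old = df["A"].values
df.iloc[:, 0] = np.arange(5, 10)
assert (old == np.arange(5, 10)).all()   # existing array was overwritten in place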


Existing Behavior

  1. If the new_values being set are categorical, we overwrite the existing values and then discard them, and we do not get a view on new_values. (This can only be replicated with Series until "BUG: setting categorical values into object dtype DataFrame" #39136 is fixed.)
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(5), dtype=object)
cat = pd.Categorical(ser)[::-1]

vals = ser.values
ser.iloc[ser.index] = cat

assert (vals == cat).all()            # <-- vals are overwritten
assert ser.dtype == cat.dtype   # <-- vals are discarded
assert not np.shares_memory(ser._values.codes, cat.codes)  # <-- we also don't get a view on the new values
  2. If the new_values are any other EA, we do not overwrite the existing values and do get a view on the new_values.
df = pd.DataFrame({"A": np.arange(5, dtype=np.int32)})
arr = pd.array(df["A"].values) + 1

vals = df.values
df.iloc[df.index, 0] = arr

assert (df.dtypes == arr.dtype).all()      # <-- cast
assert not (vals == df).any(axis=None)    # <-- did not overwrite original
  3. If the new_values are a new non-EA dtype, we overwrite the old values and create a new array, getting a view on neither.
import pandas._testing as tm

df = tm.makeDataFrame()  #  <-- float64
old = df.values
new = np.arange(df.size).astype(np.int16).reshape(df.shape)
df.iloc[:, [0, 1, 2, 3]] = new

assert (old == new).all()
assert not np.shares_memory(df.values, old)
assert not np.shares_memory(df.values, new)
  4. If the new_values have the same dtype as the existing values, we overwrite the existing values and keep the same array.
df = tm.makeDataFrame()  #  <-- float64
old = df.values
new = np.arange(df.size).astype(np.float64).reshape(df.shape)
df.iloc[:, [0, 1, 2, 3]] = new

assert (old == new).all()
assert np.shares_memory(df.values, old)
assert not np.shares_memory(df.values, new)

@jbrockmendel
Member Author

One potential issue here is that we don't have a nice way of doing a not-inplace df.iloc[:, i] = ser (if df.columns is unique we can do df[df.columns[i]] = ser)
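For context, a rough sketch of that label-based workaround (arbitrary data; only valid when df.columns is unique):

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": np.arange(5, dtype=np.float64)})
ser = pd.Series(np.arange(5, dtype=np.int64))
i = 0

# Positional set: with this PR, writes into the existing float64 array.
df.iloc[:, i] = ser

# Not-inplace alternative: look up the label and assign the column,
# which swaps in a new array (and here changes the dtype to int64).
df[df.columns[i]] = ser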

@jreback
Contributor

jreback commented Jan 20, 2021

this lgtm. can you merge master and add a whatsnew sub-section (that we can update later for other issues). this is very subtle and we need to make it clear what is happening.

@jbrockmendel
Member Author

rebased + green

Contributor

@jreback left a comment

lgtm. any additional comments? cc @pandas-dev/pandas-core

@jorisvandenbossche
Member

Only looked at the whatsnew note for now, will try to take a look at the actual code tomorrow.

I find the whatsnew note a bit hard to follow. Currently it focuses on how the change affects a potential view on the underlying data. But that's already quite an advanced use case that I think many users won't follow. Doesn't the change also have an impact on the actual dtypes (something more visible)? E.g. setting ints in a float column now preserves the float dtype (according to your example in the top post)? Starting the whatsnew note with such an example might make it easier to grasp what it's about.
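For example, a small sketch of that dtype aspect (arbitrary values; the comments describe the behavior as outlined in the top post):

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.5, 2.5, 3.5]})   # float64 column

df.iloc[:, 0] = np.array([1, 2, 3])         # set integer values positionally

# With the always-inplace rule the integers are written into the existing
# float64 array, so df["a"] keeps its float64 dtype instead of being cast
# to the dtype of the new values.
print(df.dtypes)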

@jbrockmendel
Member Author

Updated the whatsnew; is that clearer, @jorisvandenbossche?

@jorisvandenbossche
Member

Thanks, yes, I think that's clearer!

@jorisvandenbossche
Member

Looking at the changed tests, I think one potentially problematic aspect is that object dtype now gets preserved when you start from an "empty" (all-NaN, object dtype) DataFrame.
That's mainly a limitation of how we create empty DataFrames, but still a breaking change.
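A sketch of what that looks like from the user side (assuming the new inplace rule also applies to the all-NaN object block):

import pandas as pd

# "Empty" frame: all-NaN, object dtype
df = pd.DataFrame(index=[0, 1], columns=["a"])

df.iloc[:, 0] = 1

# Previously the all-NaN object column was replaced and its dtype inferred
# from the new values (int64); with the always-inplace rule the integers are
# written into the existing object array, so the dtype stays object.
print(df.dtypes)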

@jbrockmendel
Member Author

when you start from an "empty" (all-NaN object dtype) DataFrame.

Just checking, when you say "empty", you don't mean df.size == 0?

What would you want to do here? Could special-case the all-NaN-object case, I guess.

@jreback
Contributor

jreback commented Mar 4, 2021

looks fine, can you rebase and ping on green

@jorisvandenbossche
Member

when you start from an "empty" (all-NaN object dtype) DataFrame.

Just checking, when you say "empty", you don't mean df.size == 0?

What would you want to do here? Could special-case the all-NaN-object case, I guess.

Sorry for the late answer here. But yes, not size=0, but all-NaN (e.g. df = pd.DataFrame(index=.., columns=..)).

That might indeed require special casing all-NaN object.

@jreback jreback merged commit 527c789 into pandas-dev:master Mar 5, 2021
@jreback
Contributor

jreback commented Mar 5, 2021

thanks @jbrockmendel

@jbrockmendel jbrockmendel deleted the api-setitem-inplace branch March 5, 2021 00:15
@jorisvandenbossche
Member

@jbrockmendel can you do the all-NaN object case as a follow-up?

@jreback yes, I know, I just commented and didn't mark it with "request changes", but it would be good if you could read the last comment before merging.

@jreback
Contributor

jreback commented Mar 5, 2021

@jorisvandenbossche i would suggest you request changes

we have a lot of PRs

@jbrockmendel
Member Author

can you do the all-NaN object case as a follow-up?

sure. let's double-check we're on the same page about what the issue is. The example case is pd.DataFrame(index=..., columns=...), which is an all-NA, single-block DataFrame. Should this special treatment cover any other cases, e.g. an all-NA column in a not-all-NA DataFrame?

@jorisvandenbossche
Member

Should this special treatment cover any other cases? e.g. an all-NA column in a not-all-NA DataFrame?

Yes, if the column being set is all-NA object dtype (regardless of whether other columns are all-NA or not), it got inferred to a new dtype:

In [7]: df = pd.DataFrame(index=[0, 1], columns=['a', 'b'])

In [8]: df.loc[:, 'a'] = 1

In [9]: df.loc[:, 'b'] = pd.Timestamp("2012-01-01")

In [10]: df.dtypes
Out[10]: 
a             int64
b    datetime64[ns]
dtype: object

@jbrockmendel
Member Author

I'm OK with this special case. Any objections @jreback @TomAugspurger @phofl before I implement?

Labels
Indexing (Related to indexing on series/frames, not to indexes themselves)
Development

Successfully merging this pull request may close these issues.

API: setitem copy/view behavior ndarray vs Categorical vs other EA