BUG: ExtensionBlock.set not setting values inplace #32831

jbrockmendel · 2020-03-19T18:14:57Z

In trying to figure out the difference between Block.set and Block.setitem, I found that ExtensionBlock.set is not inplace like it is supposed to be. I traced this back to a problem in CategoricalBlock.should_store, which this PR fixes and tests.

In separate passes I would like to:

- rename set and setitem to something like "setitem_inplace" and "setitem_newobj"
- make setitem consistent (ATM it is only sometimes inplace)
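
To make the proposed naming concrete, here is a minimal sketch of the two contracts. `ToyBlock` is a hypothetical stand-in backed by a plain list, not the pandas `Block` class; only the method names come from the proposal above.

```python
class ToyBlock:
    """Hypothetical stand-in for a block; NOT the pandas Block class."""

    def __init__(self, values):
        self.values = values  # backing list, shared with whoever passed it in

    def setitem_inplace(self, index, value):
        # Always mutate the existing backing list; nothing is returned.
        self.values[index] = value

    def setitem_newobj(self, index, value):
        # Never touch self.values; return a fresh block with the change applied.
        new_values = list(self.values)
        new_values[index] = value
        return ToyBlock(new_values)


block = ToyBlock([1, 2, 3])
original = block.values

block.setitem_inplace(0, 10)
print(original)          # [10, 2, 3]: the original list was modified

new_block = block.setitem_newobj(1, 20)
print(block.values)      # [10, 2, 3]: unchanged by setitem_newobj
print(new_block.values)  # [10, 20, 3]
```

With the behavior encoded in the name, a caller never has to inspect the return value to find out whether the original buffer was touched.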


jreback · 2020-03-21T21:00:53Z

does this need a whatsnew as your example IS user facing? if so, can you add in a followon

jbrockmendel · 2020-03-21T22:57:48Z

> does this need a whatsnew as your example IS user facing? if so, can you add in a followon

yes, will do.

@jreback two questions on Block.setitem behavior (AFAICT you wrote at least one of the two original implementations)

In Block.setitem we have a check

        elif (
            exact_match
            and is_categorical_dtype(arr_value.dtype)
            and not is_categorical_dtype(values)
        ):
            # GH25495 - If the current dtype is not categorical,
            # we need to create a new categorical block
            values[indexer] = value
            return self.make_block(Categorical(self.values, dtype=arr_value.dtype))

It isn't clear why we need exact_match here. If we remove that, there is one test that fails because it expects to retain the non-Categorical dtype when setting only 2 of the 3 values with a length-2 Categorical. Is this important? (not having this restriction would make it easier to simplify this method)

Second, the next check in Block.setitem is:

        # if we are an exact match (ex-broadcasting),
        # then use the resultant dtype
        elif exact_match:
            # We are setting _all_ of the array's values, so can cast to new dtype
            values[indexer] = value
            values = values.astype(arr_value.dtype, copy=False)

The non-obvious thing here is why we are over-writing values instead of just using value (which would also save an astype!). CoW semantics are hard, and it seems really easy for some of these to be careful and intentional and others not to be.
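
The difference between the two options can be sketched with plain lists. This is a toy illustration under loud assumptions: the helper names are hypothetical, `cast` stands in for `astype`, and NumPy dtype casting rules are ignored entirely. It only shows that both paths can return the same data while differing in their side effect on the old buffer.

```python
def overwrite_then_cast(values, value, cast):
    """Shape of the quoted branch: write into the old buffer, then cast."""
    values[:] = value                 # side effect: the old buffer is overwritten
    return [cast(v) for v in values]  # stands in for values.astype(...)


def use_value_directly(values, value):
    """Alternative from the question: skip the write and the cast."""
    return list(value)                # `values` is deliberately left untouched


old_a = [1, 2, 3]
alias_a = old_a                       # someone else holding the same buffer
result_a = overwrite_then_cast(old_a, [1.5, 2.5, 3.5], float)

old_b = [1, 2, 3]
alias_b = old_b
result_b = use_value_directly(old_b, [1.5, 2.5, 3.5])

print(result_a == result_b)  # True: same resulting data
print(alias_a)               # [1.5, 2.5, 3.5]: old buffer was mutated
print(alias_b)               # [1, 2, 3]: old buffer left alone
```

Whether the mutation of the old buffer is intended is exactly the open question here; the sketch does not settle it.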

jorisvandenbossche · 2020-04-10T11:40:05Z

@jbrockmendel this PR also had the consequence that __getitem__ modifies the values in place, which IMO is not the desired behaviour:

In [16]: cat = pd.Categorical(["A", "B", "C"]) 
    ...: df = pd.DataFrame({'cat': cat, 'int': [1, 2, 3]})  

In [17]: df['cat'] = cat[::-1]      

In [18]: cat     
Out[18]: 
[C, B, A]
Categories (3, object): [A, B, C]

Setting a single column with getitem with extension arrays should IMO simply override the existing column / values.
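
The aliasing at play can be reproduced without pandas. The sketch below uses a hypothetical `ToyFrame` that stores the caller's list without copying; it is not how DataFrame is implemented, and both method names are made up for illustration.

```python
class ToyFrame:
    """Toy column store; NOT pandas. Columns share the lists they were given."""

    def __init__(self, columns):
        self._columns = dict(columns)  # no copies: lists are shared with the caller

    def set_inplace(self, name, new_values):
        # Write into the existing list object (the behavior reported above).
        self._columns[name][:] = new_values

    def set_replace(self, name, new_values):
        # Rebind the column to a new list (the behavior argued for above).
        self._columns[name] = list(new_values)

    def column(self, name):
        return self._columns[name]


cat = ["A", "B", "C"]
frame = ToyFrame({"cat": cat})
frame.set_inplace("cat", cat[::-1])
print(cat)                    # ['C', 'B', 'A']: the caller's list changed

cat2 = ["A", "B", "C"]
frame2 = ToyFrame({"cat": cat2})
frame2.set_replace("cat", cat2[::-1])
print(cat2)                   # ['A', 'B', 'C']: the caller's list is untouched
print(frame2.column("cat"))   # ['C', 'B', 'A']
```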

jbrockmendel · 2020-04-10T14:55:40Z

@jorisvandenbossche did you mean __setitem__ modifies in-place? If you really meant __getitem__ then that is definitely bad

jorisvandenbossche · 2020-04-10T17:11:36Z

Indeed, __setitem__ of course ;) But you already have seen the issue as well, so let's discuss it there.

BUG: ExtensionBlock.set not setting values inplace

fd866b7

ShaharNaveh mentioned this pull request Mar 19, 2020

DOC: CI failure due to fsspec deprecation warning #32832

Closed

Merge branch 'master' of https://github.com/pandas-dev/pandas into se…

8631dac

…t_vs_setitem

jreback added the Indexing and Internals labels Mar 21, 2020

jreback added this to the 1.1 milestone Mar 21, 2020

jreback merged commit d427335 into pandas-dev:master Mar 21, 2020

jbrockmendel deleted the set_vs_setitem branch March 22, 2020 01:06

SeeminSyed pushed a commit to CSCD01-team01/pandas that referenced this pull request Mar 22, 2020

BUG: ExtensionBlock.set not setting values inplace (pandas-dev#32831)

21cd3f0

jbrockmendel added a commit to jbrockmendel/pandas that referenced this pull request Mar 23, 2020

BUG: ExtensionBlock.set not setting values inplace (pandas-dev#32831)

a5a09d3

jbrockmendel mentioned this pull request Mar 24, 2020

Dataframe change alters original array used in creation #32960

Closed

jorisvandenbossche mentioned this pull request Apr 10, 2020

REGR: setting column with setitem should not modify existing array inplace #33457

Open

TomAugspurger mentioned this pull request Jul 13, 2020

API: Honor copy for dict-input in DataFrame #34872

Closed

jorisvandenbossche mentioned this pull request Jul 13, 2020

REGR: setting column with setitem should not modify existing array inplace #35266

Closed

simonjayhawkins mentioned this pull request Jul 22, 2020

BUG: df reassignment following reorder_categories changed behavior in 1.1.0rc0 #35369

Closed
