Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: ExtensionBlock.set not setting values inplace #32831

Merged
merged 2 commits into from
Mar 21, 2020

Conversation

jbrockmendel
Copy link
Member

In trying to figure out the difference between Block.set vs Block.setitem I found that ExtensionBlock.set is not inplace like it is supposed to be. Traced this back to a problem in CategoricalBlock.should_store, which this fixes+tests.

In separate passes I would like to

  • rename set and setitem to something like "setitem_inplace" and "setitem_newobj"
  • ATM setitem is sometimes inplace; I'd like to make that consistent.

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves Internals Related to non-user accessible pandas implementation labels Mar 21, 2020
@jreback jreback added this to the 1.1 milestone Mar 21, 2020
@jreback jreback merged commit d427335 into pandas-dev:master Mar 21, 2020
@jreback
Copy link
Contributor

jreback commented Mar 21, 2020

does this need a whatsnew as your example IS user facing? if so, can you add in a followon

@jbrockmendel
Copy link
Member Author

does this need a whatsnew as your example IS user facing? if so, can you add in a followon

yes, will do.

@jreback two questions on Block.setitem behavior (AFAICT you wrote at least one of the two original implementations)

In Block.setitem we have a check

        elif (
            exact_match
            and is_categorical_dtype(arr_value.dtype)
            and not is_categorical_dtype(values)
        ):
            # GH25495 - If the current dtype is not categorical,
            # we need to create a new categorical block
            values[indexer] = value
            return self.make_block(Categorical(self.values, dtype=arr_value.dtype))

It isn't clear why we need exact_match here. If we remove that, there is one test that fails because it expects to retain the non-Categorical dtype when setting only 2 of the 3 values with a length-2 Categorical. Is this important? (not having this restriction would make it easier to simplify this method)

Second, the next check in Block.setitem is:

        # if we are an exact match (ex-broadcasting),
        # then use the resultant dtype
        elif exact_match:
            # We are setting _all_ of the array's values, so can cast to new dtype
            values[indexer] = value
            values = values.astype(arr_value.dtype, copy=False)

The non-obvious thing here is why we are over-writing values instead of just using value (which would also save an astype!). CoW semantics are hard, and it seems really easy for some of these to be careful and intentional and others not to be.

@jorisvandenbossche
Copy link
Member

@jbrockmendel this PR also had the consequence that __getitem__ modifies the values in place, which IMO is not the desired behaviour:

In [16]: cat = pd.Categorical(["A", "B", "C"]) 
    ...: df = pd.DataFrame({'cat': cat, 'int': [1, 2, 3]})  

In [17]: df['cat'] = cat[::-1]      

In [18]: cat     
Out[18]: 
[C, B, A]
Categories (3, object): [A, B, C]

Setting a single column with getitem with extension arrays should IMO simply override the existing column / values.

@jbrockmendel
Copy link
Member Author

@jorisvandenbossche did you mean __setitem__ modifies in-place? If you really meant __getitem__ then that is definitely bad

@jorisvandenbossche
Copy link
Member

Indeed, __setitem__ of course ;) But you already have seen the issue as well, so let's discuss it there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Internals Related to non-user accessible pandas implementation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants