Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Bug in loc did not change dtype when complete column was assigned #37749

Closed
wants to merge 40 commits into from

Conversation

phofl
Copy link
Member

@phofl phofl commented Nov 10, 2020

@phofl phofl added Dtype Conversions Unexpected or buggy dtype conversions Indexing Related to indexing on series/frames, not to indexes themselves labels Nov 10, 2020
@@ -1550,6 +1551,13 @@ def _setitem_with_indexer(self, indexer, value):
val = list(value.values()) if isinstance(value, dict) else value
blk = self.obj._mgr.blocks[0]
take_split_path = not blk._can_hold_element(val)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be the else condtiion

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, value can be anything from int, float to numpy array. I think this check is only necessary if we have Series or DataFrame. Maybe with an array?

@@ -298,6 +299,36 @@ def test_iloc_setitem_bool_indexer(self, klass):
expected = DataFrame({"flag": ["x", "y", "z"], "value": [2, 3, 4]})
tm.assert_frame_equal(df, expected)

def test_setitem_complete_columns_different_dtypes(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add an int64 column in the original frame (that is not selected).

also can you test with Int64.

Copy link
Member Author

@phofl phofl Nov 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the column.

It seems like astype does not work for Int64 if the input is from object dtype. Should I add a completly new test or is there a trick I am not aware of?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that's right, you can .astype('int64').astype(dtype)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks very much, that is pretty obvious. Parametrized the test now

expected = DataFrame({"A": ["a", "b"], "B": [1, 2], "C": [3, 4]}, dtype="int64")
tm.assert_frame_equal(df, expected)

def test_setitem_single_column_as_series_different_dtype(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment about (add to the original frame) and test Int64

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@pep8speaks
Copy link

pep8speaks commented Nov 13, 2020

Hello @phofl! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-02-13 20:38:58 UTC

@phofl phofl changed the title BUG: Bug in loc did not change dtype when complete columne was assigned BUG: Bug in loc did not change dtype when complete column was assigned Nov 14, 2020
� Conflicts:
�	doc/source/whatsnew/v1.2.0.rst
�	pandas/tests/frame/indexing/test_setitem.py
�	pandas/tests/indexing/test_iloc.py
@@ -289,6 +289,27 @@ def test_setitem_periodindex(self):
assert isinstance(rs.index, PeriodIndex)
tm.assert_index_equal(rs.index, rng)

@pytest.mark.parametrize("klass", [list, np.array])
def test_iloc_setitem_bool_indexer(self, klass):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test name is good, belongs in tests.indexing.test_iloc

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

)
expected = DataFrame(
{
"date0": [to_datetime("2015-01-01"), to_datetime("2016-01-01")],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use Timestamp instead of to_datetime

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@phofl
Copy link
Member Author

phofl commented Nov 21, 2020

The case described in #37593 fails now again, because split path can not handle Series assignment into a single cell. Will look into this further tomorrow

� Conflicts:
�	doc/source/whatsnew/v1.2.0.rst
�	pandas/tests/frame/indexing/test_setitem.py
@phofl
Copy link
Member Author

phofl commented Nov 22, 2020

Puuuuh,

cc @jbrockmendel
The case

df = DataFrame(columns=["a"], index=[0])
rhs = Series([1, 2, 3])
df.iloc[0, 0] = rhs

runs now through _setitem_with_indexer_split_path instead of _setitem_single_block. This raises

Traceback (most recent call last):
  File "/home/developer/.config/JetBrains/PyCharm2020.2/scratches/scratch_4.py", line 158, in <module>
    df.iloc[0, 0] = rhs
  File "/home/developer/PycharmProjects/pandas/pandas/core/indexing.py", line 689, in __setitem__
    iloc._setitem_with_indexer(indexer, value, self.name)
  File "/home/developer/PycharmProjects/pandas/pandas/core/indexing.py", line 1643, in _setitem_with_indexer
    self._setitem_with_indexer_split_path(indexer, value, name)
  File "/home/developer/PycharmProjects/pandas/pandas/core/indexing.py", line 1689, in _setitem_with_indexer_split_path
    raise ValueError(
ValueError: Must have equal len keys and value when setting with an iterable

The case is pretty weird with setting a Series into a single cell. Any idea how we could accomplish this within _setitem_with_indexer_split_path?

@jbrockmendel
Copy link
Member

The case is pretty weird with setting a Series into a single cell. Any idea how we could accomplish this within _setitem_with_indexer_split_path?

so ive mentioned a couple of times an upcoming PR that will make all DataFrame cases go through split_path. To make that work there are two kludgy checks I have to do near the top of the method (also non-kludges #37931 and #37932):

        info_idx = indexer[1]
        pi = indexer[0]

        if com.is_null_slice(info_idx) and is_scalar(value):
            # We can go directly through BlockManager.setitem without worrying
            #  about alignment.
            # TODO: do we need to do some kind of copy_with_setting check?
            self.obj._mgr = self.obj._mgr.setitem(indexer=indexer, value=value)
            return

        if is_integer(info_idx):
            if is_integer(pi):
                # We need to watch out for case where we are treating a listlike
                #  as a scalar, e.g. test_setitem_iloc_scalar_single for JSONArray

                mgr = self.obj._mgr
                blkno = mgr.blknos[info_idx]
                blkloc = mgr.blklocs[info_idx]
                blk = mgr.blocks[blkno]

                if blk._can_hold_element(value):
                    # NB: we are assuming here that _can_hold_element is accurate
                    # TODO: do we need to do some kind of copy_with_setting check?
                    self.obj._check_is_chained_assignment_possible()
                    blk.setitem_inplace((pi, blkloc), value)
                    self.obj._maybe_update_cacher(clear=True)
                    return

where I've defined Block.setitem_inplace:

    def setitem_inplace(self, indexer, value):
        """
        setitem but only inplace.

        Notes
        -----
        Assumes self is 2D and that indexer is a 2-tuple.
        """
        if lib.is_scalar(value) and not self.is_extension:
            # Convert timedelta/datetime to timedelta64/datetime64
            value = convert_scalar_for_putitemlike(value, self.dtype)

        pi = indexer[0]
        if self.is_extension:
            # TODO(EA2D): not needed with 2D EAs
            self.values[pi] = value
        else:
            blkloc = indexer[1]
            self.values[blkloc, pi] = value

Looks like you're dealing with roughly the same problem that drove me to put in the double is_integer checks.

@phofl
Copy link
Member Author

phofl commented Nov 22, 2020

It is exactly the same problem. If there is not elegant way, we could check if we have this case before setting split path to True. But this makes only sense if your pr does not arrive before 1.2 and this could go in

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Dec 23, 2020
� Conflicts:
�	doc/source/whatsnew/v1.2.0.rst
�	pandas/tests/frame/indexing/test_setitem.py
�	pandas/tests/indexing/test_iloc.py
�	pandas/tests/indexing/test_loc.py
@phofl
Copy link
Member Author

phofl commented Dec 23, 2020

@jbrockmendel Is your PR with always split path still planned?

@jreback
Copy link
Contributor

jreback commented Feb 11, 2021

status here?

@phofl
Copy link
Member Author

phofl commented Feb 13, 2021

Depends, this sends a few more code paths down the split path in indexing, but I don't know if @jbrockmendel has something other in mind here?

� Conflicts:
�	doc/source/whatsnew/v1.3.0.rst
�	pandas/core/indexing.py
�	pandas/tests/frame/indexing/test_setitem.py
�	pandas/tests/indexing/test_iloc.py
# GH#20635
ser = Series(["3", "4"], name="A")
ser.loc[:] = ser.astype("int64").astype(dtype)
expected = Series([3, 4], name="A", dtype=dtype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this is doing the opposite of #39163. did we decide to revert part or all of that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since yours is significantly newer I am fine with „closing“ this. Would check if some of the issues are fixed

@simonjayhawkins
Copy link
Member

@phofl what's the status on this?

@simonjayhawkins
Copy link
Member

@phofl closing as stale. reopen when ready

@jbrockmendel
Copy link
Member

@phofl is this worth reviving? AFAICT you were waiting on me to get the always-split-path PR (#40380) in and that is stuck in limbo with a few tests failing locally.

@phofl phofl deleted the 20635 branch April 27, 2023 19:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Indexing Related to indexing on series/frames, not to indexes themselves Stale
Projects
None yet
5 participants