Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF: Add Manager.column_setitem to set values into a single column (without intermediate series) #47074

Merged
Merged
Show file tree
Hide file tree
Changes from 7 commits
Commits
Show all changes
18 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 7 additions & 8 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -3921,16 +3921,15 @@ def _set_value(
Sets whether or not index/col interpreted as indexers
"""
try:
# setitem will do validation that may raise TypeError,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this comment refer to column_setitem instead of just setitem?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it also looks like the new method doesn't actually do the validation this comment refers to?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an existing comment (originally a few lines below), but so I suppose this comment was actually already not up to date anymore.
So before this PR the comment was about setitem_inplace, and that also doesn't do any validation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setitem_inplace calls np_can_hold_element, which raises on failure

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, I see. That is doing validation that can raise the LossySetitemError. So the new column_setitem calls setitem, which will also call np_can_hold_element, but there catching the LossySetitemError and coercing to the target dtype if needed.
That is something that the loc/iloc fallback below otherwise will also do, so I suppose this change is OK (but the comment is then indeed no longer correct, and we also don't need to catch LossySetitemError here)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and we also don't need to catch LossySetitemError here

yah, possibly also TypeError and ValueError

# ValueError, or LossySetitemError
if takeable:
series = self._ixs(col, axis=1)
loc = index
icol = col
iindex = cast(int, index)
else:
series = self._get_item_cache(col)
loc = self.index.get_loc(index)

# setitem_inplace will do validation that may raise TypeError,
# ValueError, or LossySetitemError
series._mgr.setitem_inplace(loc, value)
icol = self.columns.get_loc(col)
iindex = self.index.get_loc(index)
self._mgr.column_setitem(icol, iindex, value, inplace=True)

except (KeyError, TypeError, ValueError, LossySetitemError):
# set using a non-recursive method & reset the cache
Expand Down
28 changes: 8 additions & 20 deletions pandas/core/indexing.py
Original file line number Diff line number Diff line change
Expand Up @@ -52,7 +52,6 @@
from pandas.core.indexers import (
check_array_indexer,
is_empty_indexer,
is_exact_shape_match,
is_list_like_indexer,
is_scalar_indexer,
length_of_indexer,
Expand Down Expand Up @@ -1936,42 +1935,31 @@ def _setitem_single_column(self, loc: int, value, plane_indexer):
"""
pi = plane_indexer
jbrockmendel marked this conversation as resolved.
Show resolved Hide resolved

ser = self.obj._ixs(loc, axis=1)

# perform the equivalent of a setitem on the info axis
# as we have a null slice or a slice with full bounds
# which means essentially reassign to the columns of a
# multi-dim object
# GH#6149 (null slice), GH#10408 (full bounds)
if com.is_null_slice(pi) or com.is_full_slice(pi, len(self.obj)):
ser = value
self.obj._iset_item(loc, value)
elif (
is_array_like(value)
and is_exact_shape_match(ser, value)
jorisvandenbossche marked this conversation as resolved.
Show resolved Hide resolved
and len(value.shape) > 0
and self.obj.shape[0] == value.shape[0]
and not is_empty_indexer(pi)
):
if is_list_like(pi):
ser = value[np.argsort(pi)]
value = value[np.argsort(pi)]
else:
# in case of slice
ser = value[pi]
value = value[pi]
self.obj._iset_item(loc, value)
else:
# set the item, first attempting to operate inplace, then
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this comment still accurate/helpful?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, not really I think (but also not before this PR). The logic of "first attempt to operate inplace, then falling back to casting if necessary" is contained within the column_setitem call (and before this PR it was contained in the setitem call)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now, it was added in 573c063, and from that point of view: I would say the comment is still as accurate as it was when it was added :)

But I think we can also remove it now, it's the general behaviour of loc/iloc to fallback to casting

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth discussing the gameplan for after the deprecation inside _setitem_single_column is enforced. What I had in mind was that inside of doing the can_hold_element check as part of take_split_path, we would do a try/except falling back to the split path. Does that idea play nicely with what you're doing here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would think that enforcing the deprecation would mean to remove the checks for "null/full slice" and so stop using _iset_item to replace the column in those cases, and always use column_setitem ?

# falling back to casting if necessary; see
# _whatsnew_130.notable_bug_fixes.setitem_column_try_inplace

orig_values = ser._values
ser._mgr = ser._mgr.setitem((pi,), value)

if ser._values is orig_values:
# The setitem happened inplace, so the DataFrame's values
# were modified inplace.
return
self.obj._iset_item(loc, ser)
return

# reset the sliced object if unique
self.obj._iset_item(loc, ser)
self.obj._mgr.column_setitem(loc, plane_indexer, value)
self.obj._clear_item_cache()

def _setitem_single_block(self, indexer, value, name: str):
"""
Expand Down
19 changes: 19 additions & 0 deletions pandas/core/internals/array_manager.py
Original file line number Diff line number Diff line change
Expand Up @@ -869,6 +869,25 @@ def iset(
self.arrays[mgr_idx] = value_arr
return

def column_setitem(
self, loc: int, idx: int | slice | np.ndarray, value, inplace: bool = False
) -> None:
"""
Set values ("setitem") into a single column (not setting the full column).

This is a method on the ArrayManager level, to avoid creating an
intermediate Series at the DataFrame level (`s = df[loc]; s[idx] = value`)
"""
arr = self.arrays[loc]
# create temporary SingleArrayManager without ref to use setitem implementation
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the "without ref" here why you're not using .iget?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like if you used .iget here then the method could be shared between AM/BM?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the "without ref" here why you're not using .iget?

The "without ref" comment is from copy pasting this from the initial CoW PR #41878 (so should probably remove that here).

But that is indeed the reason to not use iget in that PR.

mgr = SingleArrayManager([arr], [self._axes[0]])
if inplace:
mgr.setitem_inplace(idx, value)
else:
new_mgr = mgr.setitem((idx,), value)
# update existing ArrayManager in-place
self.arrays[loc] = new_mgr.arrays[0]
jorisvandenbossche marked this conversation as resolved.
Show resolved Hide resolved

def insert(self, loc: int, item: Hashable, value: ArrayLike) -> None:
"""
Insert item at selected position.
Expand Down
16 changes: 16 additions & 0 deletions pandas/core/internals/managers.py
Original file line number Diff line number Diff line change
Expand Up @@ -1188,6 +1188,22 @@ def _iset_single(
self.blocks = new_blocks
return

def column_setitem(
self, loc: int, idx: int | slice | np.ndarray, value, inplace: bool = False
) -> None:
"""
Set values ("setitem") into a single column (not setting the full column).

This is a method on the BlockManager level, to avoid creating an
intermediate Series at the DataFrame level (`s = df[loc]; s[idx] = value`)
"""
col_mgr = self.iget(loc)
if inplace:
col_mgr.setitem_inplace(idx, value)
else:
new_mgr = col_mgr.setitem((idx,), value)
self.iset(loc, new_mgr._block.values, inplace=True)

def insert(self, loc: int, item: Hashable, value: ArrayLike) -> None:
"""
Insert item at selected position.
Expand Down
2 changes: 1 addition & 1 deletion pandas/tests/frame/indexing/test_setitem.py
Original file line number Diff line number Diff line change
Expand Up @@ -1082,7 +1082,7 @@ def test_setitem_partial_column_inplace(self, consolidate, using_array_manager):
tm.assert_numpy_array_equal(zvals, expected.values)
assert np.shares_memory(zvals, df["z"]._values)
if not consolidate:
assert df["z"]._values is zvals
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #47074 (comment) for comment about this removal

assert df["z"]._values.base is zvals.base
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this check could pass if both .base attrs are None

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good catch (will update that). Now, I also not fully sure why this change was actually needed, will see if I can figure that out.

Copy link
Member Author

@jorisvandenbossche jorisvandenbossche May 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so before we had this piece of code in indexing.py (that was removed in this PR):

            if ser._values is orig_values:
                # The setitem happened inplace, so the DataFrame's values
                #  were modified inplace.
                return

That ensures that it is actually still an identical array that is stored in df["z"].

Now this goes through column_setitem, where I didn't add such a check, but simply always call iset with the values from the SingleBlockManager. When creating the SingleBlockManager with iget, this creates a view of the values (return self.values[i] in Block.iget), so this is not anymore an identical array, only a viewing array.

Since the only important guarantee is that it shares the same data (which is already tested on the line above), I am inclined to just remove this additional assert for identical arrays.

Alternatively, I could do a similar check in column_setitem.


def test_setitem_duplicate_columns_not_inplace(self):
# GH#39510
Expand Down
4 changes: 3 additions & 1 deletion pandas/tests/indexing/test_partial.py
Original file line number Diff line number Diff line change
Expand Up @@ -266,7 +266,7 @@ def test_partial_setting(self):
with pytest.raises(IndexError, match=msg):
s.iat[3] = 5.0

def test_partial_setting_frame(self):
def test_partial_setting_frame(self, using_array_manager):
df_orig = DataFrame(
np.arange(6).reshape(3, 2), columns=["A", "B"], dtype="int64"
)
Expand All @@ -279,6 +279,8 @@ def test_partial_setting_frame(self):
df.iloc[4, 2] = 5.0

msg = "index 2 is out of bounds for axis 0 with size 2"
if using_array_manager:
msg = "list index out of range"
with pytest.raises(IndexError, match=msg):
df.iat[4, 2] = 5.0

Expand Down