Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proof of concept for Copy-on-Write implementation #41878

Closed
Closed
Show file tree
Hide file tree
Changes from 2 commits
Commits
Show all changes
31 commits
Select commit Hold shift + click to select a range
462526e
POC for Copy-on-Write
jorisvandenbossche Apr 29, 2021
41ee2b7
clean-up ArrayManager.copy implementation
jorisvandenbossche Jun 8, 2021
7a8dffc
add some comments / docstrings
jorisvandenbossche Jun 8, 2021
1c964be
Merge remote-tracking branch 'upstream/master' into am-experiment-cow
jorisvandenbossche Jun 25, 2021
96b6d71
fix series slice + del operator
jorisvandenbossche Jun 25, 2021
17cedb9
fix pickle and column_setitem with dtype changes
jorisvandenbossche Jun 25, 2021
f4614c2
fix setitem on single column for full slice + update frame tests
jorisvandenbossche Jun 25, 2021
81b09c2
Merge remote-tracking branch 'upstream/master' into am-experiment-cow
jorisvandenbossche Jun 26, 2021
7f183de
fix ref tracking in slicing columns
jorisvandenbossche Jun 26, 2021
693bc4f
fix series setitem -> don't update parent cache
jorisvandenbossche Jun 28, 2021
a154591
update indexing tests with new behaviour to get them passing
jorisvandenbossche Jun 28, 2021
71370c4
Merge remote-tracking branch 'upstream/master' into am-experiment-cow
jorisvandenbossche Jul 12, 2021
4e785d7
Merge remote-tracking branch 'upstream/master' into am-experiment-cow
jorisvandenbossche Aug 4, 2021
6741340
fix new test + skip Series.update tests for now
jorisvandenbossche Aug 4, 2021
c4527e9
remove chained assignment outside indexing tests
jorisvandenbossche Aug 4, 2021
b33aaf2
fix DataFrame.fillna + more chained setitem
jorisvandenbossche Aug 4, 2021
9fdad69
Merge remote-tracking branch 'upstream/master' into am-experiment-cow
jorisvandenbossche Aug 7, 2021
f5da03f
reindex/take on columns to use copy-on-write
jorisvandenbossche Aug 8, 2021
c632757
fix some typing issues
jorisvandenbossche Aug 8, 2021
e4a5f33
fix dtype in test for windows
jorisvandenbossche Aug 8, 2021
5efad50
fix dtype in test for windows
jorisvandenbossche Aug 8, 2021
8ea48cc
Merge remote-tracking branch 'upstream/master' into am-experiment-cow
jorisvandenbossche Aug 9, 2021
79b7a30
Merge remote-tracking branch 'upstream/master' into am-experiment-cow
jorisvandenbossche Oct 11, 2021
cf5c7e2
Merge remote-tracking branch 'upstream/master' into am-experiment-cow
jorisvandenbossche Nov 11, 2021
297662a
[ArrayManager] Array version of putmask logic
jorisvandenbossche Nov 11, 2021
a011a06
fixup copy-on-write in putmask
jorisvandenbossche Nov 11, 2021
cc09001
Update BlockManager expected results after recent behaviour changes
jorisvandenbossche Nov 11, 2021
37e7ce0
limit changes to putmask for this PR
jorisvandenbossche Nov 22, 2021
e34b9b2
Merge remote-tracking branch 'upstream/master' into am-experiment-cow
jorisvandenbossche Nov 22, 2021
64377e0
Merge remote-tracking branch 'upstream/master' into am-experiment-cow
jorisvandenbossche Nov 26, 2021
0f28095
address feedback
jorisvandenbossche Nov 26, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -4930,7 +4930,7 @@ def rename(
index: Renamer | None = None,
columns: Renamer | None = None,
axis: Axis | None = None,
copy: bool = True,
copy: bool | None = None,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The copy=None is what changes DataFrame.rename to use a shallow copy with CoW instead of doing a full copy for the ArrayManager.
This is eventually passed to self.copy(deep=copy), and deep=None is used to signal a shallow copy for ArrayManager while for now preserving the deep copy for BlockManager.

inplace: bool = False,
level: Level | None = None,
errors: str = "ignore",
Expand Down Expand Up @@ -5764,7 +5764,7 @@ class max type
if inplace:
new_obj = self
else:
new_obj = self.copy()
new_obj = self.copy(deep=None)

new_index = ibase.default_index(len(new_obj))
if level is not None:
Expand Down
6 changes: 4 additions & 2 deletions pandas/core/generic.py
Original file line number Diff line number Diff line change
Expand Up @@ -992,7 +992,7 @@ def rename(
index: Renamer | None = None,
columns: Renamer | None = None,
axis: Axis | None = None,
copy: bool_t = True,
copy: bool_t | None = None,
inplace: bool_t = False,
level: Level | None = None,
errors: str = "ignore",
Expand Down Expand Up @@ -3911,6 +3911,8 @@ def _check_setitem_copy(self, stacklevel=4, t="setting", force=False):
df.iloc[0:5]['group'] = 'a'

"""
if isinstance(self._mgr, (ArrayManager, SingleArrayManager)):
return
# return early if the check is not needed
if not (force or self._is_copy):
return
Expand Down Expand Up @@ -5833,7 +5835,7 @@ def astype(
return result

@final
def copy(self: FrameOrSeries, deep: bool_t = True) -> FrameOrSeries:
def copy(self: FrameOrSeries, deep: bool_t | None = True) -> FrameOrSeries:
"""
Make a copy of this object's indices and data.

Expand Down
6 changes: 6 additions & 0 deletions pandas/core/indexing.py
Original file line number Diff line number Diff line change
Expand Up @@ -1869,6 +1869,12 @@ def _setitem_single_column(self, loc: int, value, plane_indexer):
"""
pi = plane_indexer

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#42887 would make for a good precursor

if not getattr(self.obj._mgr, "blocks", False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use explicit isinstance; e.g. at first glance empty blocks would go through here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I should have used hasattr instead of getattr here (which would avoid that empty block case).

pandas.core.indexing currently doesn't import from internals, which I suppose is the reason that I used this implicit check instead of an explicit isinstance (having it as a property on the object, which I think was added in some other (open) PR, might also help for this case)

# ArrayManager
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment on why this is AM-specific?

self.obj._mgr.column_setitem(loc, plane_indexer, value)
self.obj._clear_item_cache()
return

ser = self.obj._ixs(loc, axis=1)

# perform the equivalent of a setitem on the info axis
Expand Down
63 changes: 56 additions & 7 deletions pandas/core/internals/array_manager.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@
Callable,
TypeVar,
)
import weakref

import numpy as np

Expand Down Expand Up @@ -121,11 +122,13 @@ class BaseArrayManager(DataManager):

arrays: list[np.ndarray | ExtensionArray]
_axes: list[Index]
refs: list | None

def __init__(
self,
arrays: list[np.ndarray | ExtensionArray],
axes: list[Index],
refs: list | None,
verify_integrity: bool = True,
):
raise NotImplementedError
Expand Down Expand Up @@ -176,6 +179,15 @@ def is_consolidated(self) -> bool:
def _consolidate_inplace(self) -> None:
pass

def _has_no_reference(self, i: int):
return (self.refs is None or self.refs[i] is None) and weakref.getweakrefcount(
self.arrays[i]
) == 0

def _clear_reference(self, i: int):
jorisvandenbossche marked this conversation as resolved.
Show resolved Hide resolved
if self.refs is not None:
self.refs[i] = None

def get_dtypes(self):
return np.array([arr.dtype for arr in self.arrays], dtype="object")

Expand Down Expand Up @@ -516,6 +528,10 @@ def copy(self: T, deep=True) -> T:
-------
BlockManager
"""
if deep is None:
jorisvandenbossche marked this conversation as resolved.
Show resolved Hide resolved
# use shallow copy
deep = False

# this preserves the notion of view copying of axes
if deep:
# hit in e.g. tests.io.json.test_pandas
Expand All @@ -529,9 +545,12 @@ def copy_func(ax):

if deep:
new_arrays = [arr.copy() for arr in self.arrays]
refs = None
else:
new_arrays = self.arrays
return type(self)(new_arrays, new_axes)
new_arrays = list(self.arrays)
refs = [weakref.ref(arr) for arr in self.arrays]

return type(self)(new_arrays, new_axes, refs, verify_integrity=False)

def reindex_indexer(
self: T,
Expand Down Expand Up @@ -599,14 +618,18 @@ def _reindex_indexer(

if axis == 1:
new_arrays = []
refs = []
for i in indexer:
if i == -1:
arr = self._make_na_array(
fill_value=fill_value, use_na_proxy=use_na_proxy
)
ref = None
else:
arr = self.arrays[i]
ref = weakref.ref(arr)
new_arrays.append(arr)
refs.append(ref)

else:
validate_indices(indexer, len(self._axes[0]))
Expand All @@ -625,11 +648,12 @@ def _reindex_indexer(
)
for arr in self.arrays
]
refs = None

new_axes = list(self._axes)
new_axes[axis] = new_axis

return type(self)(new_arrays, new_axes, verify_integrity=False)
return type(self)(new_arrays, new_axes, refs, verify_integrity=False)

def take(self: T, indexer, axis: int = 1, verify: bool = True) -> T:
"""
Expand Down Expand Up @@ -693,12 +717,14 @@ def __init__(
self,
arrays: list[np.ndarray | ExtensionArray],
axes: list[Index],
refs: list | None = None,
verify_integrity: bool = True,
):
# Note: we are storing the axes in "_axes" in the (row, columns) order
# which contrasts the order how it is stored in BlockManager
self._axes = axes
self.arrays = arrays
self.refs = refs

if verify_integrity:
self._axes = [ensure_index(ax) for ax in axes]
Expand Down Expand Up @@ -768,15 +794,19 @@ def get_slice(self, slobj: slice, axis: int = 0) -> ArrayManager:

new_axes = list(self._axes)
new_axes[axis] = new_axes[axis]._getitem_slice(slobj)
# TODO possible optimization is to have `refs` to be a weakref to the
# full ArrayManager to indicate it's referencing
refs = [weakref.ref(arr) for arr in self.arrays]

return type(self)(arrays, new_axes, verify_integrity=False)
return type(self)(arrays, new_axes, refs, verify_integrity=False)

def iget(self, i: int) -> SingleArrayManager:
"""
Return the data as a SingleArrayManager.
"""
values = self.arrays[i]
return SingleArrayManager([values], [self._axes[0]])
ref = weakref.ref(values)
return SingleArrayManager([values], [self._axes[0]], [ref])

def iget_values(self, i: int) -> ArrayLike:
"""
Expand Down Expand Up @@ -882,6 +912,8 @@ def insert(self, loc: int, item: Hashable, value: ArrayLike) -> None:
# TODO is this copy needed?
arrays = self.arrays.copy()
arrays.insert(loc, value)
if self.refs is not None:
self.refs.insert(loc, None)

self.arrays = arrays
self._axes[1] = new_axis
Expand All @@ -897,6 +929,15 @@ def idelete(self, indexer):
self._axes = [self._axes[0], self._axes[1][to_keep]]
return self

def column_setitem(self, loc: int, idx: int | slice | np.ndarray, value):
if self._has_no_reference(loc):
self.arrays[loc][idx] = value
else:
arr = self.arrays[loc].copy()
arr[idx] = value
self.arrays[loc] = arr
self._clear_reference(loc)

# --------------------------------------------------------------------
# Array-wise Operation

Expand Down Expand Up @@ -1179,10 +1220,12 @@ def __init__(
self,
arrays: list[np.ndarray | ExtensionArray],
axes: list[Index],
refs: list | None = None,
verify_integrity: bool = True,
):
self._axes = axes
self.arrays = arrays
self.refs = refs

if verify_integrity:
assert len(axes) == 1
Expand Down Expand Up @@ -1286,8 +1329,14 @@ def apply(self, func, **kwargs):
new_array = getattr(self.array, func)(**kwargs)
return type(self)([new_array], self._axes)

def setitem(self, indexer, value):
return self.apply_with_block("setitem", indexer=indexer, value=value)
def setitem(self, indexer, value, inplace=False):
if not self._has_no_reference(0):
self.arrays[0] = self.arrays[0].copy()
self._clear_reference(0)
if inplace:
self.arrays[0][indexer] = value
else:
return self.apply_with_block("setitem", indexer=indexer, value=value)

def idelete(self, indexer) -> SingleArrayManager:
"""
Expand Down
10 changes: 8 additions & 2 deletions pandas/core/internals/managers.py
Original file line number Diff line number Diff line change
Expand Up @@ -351,8 +351,12 @@ def where(self: T, other, cond, align: bool, errors: str) -> T:
errors=errors,
)

def setitem(self: T, indexer, value) -> T:
return self.apply("setitem", indexer=indexer, value=value)
def setitem(self: T, indexer, value, inplace=False) -> T:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this change necessary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To follow the change that was needed in the ArrayManager (to ensure every mutation of the values happens in the manager), see #41879 for this change as a separated pre-cursor PR

if inplace:
assert self.ndim == 1
self._block.values[indexer] = value
else:
return self.apply("setitem", indexer=indexer, value=value)

def putmask(self, mask, new, align: bool = True):

Expand Down Expand Up @@ -562,6 +566,8 @@ def copy(self: T, deep=True) -> T:
-------
BlockManager
"""
if deep is None:
deep = True
# this preserves the notion of view copying of axes
if deep:
# hit in e.g. tests.io.json.test_pandas
Expand Down
5 changes: 2 additions & 3 deletions pandas/core/series.py
Original file line number Diff line number Diff line change
Expand Up @@ -1064,10 +1064,9 @@ def __setitem__(self, key, value) -> None:
try:
self._set_with_engine(key, value)
except (KeyError, ValueError):
values = self._values
if is_integer(key) and self.index.inferred_type != "integer":
# positional setter
values[key] = value
self._mgr.setitem(key, value, inplace=True)
else:
# GH#12862 adding a new key to the Series
self.loc[key] = value
Expand Down Expand Up @@ -1099,7 +1098,7 @@ def _set_with_engine(self, key, value) -> None:
# error: Argument 1 to "validate_numeric_casting" has incompatible type
# "Union[dtype, ExtensionDtype]"; expected "dtype"
validate_numeric_casting(self.dtype, value) # type: ignore[arg-type]
self._values[loc] = value
self._mgr.setitem(loc, value, inplace=True)

def _set_with(self, key, value):
# other: fancy integer or otherwise
Expand Down
Loading