Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF/API: DataFrame.__setitem__ never operate in-place #39510

Merged
merged 35 commits into from
Mar 16, 2021

Conversation

jbrockmendel
Copy link
Member

  • closes #xxxx
  • tests added / passed
  • Ensure all linting tests pass, see here for how to run them
  • whatsnew entry

xref #38896 (comment)

@jreback jreback added the Indexing Related to indexing on series/frames, not to indexes themselves label Feb 2, 2021
else:
if isinstance(value, DataFrame):
check_key_length(self.columns, key, value)
for k1, k2 in zip(key, value.columns):
self[k1] = value[k2]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

really like to move this stuff out of here into another module, but i guess that's for another day

raise NotImplementedError

if np.shape(value)[-1] != len(ilocs):
raise ValueError("Columns must be same length as key")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are all of the branches covered? this is a lot of new code

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not the NotImplementedError on L3276; not sure what to do there

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but the rest is covered? what about the previous paths this was taking, can they be removed now? (e.g. they are uncovered)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated with a test that hits the two raise ValueError("Columns must be same length as key") lines. everything other than the NotImplementedError is reached

@jbrockmendel
Copy link
Member Author

getting this and #39163 in will make it easier to keep ArrayManager behavior in sync with non-ArrayManager

@jbrockmendel
Copy link
Member Author

@jreback gentle ping; some simplification in iloc.__setitem__ becomes feasible after this, some of which I think may make #39578 easier

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you give some explanation what this PR is exactly changing? (it's hard to see that from the code changed in frame.py)

expected["c"] = expected["c"].astype(arr.dtype)
expected["d"] = expected["d"].astype(arr.dtype)
assert expected["c"].dtype == arr.dtype
assert expected["d"].dtype == arr.dtype
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What changed here? (I wouldn't expect the above test to change behaviour related to this PR, as it's adding new columns)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC this was needed for the 32bit builds

@jbrockmendel
Copy link
Member Author

Can you give some explanation what this PR is exactly changing? (it's hard to see that from the code changed in frame.py)

There are cases where DataFrame.__setitem__ currently ends up dispatching to iloc.__setitem__ (which in turn ends up iterating over columns). This does the iterating-over-columns directly in the DataFrame method, avoiding the iloc dispatch. This resolves the trouble in iloc[:, foo] of having to tell whether to write into existing data or insert a new array.

@phofl
Copy link
Member

phofl commented Feb 20, 2021

I think we talked about this a few weeks back. I prepared a few tests back then but did not get around to push this forward.

The tests are checking, if we are really not operating inplace. Do you think it is worth adding them?

class TestDataFrameSetItemNotInplace:
    def test_setitem_duplicate_columns(self):
        cols = ["A", "B"] * 2
        df = DataFrame(0.0, index=[0], columns=cols)
        df_copy = df.copy()
        df_view = df[:]
        df["B"] = (2, 5)

        expected = DataFrame([[0.0, 2, 0.0, 5]], columns=cols)
        tm.assert_frame_equal(df_view, df_copy)
        tm.assert_frame_equal(df, expected)

    @pytest.mark.xfail(reason="Setitem with same dtype still changing inplace")
    @pytest.mark.parametrize("value", [1, np.array([[1], [1]]), [[1], [1]]])
    def test_setitem_same_dtype(self, value):
        cols = ["A", "B"]
        df = DataFrame(0, index=[0, 1], columns=cols)
        df_copy = df.copy()
        df_view = df[:]
        df[["B"]] = value

        expected = DataFrame([[0, 1], [0, 1]], columns=cols)
        tm.assert_frame_equal(df, expected)
        tm.assert_frame_equal(df_view, df_copy)

    @pytest.mark.parametrize("value", [1.0, np.array([[1.0], [1.0]]), [[1.0], [1.0]]])
    def test_setitem_listlike_key_scalar_value(self, value):
        cols = ["A", "B"]
        df = DataFrame(0, index=[0, 1], columns=cols)
        df_copy = df.copy()
        df_view = df[:]
        df[["B"]] = value

        expected = DataFrame([[0, 1.0], [0, 1.0]], columns=cols)
        tm.assert_frame_equal(df_view, df_copy)
        tm.assert_frame_equal(df, expected)

@jbrockmendel
Copy link
Member Author

Do you think it is worth adding them?

absolutely. will check that these pass on the branch and then update. thanks

@jbrockmendel
Copy link
Member Author

ping

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clarify the whatsnew a bit more about what actually changed / what the impact is for the user? (this is not really a bug fix ..)

@jbrockmendel
Copy link
Member Author

Can you clarify the whatsnew a bit more about what actually changed / what the impact is for the user? (this is not really a bug fix ..)

The only part to call a bugfix is the int32 changes (see affected tests). The changed exception messages hardly seem worth noting

@jorisvandenbossche
Copy link
Member

The whatsnew note now says "some cases". Eg could make that more specific?

@jorisvandenbossche
Copy link
Member

To try to clarify what I mean. I didn't have time to review the code of this PR in enough detail to know what it changes, but I would still expect to get an idea about that based on the (whatsnew) description. But now neither the PR description nor the whatsnew gives me a clear idea of the (potential) changes.

@jbrockmendel
Copy link
Member Author

To try to clarify what I mean. I didn't have time to review the code of this PR in enough detail to know what it changes, but I would still expect to get an idea about that based on the (whatsnew) description. But now neither the PR description nor the whatsnew gives me a clear idea of the (potential) changes.

ATM DataFrame.__setitem__ calls self.iloc[:, indexer] = value, which has different copy/view semantics than we want for __setitem__ (see #38896 (comment), also linked in the OP). Big picture: this PR is avoiding that self.iloc dispatch.

Luckily some of the tests suggested by @phofl fails in master:

# test_setitem_listlike_key_scalar_value_not_inplace with value = 1.0
cols = ["A", "B"]
df = DataFrame(0, index=[0, 1], columns=cols)
df_copy = df.copy()
df_view = df[:]
df[["B"]] = 1.0

expected = DataFrame([[0, 1.0], [0, 1.0]], columns=cols)
tm.assert_frame_equal(df_view, df_copy)  # <-- raises in master, b/c df["B"]._values is written _into_
tm.assert_frame_equal(df, expected)

# test_setitem_duplicate_columns_not_inplace
cols = ["A", "B"] * 2
df = DataFrame(0.0, index=[0], columns=cols)
df_copy = df.copy()
df_view = df[:]
df["B"] = (2, 5)

expected = DataFrame([[0.0, 2, 0.0, 5]], columns=cols)
tm.assert_frame_equal(df_view, df_copy)  # <-- raises in master, b/c df["B"]._values is written _into_
tm.assert_frame_equal(df, expected)

@jorisvandenbossche
Copy link
Member

OK, so a consequence is that the resulting dtype can change compared to master when you are setting multiple columns. For example (using this branch):

In [4]: df = pd.DataFrame(np.random.randn(3, 5), columns=['a', 'b', 'c', 'd', 'e'])

In [5]: df[["b", "c"]] = 1

In [6]: df
Out[6]: 
          a  b  c         d         e
0  0.576266  1  1  0.115430  0.670362
1 -0.409783  1  1  1.824356  1.702370
2  1.168233  1  1 -0.708717  1.555301

In [7]: df.dtypes
Out[7]: 
a    float64
b      int64
c      int64
d    float64
e    float64
dtype: object

On master, the above preserves the all-float dtypes. Do you know if that is specific to that case? (multiple columns, and setting with a scalar) From quick testing, it seems that other cases (like setting multiple columns with a dataframe, or array, .. ) didn't try to preserve the dtype if possible.
But where there other cases also going through the self.iloc[:, indexer] = value path?

@jbrockmendel
Copy link
Member Author

But where there other cases also going through the self.iloc[:, indexer] = value path?

The only place where we go through self.iloc[:, indexer] = value is in _setitem_array, which is reached in two ways (note: a bunch of other cases are caught before going through _setitem_array, so these conditions are not sufficient to ensure we go through iloc)

if isinstance(key, (Series, np.ndarray, list, Index)):
elif is_list_like(value) and 1 < len(self.columns.get_indexer_for([key])) == len(value):

which is pretty well summed up by "multiple columns"

@jorisvandenbossche
Copy link
Member

I would personally then add a small subsection to the "Notable bug fixes" part showing the example of setting multiple columns now having different dtypes as result (there is already a subsection for iloc/loc changes (actually trying inplace instead of replacing), which is quite related (but opposite) change.

self.iloc[:, indexer] = value
self._iset_not_inplace(key, value)

def _iset_not_inplace(self, key, value):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment (or informal docstring) here stating briefly which cases this covers? (eg I think only for listlike key and value?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea, will do

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated, LMK if you find this informative

I think comments are addressed, so this is ready for another look (once the CI runs)

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you merge master

# hypothetically would return obj.iloc[:, i]
if isinstance(obj, np.ndarray):
return obj[..., i]
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the future would be really nice to move lots of this elsewhere (e.g. indexing.py or similar)

@jbrockmendel
Copy link
Member Author

rebased, green ex two issues on master being addressed elsewhere

@jreback jreback merged commit fefd999 into pandas-dev:master Mar 16, 2021
@jreback
Copy link
Contributor

jreback commented Mar 16, 2021

thanks @jbrockmendel nice!

@jbrockmendel jbrockmendel deleted the setitem-frame-no-defer branch March 16, 2021 17:15
@simonjayhawkins
Copy link
Member

There is a perf regression between 1.2.5 and 1.3.x that points to this PR that is not shown on the asv dashboard https://pandas.pydata.org/speed/pandas/#indexing.InsertColumns.time_assign_list_like_with_setitem?python=3.8&Cython=0.29.21&jinja2=

(pandas-dev) simon@T3630:~/pandas/asv_bench (master)$ asv compare v1.2.5 1.3.x |grep time_assign_list_like_with_setitem
x     2.38±0.03ms       27.4±0.4ms    11.48  indexing.InsertColumns.time_assign_list_like_with_setitem

profiling gives

(pandas-dev) simon@T3630:~/pandas/asv_bench (master)$ asv profile indexing.InsertColumns.time_assign_list_like_with_setitem fefd9991658c4717fa2e6b37328be81158be4192|head -n 30
· Discovering benchmarks
· Profile data does not already exist. Running profiler now.
·· Benchmarking conda-py3.8-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
··· Running (indexing.InsertColumns.time_assign_list_like_with_setitem--).
··· ...rtColumns.time_assign_list_like_with_setitem         24.5±0.2ms

Tue Jun 29 11:28:30 2021    /tmp/tmpa23qf3wg

         71979 function calls (70571 primitive calls) in 0.039 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.039    0.039 {built-in method builtins.exec}
        1    0.000    0.000    0.039    0.039 /home/simon/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/asv/benchmark.py:540(method_caller)
        1    0.000    0.000    0.039    0.039 /home/simon/pandas/asv_bench/benchmarks/indexing.py:378(time_assign_list_like_with_setitem)
    101/1    0.000    0.000    0.038    0.038 /home/simon/pandas/asv_bench/env/11a1c20ede452de2525075dc4a15eb94/lib/python3.8/site-packages/pandas/core/frame.py:3516(__setitem__)
        1    0.000    0.000    0.038    0.038 /home/simon/pandas/asv_bench/env/11a1c20ede452de2525075dc4a15eb94/lib/python3.8/site-packages/pandas/core/frame.py:3548(_setitem_array)
        1    0.000    0.000    0.038    0.038 /home/simon/pandas/asv_bench/env/11a1c20ede452de2525075dc4a15eb94/lib/python3.8/site-packages/pandas/core/frame.py:3585(_iset_not_inplace)
      100    0.000    0.000    0.027    0.000 /home/simon/pandas/asv_bench/env/11a1c20ede452de2525075dc4a15eb94/lib/python3.8/site-packages/pandas/core/frame.py:3703(_set_item)
      100    0.000    0.000    0.025    0.000 /home/simon/pandas/asv_bench/env/11a1c20ede452de2525075dc4a15eb94/lib/python3.8/site-packages/pandas/core/frame.py:3678(_set_item_mgr)
      100    0.001    0.000    0.024    0.000 /home/simon/pandas/asv_bench/env/11a1c20ede452de2525075dc4a15eb94/lib/python3.8/site-packages/pandas/core/internals/managers.py:1230(insert)
  401/301    0.003    0.000    0.016    0.000 /home/simon/pandas/asv_bench/env/11a1c20ede452de2525075dc4a15eb94/lib/python3.8/site-packages/pandas/core/indexes/base.py:334(__new__)
      100    0.001    0.000    0.015    0.000 /home/simon/pandas/asv_bench/env/11a1c20ede452de2525075dc4a15eb94/lib/python3.8/site-packages/pandas/core/indexes/base.py:5864(insert)
      100    0.000    0.000    0.010    0.000 /home/simon/pandas/asv_bench/env/11a1c20ede452de2525075dc4a15eb94/lib/python3.8/site-packages/pandas/core/indexes/base.py:5220(get_indexer_for)
  101/100    0.000    0.000    0.009    0.000 /home/simon/pandas/asv_bench/env/11a1c20ede452de2525075dc4a15eb94/lib/python3.8/site-packages/pandas/core/indexes/base.py:3378(get_indexer)
      101    0.000    0.000    0.005    0.000 /home/simon/pandas/asv_bench/env/11a1c20ede452de2525075dc4a15eb94/lib/python3.8/site-packages/pandas/core/indexes/base.py:6174(ensure_index)
      101    0.000    0.000    0.005    0.000 /home/simon/pandas/asv_bench/env/11a1c20ede452de2525075dc4a15eb94/lib/python3.8/site-packages/pandas/core/construction.py:455(sanitize_array)
      100    0.000    0.000    0.004    0.000 /home/simon/pandas/asv_bench/env/11a1c20ede452de2525075dc4a15eb94/lib/python3.8/site-packages/pandas/core/internals/managers.py:1994(_fast_count_smallints)
 1200/800    0.001    0.000    0.004    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}

before

(pandas-dev) simon@T3630:~/pandas/asv_bench (master)$ asv profile indexing.InsertColumns.time_assign_list_like_with_setitem fefd9991658c4717fa2e6b37328be81158be4192~1|head -n 30
· Discovering benchmarks
·· Uninstalling from conda-py3.8-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
·· Building 203f9014 <v1.3.0rc0~819> for conda-py3.8-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
·· Installing 203f9014 <v1.3.0rc0~819> into conda-py3.8-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
· Profile data does not already exist. Running profiler now.
·· Benchmarking conda-py3.8-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
··· Running (indexing.InsertColumns.time_assign_list_like_with_setitem--).
··· ...rtColumns.time_assign_list_like_with_setitem        2.31±0.01ms

Tue Jun 29 11:33:24 2021    /tmp/tmp7ioufpog

         2162 function calls (2141 primitive calls) in 0.003 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.003    0.003 {built-in method builtins.exec}
        1    0.000    0.000    0.003    0.003 /home/simon/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/asv/benchmark.py:540(method_caller)
        1    0.000    0.000    0.003    0.003 /home/simon/pandas/asv_bench/benchmarks/indexing.py:378(time_assign_list_like_with_setitem)
        1    0.002    0.002    0.002    0.002 {method 'randn' of 'numpy.random.mtrand.RandomState' objects}
        1    0.000    0.000    0.001    0.001 /home/simon/pandas/asv_bench/env/11a1c20ede452de2525075dc4a15eb94/lib/python3.8/site-packages/pandas/core/frame.py:3516(__setitem__)
        1    0.000    0.000    0.001    0.001 /home/simon/pandas/asv_bench/env/11a1c20ede452de2525075dc4a15eb94/lib/python3.8/site-packages/pandas/core/frame.py:3548(_setitem_array)
        1    0.000    0.000    0.001    0.001 /home/simon/pandas/asv_bench/env/11a1c20ede452de2525075dc4a15eb94/lib/python3.8/site-packages/pandas/core/indexing.py:670(_ensure_listlike_indexer)
        1    0.000    0.000    0.000    0.000 /home/simon/pandas/asv_bench/env/11a1c20ede452de2525075dc4a15eb94/lib/python3.8/site-packages/pandas/core/indexing.py:1257(_get_listlike_indexer)
      6/5    0.000    0.000    0.000    0.000 /home/simon/pandas/asv_bench/env/11a1c20ede452de2525075dc4a15eb94/lib/python3.8/site-packages/pandas/core/indexes/base.py:334(__new__)
      2/1    0.000    0.000    0.000    0.000 /home/simon/pandas/asv_bench/env/11a1c20ede452de2525075dc4a15eb94/lib/python3.8/site-packages/pandas/core/indexes/base.py:2784(union)
      3/2    0.000    0.000    0.000    0.000 /home/simon/pandas/asv_bench/env/11a1c20ede452de2525075dc4a15eb94/lib/python3.8/site-packages/pandas/core/indexes/base.py:3378(get_indexer)
        1    0.000    0.000    0.000    0.000 /home/simon/pandas/asv_bench/env/11a1c20ede452de2525075dc4a15eb94/lib/python3.8/site-packages/pandas/core/indexes/base.py:5220(get_indexer_for)
        1    0.000    0.000    0.000    0.000 /home/simon/pandas/asv_bench/env/11a1c20ede452de2525075dc4a15eb94/lib/python3.8/site-packages/pandas/core/internals/base.py:62(reindex_axis)
      3/2    0.000    0.000    0.000    0.000 /home/simon/pandas/asv_bench/env/11a1c20ede452de2525075dc4a15eb94/lib/python3.8/site-packages/pandas/core/indexes/base.py:3405(_get_indexer)

so spends some time in _iset_not_inplace

Is this expected?

@jbrockmendel
Copy link
Member Author

Is this expected?

With the current implementation, yes. I think a deep(ish) refactor can avoid the perf hit, but it's not a 1-day thing. cc @phofl

BTW is there an option you can pass to the asv profile command like strip_dirs? (ideally just strip everything up through "site-packages/")

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants