ENH: Improve performance for df.__setitem__ with list-like indexers #38148

phofl · 2020-11-29T14:07:15Z

closes BUG: df.__setitem__ can be 10x slower than pd.concat(..., axis=1) #37954
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Reindexing the BlockManager improves performance significantly. I hope I have not missed anything concerning the reindexing of the blocks.
Time spent in _ensure_listlike_indexer is pretty low now.
timeit result for the ops methods:

In [20]: %timeit setitem(x, x_col, df)
1.09 ms ± 9.65 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [21]: %timeit concat(x, x_col, df)
293 µs ± 4.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Should we add tests here? I have added an asv to capture this case.

cc @jbrockmendel

jreback

wow, simplification and perf boost

asv_bench/benchmarks/indexing.py

jreback · 2020-11-29T15:56:42Z

pandas/core/indexing.py

-                        self.obj[k] = value[i]
-                    else:
-                        self.obj[k] = value
+            keys = self.obj.columns.tolist()


can you create the keys in the same order whether they are in the obj or not (e.g. combine L666 and 667)

do we have a test for this ordering?

Union should do the trick.

At least the test added with #37964 covers this

doc/source/whatsnew/v1.2.0.rst

pandas/core/indexing.py

jreback · 2020-11-29T19:08:24Z

lgtm can merge on green.

phofl · 2020-11-29T21:39:35Z

@jreback greenish. Failure unrelated

jreback · 2020-11-29T21:52:47Z

thanks @phofl

jbrockmendel · 2020-11-30T00:01:53Z

@phofl did you determine that this isn't making a copy?

phofl · 2020-11-30T00:13:31Z

Not with a test, but I tested this in code.

x = self.obj
self.obj._mgr = self.obj._mgr.reindex_axis(keys, 0)

This also changes x, so that is not a copy, is it?

Should we add a test for this?

jbrockmendel · 2020-11-30T00:20:54Z

This also changes x, so that is not a copy, is it?

Assuming homogeneous dtype for now, what I'm asking for is:

values = self.obj.values
self.obj._mgr = self.obj._mgr.reindex_axis(keys, 0)
new_values = self.obj.values

assert new_values is values   # <-- will raise if we made a copy

jreback · 2020-11-30T00:29:38Z

i am not sure this is possible w/o a copy

phofl · 2020-11-30T00:32:02Z

This does make a copy. I did not know that we have to compare values instead of the object itself.

phofl · 2020-11-30T00:34:15Z

Actually this raised previously too, so at least not a regression

jbrockmendel · 2020-11-30T00:48:12Z

I did not know that we have to compare values instead of the object itself.

Note: in the example above I assumed homogeneous dtypes. More generally, we need obj._mgr.blocks[n].values to be unchanged for each pre-existing n.
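To make the distinction concrete, here is a minimal pure-Python sketch of the check described above. ToyBlock, ToyManager, ToyFrame, reindex_copy, reindex_keep and buffers_retained are hypothetical stand-ins invented for illustration; they are not pandas classes or methods. The sketch only shows why an alias such as x = self.obj follows the manager swap even when every underlying buffer was copied, so the comparison has to be made per block on the values.

```python
class ToyBlock:
    """Hypothetical stand-in for a block: one buffer of column values."""

    def __init__(self, values):
        self.values = values  # a plain list acts as the underlying buffer


class ToyManager:
    """Hypothetical stand-in for a block manager holding several blocks."""

    def __init__(self, blocks):
        self.blocks = blocks

    def reindex_copy(self):
        # Rebuild every block around a fresh buffer, i.e. copy the data.
        return ToyManager([ToyBlock(list(b.values)) for b in self.blocks])

    def reindex_keep(self):
        # Reuse the existing block objects, so their buffers stay shared.
        return ToyManager(list(self.blocks))


class ToyFrame:
    """Hypothetical stand-in for a frame that owns a manager."""

    def __init__(self, mgr):
        self._mgr = mgr


def buffers_retained(frame, reindex):
    """Return True if every pre-existing buffer survives the manager swap."""
    before = [blk.values for blk in frame._mgr.blocks]
    frame._mgr = reindex(frame._mgr)
    after = [blk.values for blk in frame._mgr.blocks]
    return all(new is old for old, new in zip(before, after))


frame = ToyFrame(ToyManager([ToyBlock([1, 2, 3]), ToyBlock([4.0, 5.0])]))
alias = frame  # same object, so it follows any attribute swap on frame

copied = buffers_retained(frame, ToyManager.reindex_copy)
alias_follows = alias._mgr is frame._mgr  # says nothing about the buffers
kept = buffers_retained(frame, ToyManager.reindex_keep)
```

In this toy, the alias check passes in both cases, while the per-block identity check is the only one that tells the copying path apart from the retaining path.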

i am not sure this is possible w/o a copy

A few things need to happen for it to be possible:

add consolidate: bool = True keyword to BlockManager.reindex_axis that then passes consolidate=consolidate to reindex_indexer
add only_slice: bool = False to reindex_axis and reindex_indexer so it can be passed to _slice_take_blocks_ax0
the reindex_axis call in this PR passes consolidate=False, only_slice=True
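The three steps above can be sketched with a toy model. This is an assumption-laden illustration, not the pandas implementation: ToyManager, its method bodies, the n_new parameter and shares_buffers are hypothetical, and only the method names and the two keywords are borrowed from the comment. Blocks are modelled as memoryview objects over array buffers because slicing a memoryview gives a view rather than a copy.

```python
from array import array


class ToyManager:
    """Hypothetical manager; each block is a memoryview over an int64 buffer."""

    def __init__(self, blocks):
        self.blocks = blocks

    def reindex_axis(self, n_new, consolidate=True, only_slice=False):
        # Forward both keywords unchanged to the lower-level method.
        return self.reindex_indexer(
            n_new, consolidate=consolidate, only_slice=only_slice
        )

    def reindex_indexer(self, n_new, consolidate=True, only_slice=False):
        blocks = self.blocks
        nrows = len(blocks[0])
        if consolidate:
            # Merging blocks writes every value into one fresh buffer.
            merged = array("q")
            for blk in blocks:
                merged.extend(blk.tolist())
            blocks = [memoryview(merged)]
        taken = self._slice_take_blocks(blocks, only_slice=only_slice)
        # Append zero-filled placeholder blocks for the new columns.
        added = [memoryview(array("q", [0] * nrows)) for _ in range(n_new)]
        return ToyManager(taken + added)

    @staticmethod
    def _slice_take_blocks(blocks, only_slice=False):
        if only_slice:
            # Slicing a memoryview yields another view on the same buffer.
            return [blk[:] for blk in blocks]
        # Otherwise copy each block into its own new buffer.
        return [memoryview(array("q", blk.tolist())) for blk in blocks]


def shares_buffers(old, new):
    """True if each pre-existing block in new still wraps old's buffer."""
    return all(n.obj is o.obj for o, n in zip(old.blocks, new.blocks))


mgr = ToyManager(
    [memoryview(array("q", [1, 2, 3])), memoryview(array("q", [4, 4, 6]))]
)
default = mgr.reindex_axis(2)
retained = mgr.reindex_axis(2, consolidate=False, only_slice=True)

default_shares = shares_buffers(mgr, default)
retained_shares = shares_buffers(mgr, retained)

# Write through the original block and see which result observes it.
mgr.blocks[0][0] = 100
seen_by_retained = retained.blocks[0][0]
seen_by_default = default.blocks[0][0]
```

Under these assumptions, only the call with consolidate=False, only_slice=True keeps the original buffers, which is why the write to the original block is visible through retained but not through default.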

simonjayhawkins · 2020-11-30T09:11:54Z

@meeseeksdev backport 1.1.x

simonjayhawkins · 2020-11-30T10:06:23Z

#38181 (comment)

@phofl #38148 (comment) Was this necessary? Do we know which commit caused the performance regression? The existing code on 1.1.x looks much simpler than what was replaced on master.

If we don't backport this, we will need to move the release note on master

jorisvandenbossche · 2020-11-30T11:23:50Z

As @jbrockmendel noted, this can change the copy/view semantics in certain cases.

One example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 4, 6]})
# get one column as a view of df
s = df['a']
# add columns with list-like indexer
df[['c', 'd']] = np.array([[.1, .2], [.3, .4], [.4, .5]])
# edit in place the first column to check view semantics
df.iloc[0, 0] = 100

on master this gives:

In [8]: df
Out[8]: 
     a  b    c    d
0  100  4  0.1  0.2
1    2  4  0.3  0.4
2    3  6  0.4  0.5

In [9]: s
Out[9]: 
0    1
1    2
2    3
Name: a, dtype: int64

where the "a" and "b" columns were copied by the df[['c', 'd']] = ... operation (shown by s not being updated).

While on pandas 1.1.4, this copy didn't happen, and the series was actually updated:

In [6]: df
Out[6]: 
     a  b    c    d
0  100  4  0.1  0.2
1    2  4  0.3  0.4
2    3  6  0.4  0.5

In [7]: s
Out[7]: 
0    100
1      2
2      3
Name: a, dtype: int64

This of course closely relates to the recent discussions about improving the copy/view semantics (as currently those semantics are both not clear and largely untested).

(The above is admittedly a specific example chosen to trigger the issue. In many cases we would make a copy anyway, e.g. if float columns had already been present in the example df.)

jbrockmendel · 2020-12-01T04:04:41Z

what if instead of calling reindex_axis we called reindex_indexer and passed indexer=slice(len(self.obj.columns))?

Fix performance problems for df.__setitem__ with lislike indexers

f953ff4

phofl added the Indexing and Performance labels Nov 29, 2020

Improve code

322feb5

jreback requested changes Nov 29, 2020

jreback added this to the 1.2 milestone Nov 29, 2020

jreback requested changes Nov 29, 2020

doc/source/whatsnew/v1.2.0.rst (outdated)

jreback modified the milestones: 1.2, 1.1.5 Nov 29, 2020

jbrockmendel reviewed Nov 29, 2020

pandas/core/indexing.py (outdated)

phofl and others added 2 commits November 29, 2020 17:17

Adress review comments

07b1c52

Merge branch 'master' into 37954

d4b76e5

jreback approved these changes Nov 29, 2020

jreback merged commit 2f41109 into pandas-dev:master Nov 29, 2020

phofl deleted the 37954 branch November 29, 2020 21:54

lumberbot-app bot added the Still Needs Manual Backport label Nov 30, 2020

simonjayhawkins pushed a commit to simonjayhawkins/pandas that referenced this pull request Nov 30, 2020

Backport PR pandas-dev#38148: ENH: Improve performance for df.__setitem__ with list-like indexers

335d100

simonjayhawkins mentioned this pull request Nov 30, 2020

Backport PR #38148: ENH: Improve performance for df.__setitem__ with list-like indexers #38181

Closed

simonjayhawkins removed the Still Needs Manual Backport label Nov 30, 2020

jbrockmendel pushed a commit to jbrockmendel/pandas that referenced this pull request Nov 30, 2020

ENH: Improve performance for df.__setitem__ with list-like indexers (pandas-dev#38148)

80b40d9

simonjayhawkins mentioned this pull request Nov 30, 2020

RLS: 1.2 #37784

Closed

jbrockmendel mentioned this pull request Dec 1, 2020

Retain views with listlike indexers setitem #38204

Merged

1 task

simonjayhawkins added a commit that referenced this pull request Dec 1, 2020

Revert "ENH: Improve performance for df.__setitem__ with list-like indexers (#38148)"

5a84497

This reverts commit 2f41109.

simonjayhawkins mentioned this pull request Dec 1, 2020

Revert "ENH: Improve performance for df.__setitem__ with list-like indexers" #38208

Merged

simonjayhawkins added a commit that referenced this pull request Dec 1, 2020

Revert "ENH: Improve performance for df.__setitem__ with list-like indexers (#38148)" (#38208)

c2018c1

This reverts commit 2f41109.

simonjayhawkins removed this from the 1.1.5 milestone Dec 1, 2020

simonjayhawkins mentioned this pull request Dec 1, 2020

BUG: df.__setitem__ can be 10x slower than pd.concat(..., axis=1) #37954

Closed

3 tasks

jbrockmendel added a commit to jbrockmendel/pandas that referenced this pull request Dec 1, 2020

port whatsnew, asv from pandas-dev#38148

2235f75
