0.24.0 vs 0.23.4: scalar + DataFrame is 3000x slower #24990

Closed
dycw opened this issue Jan 29, 2019 · 17 comments
Labels
Numeric Operations (Arithmetic, Comparison, and Logical operations); Performance (Memory or execution speed performance)

Comments

@dycw

dycw commented Jan 29, 2019

Code Sample, a copy-pastable example if possible

First, two conda environments:

name: pandas_timing_0_23

dependencies:
  - python=3.7
  - pandas=0.23
name: pandas_timing_0_24

dependencies:
  - python=3.7
  - pandas=0.24

Now, a generalized script timing all 8 combinations of scalar + ndframe, where the scalar is 0 or 0.0 and the ndframe is a Series or a DataFrame holding 5000 NaNs; each pairing is timed in both orders.

from itertools import product
from os import system

from pandas import show_versions

if __name__ == "__main__":
    show_versions()
    print()

    for scalar, ndframe in product(
        ["0", "0.0"],
        [
            "pd.Series(np.nan, index=range(5000), dtype=float)",
            "pd.DataFrame(np.nan, index=range(1), columns=range(5000), dtype=float)",
        ],
    ):
        for left, right in [(scalar, ndframe), (ndframe, scalar)]:
            stmt = f"{left} + {right}"
            print(f"Timing {stmt!r}...")
            system(
                f"""python -m timeit --setup='import numpy as np; import pandas as pd;' {stmt!r}"""
            )
            print()

Problem description

A table of timings:

+-----------+-----------+--------+----+--------+----+----------+--------------+
|           |           | 0.23.4 |    | 0.24.0 |    | % slower | times slower |
+-----------+-----------+--------+----+--------+----+----------+--------------+
| int       | Series    |    102 | us |    140 | us |     37.3 |              |
| Series    | int       |   98.7 | us |    140 | us |     41.8 |              |
| int       | DataFrame |    188 | us |    592 | ms |          |         3149 |
| DataFrame | int       |    187 | us |    588 | ms |          |         3144 |
| float     | Series    |     97 | us |    140 | us |     44.3 |              |
| Series    | float     |    102 | us |    138 | us |     35.3 |              |
| float     | DataFrame |    176 | us |    609 | ms |          |         3460 |
| DataFrame | float     |    185 | us |    591 | ms |          |         3195 |
+-----------+-----------+--------+----+--------+----+----------+--------------+

These are collated from the following:

0.23.4
Timing '0 + pd.Series(np.nan, index=range(5000), dtype=float)'...
5000 loops, best of 5: 102 usec per loop

Timing 'pd.Series(np.nan, index=range(5000), dtype=float) + 0'...
5000 loops, best of 5: 98.7 usec per loop

Timing '0 + pd.DataFrame(np.nan, index=range(1), columns=range(5000), dtype=float)'...
2000 loops, best of 5: 188 usec per loop

Timing 'pd.DataFrame(np.nan, index=range(1), columns=range(5000), dtype=float) + 0'...
2000 loops, best of 5: 187 usec per loop

Timing '0.0 + pd.Series(np.nan, index=range(5000), dtype=float)'...
5000 loops, best of 5: 97 usec per loop

Timing 'pd.Series(np.nan, index=range(5000), dtype=float) + 0.0'...
5000 loops, best of 5: 102 usec per loop

Timing '0.0 + pd.DataFrame(np.nan, index=range(1), columns=range(5000), dtype=float)'...
2000 loops, best of 5: 176 usec per loop

Timing 'pd.DataFrame(np.nan, index=range(1), columns=range(5000), dtype=float) + 0.0'...
2000 loops, best of 5: 185 usec per loop

0.24.0
Timing '0 + pd.Series(np.nan, index=range(5000), dtype=float)'...
2000 loops, best of 5: 140 usec per loop

Timing 'pd.Series(np.nan, index=range(5000), dtype=float) + 0'...
2000 loops, best of 5: 140 usec per loop

Timing '0 + pd.DataFrame(np.nan, index=range(1), columns=range(5000), dtype=float)'...
1 loop, best of 5: 592 msec per loop

Timing 'pd.DataFrame(np.nan, index=range(1), columns=range(5000), dtype=float) + 0'...
1 loop, best of 5: 588 msec per loop

Timing '0.0 + pd.Series(np.nan, index=range(5000), dtype=float)'...
2000 loops, best of 5: 140 usec per loop

Timing 'pd.Series(np.nan, index=range(5000), dtype=float) + 0.0'...
2000 loops, best of 5: 138 usec per loop

Timing '0.0 + pd.DataFrame(np.nan, index=range(1), columns=range(5000), dtype=float)'...
1 loop, best of 5: 609 msec per loop

Timing 'pd.DataFrame(np.nan, index=range(1), columns=range(5000), dtype=float) + 0.0'...
1 loop, best of 5: 591 msec per loop


Expected Output

Output of pd.show_versions()

0.23.4

INSTALLED VERSIONS

commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-862.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.15.4
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

0.24.0

INSTALLED VERSIONS

commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-862.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.0
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.15.4
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@TomAugspurger
Contributor

@dycw is "DataFrame ops with scalars are slower" a fair summary of the issue? Anything else from your post that's important?

cc @jbrockmendel. Is this related to the fix for the broadcasting? We're spending a lot of time in dispatch_to_series, via _combine_const.

@TomAugspurger added the Performance (Memory or execution speed performance) and Numeric Operations (Arithmetic, Comparison, and Logical operations) labels on Jan 29, 2019
@dycw
Author

dycw commented Jan 29, 2019

@TomAugspurger Yes, I believe all ops are impacted, not just add. My test also showed a regression with Series, albeit not by orders of magnitude.

@TomAugspurger
Contributor

My test also showed regression with Series too, albeit not orders of magnitude.

We'll see what profiling shows, but that's likely a different issue. 0.24.0 changed more operations to operate column-wise, so this slowdown scales with the number of columns, which wouldn't affect Series.
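
As a quick, hedged check of that scaling claim (not from the thread; it simply times df + 0 for increasing column counts on whatever pandas version is installed):

import timeit

import numpy as np
import pandas as pd

# per-op time should grow roughly with the number of columns on 0.24.0,
# while staying nearly flat on 0.23.4
for ncols in (10, 100, 1000, 5000):
    df = pd.DataFrame(np.nan, index=range(1), columns=range(ncols), dtype=float)
    per_op = min(timeit.repeat("df + 0", globals={"df": df}, number=10, repeat=3)) / 10
    print(f"{ncols:>5} columns: {per_op * 1e3:.3f} ms per op")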

@jbrockmendel
Member

Is this related to the fix for the broadcasting? We're spending a lot of time in dispatch_to_series, via _combine_const.

Yes, we are doing the operation column-by-column, which definitely has a perf hit in few-row/many-column cases like this. Locally I'm only seeing about a 10x difference, can't speak to the 3000x.

side-note: instantiating the DataFrame pd.DataFrame(np.nan, index=range(1), columns=range(5000), dtype=float) is about 10% slower in 0.24.0.

In the short-medium run, operating column-wise is the only way we could find to make Series and DataFrame behavior consistent.

In the medium-to-long run, I think the way to address this perf issue is to dispatch to blocks instead of columns. This is a big part of why I want EA to be able to handle 2D arrays.
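
As a rough illustration of the two strategies (a simplified sketch only, not the actual pandas internals; dispatch_to_series does considerably more than this):

import numpy as np
import pandas as pd

def add_columnwise(df, scalar):
    # roughly the 0.24.0 shape: one Series construction and one op per column,
    # so the Python-level overhead scales with the number of columns
    return pd.DataFrame(
        {col: df.iloc[:, i] + scalar for i, col in enumerate(df.columns)},
        index=df.index,
    )

def add_blockwise(df, scalar):
    # the block-style alternative: a single vectorized op over the 2-D values
    return pd.DataFrame(df.values + scalar, index=df.index, columns=df.columns)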

@TomAugspurger
Contributor

Are EAs being 1-d prohibiting blockwise application of these ops? I would think those are orthogonal.

Each EA column would be done on its own. But a frame with 1,000 float columns and one EA would end up with two calls to __add__, one for each block.
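
For example (hedged: this pokes at internal, version-dependent attributes; the BlockManager is df._data in 0.24-era pandas and df._mgr in newer versions):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((3, 1000)))        # 1,000 float columns -> one consolidated block
df["ea"] = pd.Categorical(["a", "b", "c"])    # an EA column -> its own block

mgr = df._mgr if hasattr(df, "_mgr") else df._data   # internal BlockManager
print(len(mgr.blocks))                               # 2 blocks, hence two __add__ calls block-wise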

@jbrockmendel
Member

Are EAs being 1-d prohibiting blockwise application of these ops? I would think those are orthogonal.

I could have made this clearer.

The functions in core.ops that define arithmetic/comparison ops for Series are pretty easy to adapt to the 2D case (in fact, some are already dim-agnostic). The path I have in mind (a toy sketch follows the list) is:

  • Allow 2D EAs
  • Blocks currently backed by numpy arrays instead become backed by PandasArrays
  • define arithmetic/comparison ops directly on PandasArray
  • DataFrame and Series both dispatch to underlying block(s), which in turn dispatch to their EAs.
  • (side-note) Index also gets its arithmetic/comparison ops from the EA that backs it, closing a whole mess of issues/xfails.
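
A toy sketch of that dispatch chain (the names here are hypothetical, not pandas API; it only illustrates the shape of the idea):

import numpy as np

class ToyArray:
    # stand-in for a 2-D-capable ExtensionArray / PandasArray: the array owns the op
    def __init__(self, values):
        self.values = np.asarray(values)

    def __add__(self, other):
        return ToyArray(self.values + other)   # dim-agnostic: works for 1-D and 2-D

class ToyBlock:
    # stand-in for a Block: it just forwards to its backing array
    def __init__(self, array):
        self.array = array

    def __add__(self, other):
        return ToyBlock(self.array + other)

# a "frame" holding one 2-D block: a single __add__ call per block, not per column
block = ToyBlock(ToyArray(np.zeros((1, 5000))))
result = block + 1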

@TomAugspurger
Contributor

Allow 2D EAs

First, assume a can opener :) (fellow economists should get that).

More seriously, I've been thinking of ways we could opt pandas into 2D arrays, without putting that complexity on users. Nothing concrete yet though. In particular, it's not clear to me that a 2D EA isn't just re-inventing Block. I'd be interested in seeing if we can get ops working blockwise with the current mix of ndarray-backed Blocks and EA-backed ops (though this isn't a high priority for me right now).

@jbrockmendel
Member

More seriously, I've been thinking of ways we could opt pandas into 2D arrays, without putting that complexity on users.

Maybe we should discuss this in a dedicated Issue? I like this as a potential compromise on the 1D vs drop-in-for-ndarray tradeoff. It would make it feasible for me to put together a proof of concept for the block-based arithmetic.

just re-inventing Block

Block would be a much simpler beast if it a) didn't have to carry around mgr_locs and b) didn't have to worry about 1D blocks inside a DataFrame.

@gfyoung
Member

gfyoung commented Jan 29, 2019

First, assume a can opener :) (fellow economists should get that).

@TomAugspurger : Had to interject, but I approve 🙂

@jreback
Contributor

jreback commented Jan 29, 2019

I agree with @TomAugspurger here. This has been a big box of worms (more than a can :>) for quite some time. Blocks are rather efficient for some operations but have some costs:

  • assembling the blocks can be pretty expensive (think initial copy from 1-d arrays)
  • shape mutating causes copies
  • setting is overly complicated, now blocks need to be split
  • adds contributor overhead because of the complexity
  • blocks implement lazy consolidation to mitigate some of the above costs

So it's not as simple as 'just use 2-D EA'. As you can see, there are some obvious benefits, but there is a really long tail of hidden costs.

I have basically switched my view over the years from being pro-blocks (for the perf benefits) to pro-columns because of the simplicity. Sure, we do have some perf costs, but IMHO this is easily outweighed by the reduced complexity.

Anyhow, let's open a dedicated issue about this.

@jorisvandenbossche
Member

I profiled the case of

df = pd.DataFrame(np.nan, index=range(5000), columns=range(5000), dtype=float)
%timeit df + 1

So it uses a 5000x5000 DataFrame instead of 1x5000 (still a wide DataFrame, but a somewhat more realistic case, I think).

On master this takes around 2 seconds, which is 20x slower than before the change (10x with the original timeit in the OP, but excluding the DataFrame creation makes it more pronounced).

With a few relatively simple optimizations, I got this down from a 20x slowdown to a 5x slowdown.

Those include:

  • use a more efficient way to loop over the columns instead of using df.iloc[:, i] (using BlockManager.get; we should probably write a helper method for this) (a rough sketch follows at the end of this comment)
  • avoid the overhead of creating a Series for each column, and then again a Series from the result of the op (this assumes, of course, that the operation is properly defined at the array level)
  • avoid the (small) overhead of recreating the resulting DataFrame from a dict of Series instead of a list of arrays (this functionality is kind of hidden in DataFrame._from_arrays)

Additionally, the recreation of the array spends quite some time in sanitizing the array and checking the block types, which should in principle not be necessary.

Of course this was a very simplified case (I only looked at the operation with a scalar, a case without NaNs, etc.), but I think it at least shows that if we care about this performance issue for wide DataFrames, it can be improved substantially.
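
A very rough sketch of that direction, assuming a homogeneous float frame and a scalar right-hand side (this is not the actual patch; it keeps the column loop but does the op at the ndarray level, skipping per-column Series, and does not show the cheaper frame reconstruction from the last point):

import numpy as np
import pandas as pd

def add_scalar_fast(df, other):
    values = df.values                        # the single 2-D block of a homogeneous frame
    arrays = [values[:, i] + other for i in range(values.shape[1])]   # op per column, no Series
    return pd.DataFrame(dict(zip(df.columns, arrays)), index=df.index)

df = pd.DataFrame(np.nan, index=range(5000), columns=range(5000), dtype=float)
out = add_scalar_fast(df, 1)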

@TomAugspurger
Contributor

@jbrockmendel 2D EAs should fix this regression, right?

@jbrockmendel
Member

jbrockmendel commented Aug 9, 2019

yes

(well, 2D EA will make it feasible to fix in a follow-up)

@TomAugspurger
Contributor

TomAugspurger commented Aug 9, 2019 via email

@jbrockmendel
Member

At the current pace, no, I don't expect 2D EA to be in master by the end of September. If everyone gets on board, that could change quickly.

Working on plan B to get a perf fix in more promptly

@jbrockmendel
Member

Benchmarking #29853 against master with the examples from the OP:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.nan, index=range(1), columns=range(5000), dtype=float)

%timeit df + 0
911 ms ± 17.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)   # <-- master
147 µs ± 7.82 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)  # <-- #29853

%timeit df + 0.0
891 ms ± 10 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)    # <-- master
139 µs ± 1.36 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)   # <-- #29853

Speedups of 6197x and 6410x.

@jbrockmendel
Member

Closed by #29853
