
DataFrame.groupby.sum() is extremely slow when dtype is timedelta64[ns] compared to int64. #20660

Closed
wezzman opened this issue Apr 11, 2018 · 11 comments · Fixed by #55273
Assignees
Labels
Benchmark Performance (ASV) benchmarks good first issue

Comments

@wezzman

wezzman commented Apr 11, 2018

Steps to demonstrate

import numpy as np
import pandas as pd

td = pd.DataFrame(
    np.random.randint(1000000, 100000000, (10000, 100)),
    index=np.random.randint(200, size=(10000,)),
).astype('timedelta64[ns]')
i = td.copy(deep=True).astype('int64')

%time temp1 = td.groupby(lambda x: x).sum() # 2.33 s
%time temp2 = i.groupby(lambda x: x).sum() # 15.6 ms

temp2 = temp2.astype('timedelta64[ns]')
assert((temp1 == temp2).values.all())

Problem description

When performing a summation on grouped 'timedelta64[ns]' data, there is a significant performance decrease compared to the same data interpreted as 'int64'.

Possibly related to #18053

Expected Behavior

It is my understanding that internally 'timedelta64[ns]' values are just 'int64' values interpreted as a count of nanoseconds. Shouldn't the summation performance be equal in that case?
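As a quick sanity check of that mental model (this snippet is my own illustration, not part of the original report), viewing a `timedelta64[ns]` array as `int64` recovers the raw nanosecond counts without copying:

```python
import numpy as np

# timedelta64[ns] values are stored as 64-bit nanosecond counts;
# a zero-copy view as int64 exposes the raw integers.
td = np.array([1_500, 2_000_000], dtype='timedelta64[ns]')
raw = td.view('int64')
print(raw.tolist())  # [1500, 2000000]
```

So the data layout is identical; only the dtype-dependent dispatch in groupby differs.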

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.22.0
pytest: 3.5.0
pip: 9.0.1
setuptools: 39.0.1
Cython: 0.28.1
numpy: 1.14.2
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.7.2
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.1
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.5
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger TomAugspurger added the Groupby, Performance (Memory or execution speed performance), and Timedelta (Timedelta data type) labels Apr 12, 2018
@TomAugspurger TomAugspurger added this to the Next Major Release milestone Apr 12, 2018
@TomAugspurger
Contributor

Thanks for the report.

cc @WillAyd if you have any pointers on where someone could get started here?

@WillAyd
Member

WillAyd commented Apr 12, 2018

Hmm, after a quick profile it appears that the timedelta data does not leverage the Cython function. The problem starts on the line below:

data = data.get_numeric_data(copy=False)

For timedelta that internal method returns an empty tuple, so block aggregation fails and the execution falls back to a much slower group-by-group summation.

Perhaps the easiest fix is to add numeric_only=False to the sum method below:

cls.sum = groupby_function('sum', 'add', np.sum, min_count=0)

Running this locally gets the examples back in line with one another:

In [12]: %timeit td.groupby(lambda x: x).sum()
8.53 ms ± 30.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [13]: %timeit i.groupby(lambda x: x).sum()
9.66 ms ± 30.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

It's not entirely clear to me what the overall purpose of numeric_only is, so long term maybe we don't even need it, but for the time being the above should work (cc @jreback)

FWIW we don't have any ASVs for timedelta, so this would be a good opportunity to add some.
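For context (my own sketch, not part of the original comment), an ASV-style benchmark for this case could look like the following; the class and method names are illustrative, not the names actually used in the pandas benchmark suite:

```python
import numpy as np
import pandas as pd

class GroupbySumTimedelta:
    """Illustrative ASV-style benchmark: names and structure are assumptions."""

    def setup(self):
        # Same shape as the repro in this issue: 10,000 rows, 100 columns,
        # grouped on ~200 distinct integer index labels.
        vals = np.random.randint(1_000_000, 100_000_000, size=(10_000, 100))
        idx = np.random.randint(200, size=10_000)
        self.td = pd.DataFrame(vals, index=idx).astype('timedelta64[ns]')
        self.int = self.td.astype('int64')

    def time_sum_timedelta(self):
        self.td.groupby(level=0).sum()

    def time_sum_int64(self):
        self.int.groupby(level=0).sum()
```

ASV calls `setup` before each timing run and times each `time_*` method, so comparing the two methods' results directly measures the dtype-dependent gap this issue describes.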

@wezzman are you interested in trying a PR for this?

@jreback
Contributor

jreback commented Apr 12, 2018

timedelta is not a numeric type (e.g. float, int); so this is correct.

certainly could add more benchmarks though.

@jreback
Contributor

jreback commented Apr 12, 2018

#18053 is different; it's about Period, which is currently aggregated using object dtype, not the underlying integer representation (@TomAugspurger is working on making this an EA, which would be much faster).

@WillAyd
Member

WillAyd commented Apr 12, 2018

Understood on the numeric type check, but you don't have any objection to relaxing the numeric_only requirement for sum, do you? I don't think it would change the outcome of the operation, just get you there faster.

@wezzman
Author

wezzman commented Apr 16, 2018

@WillAyd, I am unsure what you mean by a PR.

@TomAugspurger
Contributor

TomAugspurger commented Apr 18, 2018 via email

@kaijfox

kaijfox commented Jul 24, 2018

I know this is a fairly old thread, but I thought I'd mention that this is a problem for dtype bool too.
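For reference (my own check, not part of the original comment), a grouped sum over a bool column is a per-group count of True values, so it should benefit from the same fast path once the dtype is routed through the Cython aggregation:

```python
import pandas as pd

# Grouped sum over a bool column counts the True values in each group.
df = pd.DataFrame({'key': [0, 0, 1, 1],
                   'flag': [True, True, False, True]})
out = df.groupby('key')['flag'].sum()
print(out.to_dict())  # {0: 2, 1: 1}
```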

@WillAyd
Member

WillAyd commented Jul 24, 2018

@kaijfox if you are up for it, PRs are always welcome. I believe the resolution was highlighted in some comments above, so if you can piece it together with benchmarks we would certainly take a look!

@mroeschke
Member

The performance looks comparable now. Could use some ASVs

In [2]: import numpy as np
   ...: import pandas as pd
   ...:
   ...: td = pd.DataFrame(np.random.randint(1000000, 100000000, (10000,100)), index=np.random.randint(200, size=(10000,))).astype('timedelta64[ns]')
   ...: i = td.copy(deep=True).astype('int64')
   ...:
   ...: %time temp1 = td.groupby(lambda x: x).sum() # 2.33 s
   ...: %time temp2 = i.groupby(lambda x: x).sum() # 15.6 ms
CPU times: user 3.01 ms, sys: 9 µs, total: 3.02 ms
Wall time: 3.02 ms
CPU times: user 7.68 ms, sys: 81 µs, total: 7.76 ms
Wall time: 7.77 ms

@mroeschke mroeschke added the Benchmark (Performance (ASV) benchmarks) and good first issue labels and removed the Groupby, Performance, and Timedelta labels Jun 19, 2021
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@josemayer
Contributor

take
