
DataFrame.groupby.sum() is extremely slow when dtype is timedelta64[ns] compared to int64. #20660

Closed
wezzman opened this issue Apr 11, 2018 · 11 comments · Fixed by #55273
Assignees
Labels
Benchmark Performance (ASV) benchmarks good first issue

Comments

@wezzman

wezzman commented Apr 11, 2018

Steps to demonstrate

import numpy as np
import pandas as pd

td = pd.DataFrame(
    np.random.randint(1000000, 100000000, (10000, 100)),
    index=np.random.randint(200, size=(10000,)),
).astype('timedelta64[ns]')
i = td.copy(deep=True).astype('int64')

%time temp1 = td.groupby(lambda x: x).sum() # 2.33 s
%time temp2 = i.groupby(lambda x: x).sum() # 15.6 ms

temp2 = temp2.astype('timedelta64[ns]')
assert((temp1 == temp2).values.all())

Problem description

When performing a summation on grouped 'timedelta64[ns]' data, there is a significant performance decrease compared to the same data interpreted as 'int64'.

Possibly related to #18053

Expected Behavior

It is my understanding that internally 'timedelta64[ns]' values are just 'int64' values interpreted as a count of nanoseconds. Shouldn't the summation performance be equal in that case?
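As a quick sanity check of that mental model (this snippet is my own illustration, not part of the original report), viewing a `timedelta64[ns]` array as `int64` recovers the raw nanosecond counts without copying:

```python
import numpy as np

# timedelta64[ns] values are stored as 64-bit nanosecond counts;
# a zero-copy view as int64 exposes the raw integers.
td = np.array([1_500, 2_000_000], dtype='timedelta64[ns]')
raw = td.view('int64')
print(raw.tolist())  # [1500, 2000000]
```

So the data layout is identical; only the dtype-dependent dispatch in groupby differs.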

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.22.0
pytest: 3.5.0
pip: 9.0.1
setuptools: 39.0.1
Cython: 0.28.1
numpy: 1.14.2
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.7.2
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.1
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.5
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger TomAugspurger added the Groupby, Performance (Memory or execution speed performance), and Timedelta (Timedelta data type) labels Apr 12, 2018
@TomAugspurger TomAugspurger added this to the Next Major Release milestone Apr 12, 2018
@TomAugspurger
Contributor

Thanks for the report.

cc @WillAyd if you have any pointers on where someone could get started here?

@WillAyd
Member

WillAyd commented Apr 12, 2018

Hmm, after a quick profile it appears that the timedelta data does not leverage the Cython function. The problem starts on the line below:

data = data.get_numeric_data(copy=False)

For timedelta that internal method returns an empty tuple, so block aggregation fails and the execution falls back to a much slower group-by-group summation.

Perhaps the easiest fix is to add numeric_only=False to the sum method below:

cls.sum = groupby_function('sum', 'add', np.sum, min_count=0)

Running this locally gets the examples back in line with one another:

In [12]: %timeit td.groupby(lambda x: x).sum()
8.53 ms ± 30.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [13]: %timeit i.groupby(lambda x: x).sum()
9.66 ms ± 30.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

It's not entirely clear to me what the overall purpose of numeric_only is, so long term maybe we don't even need it, but for the time being the above should work (cc @jreback)

FWIW we don't have any ASVs for timedelta, so this would be a good opportunity to add some.
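For context (my own sketch, not part of the original comment), an ASV-style benchmark for this case could look like the following; the class and method names are illustrative, not the names actually used in the pandas benchmark suite:

```python
import numpy as np
import pandas as pd

class GroupbySumTimedelta:
    """Illustrative ASV-style benchmark: names and structure are assumptions."""

    def setup(self):
        # Same shape as the repro in this issue: 10,000 rows, 100 columns,
        # grouped on ~200 distinct integer index labels.
        vals = np.random.randint(1_000_000, 100_000_000, size=(10_000, 100))
        idx = np.random.randint(200, size=10_000)
        self.td = pd.DataFrame(vals, index=idx).astype('timedelta64[ns]')
        self.int = self.td.astype('int64')

    def time_sum_timedelta(self):
        self.td.groupby(level=0).sum()

    def time_sum_int64(self):
        self.int.groupby(level=0).sum()
```

ASV calls `setup` before each timing run and times each `time_*` method, so comparing the two methods' results directly measures the dtype-dependent gap this issue describes.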

@wezzman are you interested in trying a PR for this?

@jreback
Contributor

jreback commented Apr 12, 2018

timedelta is not a numeric type (e.g. float, int); so this is correct.

certainly could add more benchmarks though.

@jreback
Contributor

jreback commented Apr 12, 2018

#18053 is different; it's about Period, which is currently aggregated using object dtype, not the underlying integer representation (@TomAugspurger is working on making this an EA, which would be much faster).

@WillAyd
Member

WillAyd commented Apr 12, 2018

Understood on the numeric type check, but you don't have any objection to relaxing the numeric_only requirement for sum, do you? I don't think it would change the outcome of the operation, just get you there faster.

@wezzman
Author

wezzman commented Apr 16, 2018

@WillAyd, I am unsure what you mean by a PR.

@TomAugspurger
Contributor

TomAugspurger commented Apr 18, 2018 via email

@kaijfox

kaijfox commented Jul 24, 2018

I know this is a fairly old thread, but I thought I'd mention that this is a problem for dtype bool too.
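For reference (my own check, not part of the original comment), a grouped sum over a bool column is a per-group count of True values, so it should benefit from the same fast path once the dtype is routed through the Cython aggregation:

```python
import pandas as pd

# Grouped sum over a bool column counts the True values in each group.
df = pd.DataFrame({'key': [0, 0, 1, 1],
                   'flag': [True, True, False, True]})
out = df.groupby('key')['flag'].sum()
print(out.to_dict())  # {0: 2, 1: 1}
```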

@WillAyd
Member

WillAyd commented Jul 24, 2018

@kaijfox if you are up for it, PRs are always welcome. I believe the resolution was highlighted in some comments above, so if you can piece it together with benchmarks we would certainly take a look!

@mroeschke
Member

The performance looks comparable now. Could use some ASVs

In [2]: import numpy as np
   ...: import pandas as pd
   ...:
   ...: td = pd.DataFrame(np.random.randint(1000000, 100000000, (10000,100)), index=np.random.randint(200, size=(10000,))).astype('timedelta64[ns]')
   ...: i = td.copy(deep=True).astype('int64')
   ...:
   ...: %time temp1 = td.groupby(lambda x: x).sum() # 2.33 s
   ...: %time temp2 = i.groupby(lambda x: x).sum() # 15.6 ms
CPU times: user 3.01 ms, sys: 9 µs, total: 3.02 ms
Wall time: 3.02 ms
CPU times: user 7.68 ms, sys: 81 µs, total: 7.76 ms
Wall time: 7.77 ms

@mroeschke mroeschke added the Benchmark (Performance (ASV) benchmarks) and good first issue labels and removed the Groupby, Performance, and Timedelta labels Jun 19, 2021
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@josemayer
Contributor

take
