Performance issue with pandas/core/common.py -> maybe_box_datetimelike #30520

ivan-vasilev · 2019-12-27T23:24:01Z

Code Sample, a copy-pastable example if possible

# The existing implementation is:
def maybe_box_datetimelike(value):
    # turn a datetime like into a Timestamp/timedelta as needed

    if isinstance(value, (np.datetime64, datetime)):
        value = tslibs.Timestamp(value)
    elif isinstance(value, (np.timedelta64, timedelta)):
        value = tslibs.Timedelta(value)

    return value

# Proposed improvement:
def maybe_box_datetimelike(value):
    # turn a datetime like into a Timestamp/timedelta as needed

    if isinstance(value, (np.datetime64, datetime)) and not isinstance(value, tslibs.Timestamp):
        value = tslibs.Timestamp(value)
    elif isinstance(value, (np.timedelta64, timedelta)):
        value = tslibs.Timedelta(value)

    return value

Problem description

This function determines whether value is of type (np.datetime64, datetime) and if so, converts it into tslibs.Timestamp. However, the class tslibs.Timestamp is already a subclass of datetime. Therefore, even if the object value is already of type tslibs.Timestamp, it will be needlessly converted one more time. This issue has large performance, when working with large dataframes, which contain datet time objects. This issue could be fixed by changing the condition
if isinstance(value, (np.datetime64, datetime)):
to:
if isinstance(value, (np.datetime64, datetime)) and not isinstance(value, tslibs.Timestamp):

The text was updated successfully, but these errors were encountered:

jbrockmendel · 2019-12-28T00:20:42Z

This issue has large performance

Can you post some profiling results for this?

ivan-vasilev · 2019-12-28T21:51:32Z

This issue has large performance

Can you post some profiling results for this?

The following is the profiler output of the execution of my program with the current (non-optimized) version of the maybe_box_datetimelike function. This code is doing some data processing over a large dataframe with timestamps (I cannot include the full code). As you can see, maybe_box_datetimelike takes the majority of the time:

30040320 function calls (29839798 primitive calls) in 80.258 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     57/1    0.001    0.000   80.259   80.259 {built-in method builtins.exec}
        1    0.000    0.000   80.258   80.258 <string>:1(<module>)
        1    0.010    0.010   80.258   80.258 data_replay.py:305(start)
     6462    0.080    0.000   79.543    0.012 data_replay.py:82(__next__)
  13175/6    0.004    0.000   78.931   13.155 {built-in method builtins.next}
        6    0.001    0.000   78.931   13.155 data_replay.py:235(__next__)
   619461    0.362    0.000   59.800    0.000 tz.py:821(utcoffset)
   621231    0.372    0.000   59.628    0.000 tz.py:738(_find_ttinfo)
   621231    1.251    0.000   59.000    0.000 tz.py:808(_resolve_ambiguous_time)
  1242462   51.925    0.000   53.904    0.000 tz.py:1809(_datetime_to_timestamp)
        6    0.000    0.000   50.203    8.367 frame.py:1262(to_dict)
        6    0.828    0.138   50.195    8.366 frame.py:1386(<listcomp>)
  2246466    4.011    0.000   47.117    0.000 common.py:85(maybe_box_datetimelike)

Next is the profiler output of exactly the same code, but with the optimized (according to the proposed fix) maybe_box_datetimelike function. Because the value parameter is not converted to tslibs.Timestamp unnecessarily, maybe_box_datetimelike takes negligible portion of the execution time:

22608254 function calls (22400396 primitive calls) in 34.749 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     57/1    0.001    0.000   34.750   34.750 {built-in method builtins.exec}
        1    0.000    0.000   34.750   34.750 <string>:1(<module>)
        1    0.012    0.012   34.750   34.750 data_replay.py:305(start)
     8240    0.105    0.000   33.267    0.004 data_replay.py:82(__next__)
13974/805    0.005    0.000   32.617    0.041 {built-in method builtins.next}
        6    0.001    0.000   32.614    5.436 data_replay.py:235(__next__)
        6    0.000    0.000   28.921    4.820 data_replay.py:33(__next__)
        1    0.006    0.006   27.374   27.374 ts_util.py:437(__next__)
        1    0.002    0.002   26.650   26.650 ts_util.py:364(add_modes)
        3    0.000    0.000   19.312    6.437 series.py:1188(__setitem__)
        3    0.000    0.000   19.306    6.435 series.py:1191(setitem)
        3    0.005    0.002   19.277    6.426 series.py:1261(_set_with)
        3    0.006    0.002   18.413    6.138 series.py:1303(_set_labels)
        4    0.001    0.000   18.314    4.578 multi.py:2412(get_indexer)
      443    1.048    0.002   17.613    0.040 {pandas._libs.tslibs.conversion.datetime_to_datetime64}
   224359    0.123    0.000   17.590    0.000 tz.py:738(_find_ttinfo)
   222301    0.119    0.000   17.514    0.000 tz.py:821(utcoffset)
   224359    0.409    0.000   17.382    0.000 tz.py:808(_resolve_ambiguous_time)
   448718   15.094    0.000   15.765    0.000 tz.py:1809(_datetime_to_timestamp)
        4    0.021    0.005    9.501    2.375 {method 'get_indexer' of 'pandas._libs.index.BaseMultiIndexCodesEngine' objects}
      476    0.003    0.000    9.150    0.019 cast.py:880(maybe_infer_to_datetimelike)
      431    0.005    0.000    9.137    0.021 cast.py:924(try_datetime)
1912/1865    0.024    0.000    8.965    0.005 base.py:277(__new__)
     1356    0.005    0.000    8.871    0.007 datetimes.py:263(__new__)
       36    0.000    0.000    8.855    0.246 categorical.py:340(__init__)
       10    0.000    0.000    8.853    0.885 multi.py:364(from_arrays)
       10    0.000    0.000    8.852    0.885 categorical.py:2794(_factorize_from_iterables)
       30    0.000    0.000    8.852    0.295 categorical.py:2816(<genexpr>)
       20    0.000    0.000    8.852    0.443 categorical.py:2754(_factorize_from_iterable)
     1370    0.005    0.000    8.815    0.006 datetimes.py:422(_from_sequence)
        3    0.000    0.000    8.808    2.936 multi.py:432(from_tuples)
     1370    0.011    0.000    8.801    0.006 datetimes.py:1784(sequence_to_dt64ns)
       46    0.000    0.000    8.763    0.190 base.py:2957(get_indexer)
     3689    0.003    0.000    8.733    0.002 base.py:5707(ensure_index)
       14    0.000    0.000    8.729    0.623 datetimes.py:1927(objects_to_datetime64ns)
   224359    0.352    0.000    8.448    0.000 tz.py:779(is_ambiguous)
   224359    0.244    0.000    8.402    0.000 tz.py:712(_find_last_transition)
        4    0.000    0.000    4.713    1.178 groupby.py:695(apply)
        4    0.000    0.000    4.712    1.178 groupby.py:741(_python_apply_general)
        3    0.000    0.000    4.692    1.564 ts_util.py:341(compute_mode)
        4    0.000    0.000    4.626    1.157 ops.py:189(apply)
      202    3.993    0.020    3.993    0.020 {pandas._libs.tslib.ints_to_pydatetime}
  1629148    0.149    0.000    3.670    0.000 datetimes.py:683(__iter__)
       19    0.028    0.001    3.410    0.179 generic.py:3524(take)
9961/9843    0.208    0.000    3.405    0.000 {built-in method numpy.array}
4939/4872    0.002    0.000    3.385    0.001 _asarray.py:16(asarray)
        6    0.000    0.000    3.382    0.564 frame.py:1262(to_dict)
        6    0.371    0.062    3.375    0.563 frame.py:1386(<listcomp>)
      511    0.053    0.000    3.341    0.007 multi.py:3112(equals)
       22    0.000    0.000    3.253    0.148 datetimes.py:676(__array__)
       22    0.099    0.004    3.253    0.148 datetimelike.py:396(__array__)
        4    0.000    0.000    3.022    0.755 ts_util.py:22(set_periods)
        4    0.000    0.000    3.003    0.751 frame.py:6737(apply)
        4    0.000    0.000    3.003    0.751 apply.py:144(get_result)
        4    0.000    0.000    3.003    0.751 apply.py:261(apply_standard)
        4    0.003    0.001    2.998    0.750 apply.py:297(apply_series_generator)
      434    0.004    0.000    2.689    0.006 indexing.py:199(__setitem__)
        7    0.000    0.000    2.673    0.382 ops.py:917(_get_sorted_data)
      428    0.004    0.000    2.553    0.006 ts_util.py:41(a)
        6    0.001    0.000    1.942    0.324 ts_util.py:353(_price_mode)
      434    0.003    0.000    1.918    0.004 indexing.py:168(_get_setitem_indexer)
  2246466    0.509    0.000    1.743    0.000 frame.py:1386(<genexpr>)
      863    0.006    0.000    1.681    0.002 indexing.py:1205(_convert_to_indexer)
        1    0.000    0.000    1.631    1.631 dispatcher.py:326(_compile_for_args)
    287/1    0.001    0.000    1.631    1.631 compiler_lock.py:29(_acquire_compile_lock)
     50/1    0.000    0.000    1.631    1.631 dispatcher.py:744(compile)
     12/1    0.000    0.000    1.630    1.630 dispatcher.py:76(compile)
     12/1    0.000    0.000    1.630    1.630 dispatcher.py:83(_compile_cached)
     12/1    0.000    0.000    1.630    1.630 dispatcher.py:98(_compile_core)
     12/1    0.000    0.000    1.630    1.630 compiler.py:502(compile_extra)
      501    0.003    0.000    1.613    0.003 base.py:1185(__iter__)
       52    0.002    0.000    1.599    0.031 multi.py:614(_values)
       52    0.000    0.000    1.597    0.031 multi.py:1340(values)
     12/1    0.000    0.000    1.562    1.562 compiler.py:313(compile_extra)
     12/1    0.000    0.000    1.562    1.562 compiler.py:380(_compile_bytecode)
     13/1    0.000    0.000    1.562    1.562 compiler.py:343(_compile_core)
     13/1    0.001    0.000    1.562    1.562 compiler_machinery.py:305(run)
   236/20    0.003    0.000    1.561    0.078 compiler_machinery.py:263(_runPass)
   708/60    0.001    0.000    1.561    0.026 compiler_machinery.py:267(check)
        1    0.003    0.003    1.545    1.545 ts_util.py:252(__next__)
      428    0.003    0.000    1.524    0.004 indexing.py:247(_convert_tuple)
8341/8240    0.036    0.000    1.470    0.000 events.py:55(__call__)
       25    0.072    0.003    1.469    0.059 {pandas._libs.lib.map_infer}
       21    0.000    0.000    1.469    0.070 datetimelike.py:346(_box_values)
       13    0.000    0.000    1.465    0.113 datetimelike.py:686(astype)
       19    0.000    0.000    1.464    0.077 datetimes.py:706(astype)
       13    0.000    0.000    1.463    0.113 datetimelike.py:516(astype)
   175297    1.247    0.000    1.412    0.000 datetimes.py:603(<lambda>)
  2246466    0.948    0.000    1.405    0.000 common.py:85(maybe_box_datetimelike)

If you wish I can provide a specific code, which can reproduce the issue. Nevertheless, even if you don't see this as a performance issue, I think it still makes sense to fix it, as it is not necessary to convert value to tslibs.Timestamp unnecessarily, since value is already an instance of this class. Looking forward to your response.

jbrockmendel · 2019-12-28T21:54:04Z

That's a pretty compelling performance difference, thanks. PR would be welcome.

jreback · 2019-12-28T22:04:38Z

pls show a specific reproducible example that exhibits the perf issue - it will be needed for asvs in any event

jschendel · 2019-12-29T16:32:56Z

Is the performance issue here really with maybe_box_datetimelike? Or is the performance of maybe_box_datetimelike just a symptom of the Timestamp constructor being slow when passed a Timestamp?

From some quick adhoc timings it looks like >70% of the time spent in maybe_box_datetimelike is on Timestamp(Timestamp('2020')):

In [1]: import pandas as pd; pd.__version__
Out[1]: '0.26.0.dev0+1469.ge817ffff3'

In [2]: ts = pd.Timestamp('2020')

In [3]: %timeit pd.Timestamp(ts)
856 ns ± 12.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit pd.core.common.maybe_box_datetimelike(ts)
1.18 µs ± 11.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Which looks like can be improved by a simple isinstance check:

In [5]: def maybe_coerce_ts(ts): 
   ...:     if not isinstance(ts, pd.Timestamp): 
   ...:         return pd.Timestamp(ts) 
   ...:     return ts 
   ...:

In [6]: %timeit maybe_coerce_ts(ts)
118 ns ± 0.631 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

So, to me, it looks like the fix should go in the Timestamp constructor itself.

jbrockmendel · 2019-12-29T17:57:25Z

So, to me, it looks like the fix should go in the Timestamp constructor itself.

This seems reasonable. We would have to check that there aren't any other arguments passed, not sure what kind of hit that would entail. Might pick up some ground by not needing to create a new object.

jschendel · 2019-12-29T18:31:55Z

We would have to check that there aren't any other arguments passed

Ah, yeah, that makes it a little more convoluted. I'm fine with making both suggested changes, to the constructor and maybe_box_datetimelike, since we probably won't be able to get all the perf improvements of the isinstance check in maybe_box_datetimelike pushed back to the constructor. I'll create a separate issue for the constructor.

ivan-vasilev · 2019-12-29T23:02:29Z

I'm currently trying to write the ASV performance test for this pull request. If I understand correctly, the test should be located under pandas/asv_bench/benchmarks/. Are there any other requirements about this benchmark? E.g. should I place it an existing benchmark, or alternatively, should I create a new file? Thanks in advance.

jbrockmendel · 2019-12-30T01:31:06Z

E.g. should I place it an existing benchmark, or alternatively, should I create a new file? Thanks in advance.

Depends a little bit on what you're profiling. If youre profiling only the Timestamp constructor, then it goes in benchmarks/tslibs/timestamp.py::TimestampConstruction. If you're profiling maybe_box_datetimelike, maybe inference.py? You've got some discretion in this case.

jreback · 2020-01-24T04:38:29Z

this is subsumed by #30676 (though if you still think there is a perf benefit after that certainly ping)

jbrockmendel added the Performance Memory or execution speed performance label Dec 28, 2019

ivan-vasilev added a commit to ivan-vasilev/pandas that referenced this issue Dec 28, 2019

Issue pandas-dev#30520

74f4b8b

ivan-vasilev mentioned this issue Dec 28, 2019

Performance of maybe_box_datetimelike #30520 #30531

Closed

5 tasks

ivan-vasilev added a commit to ivan-vasilev/pandas that referenced this issue Dec 28, 2019

Issue pandas-dev#30520

dcb45ed

ivan-vasilev added a commit to ivan-vasilev/pandas that referenced this issue Dec 28, 2019

Issue pandas-dev#30520 works for timedeltas as well

6a2511c

ivan-vasilev added a commit to ivan-vasilev/pandas that referenced this issue Dec 29, 2019

Issue pandas-dev#30520 check for NaT values

54cb25e

jschendel mentioned this issue Dec 29, 2019

PERF: Timestamp/Timedelta constructors when passed a Timestamp/Timedelta #30543

Closed

ivan-vasilev added a commit to ivan-vasilev/pandas that referenced this issue Dec 29, 2019

Issue pandas-dev#30520 more readable code + specifically check for NaT

524150c

jreback closed this as completed Jan 24, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance issue with pandas/core/common.py -> maybe_box_datetimelike #30520

Performance issue with pandas/core/common.py -> maybe_box_datetimelike #30520

ivan-vasilev commented Dec 27, 2019

jbrockmendel commented Dec 28, 2019

ivan-vasilev commented Dec 28, 2019

jbrockmendel commented Dec 28, 2019

jreback commented Dec 28, 2019

jschendel commented Dec 29, 2019

jbrockmendel commented Dec 29, 2019

jschendel commented Dec 29, 2019

ivan-vasilev commented Dec 29, 2019 •

edited

Loading

jbrockmendel commented Dec 30, 2019

jreback commented Jan 24, 2020

Performance issue with pandas/core/common.py -> maybe_box_datetimelike #30520

Performance issue with pandas/core/common.py -> maybe_box_datetimelike #30520

Comments

ivan-vasilev commented Dec 27, 2019

Code Sample, a copy-pastable example if possible

Problem description

jbrockmendel commented Dec 28, 2019

ivan-vasilev commented Dec 28, 2019

jbrockmendel commented Dec 28, 2019

jreback commented Dec 28, 2019

jschendel commented Dec 29, 2019

jbrockmendel commented Dec 29, 2019

jschendel commented Dec 29, 2019

ivan-vasilev commented Dec 29, 2019 • edited Loading

jbrockmendel commented Dec 30, 2019

jreback commented Jan 24, 2020

ivan-vasilev commented Dec 29, 2019 •

edited

Loading