Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance issue with pandas/core/common.py -> maybe_box_datetimelike #30520

Closed
ivan-vasilev opened this issue Dec 27, 2019 · 10 comments
Closed
Labels
Performance Memory or execution speed performance

Comments

@ivan-vasilev
Copy link

Code Sample, a copy-pastable example if possible

# The existing implementation is:
def maybe_box_datetimelike(value):
    # turn a datetime like into a Timestamp/timedelta as needed

    if isinstance(value, (np.datetime64, datetime)):
        value = tslibs.Timestamp(value)
    elif isinstance(value, (np.timedelta64, timedelta)):
        value = tslibs.Timedelta(value)

    return value

# Proposed improvement:
def maybe_box_datetimelike(value):
    # turn a datetime like into a Timestamp/timedelta as needed

    if isinstance(value, (np.datetime64, datetime)) and not isinstance(value, tslibs.Timestamp):
        value = tslibs.Timestamp(value)
    elif isinstance(value, (np.timedelta64, timedelta)):
        value = tslibs.Timedelta(value)

    return value

Problem description

This function determines whether value is of type (np.datetime64, datetime) and if so, converts it into tslibs.Timestamp. However, the class tslibs.Timestamp is already a subclass of datetime. Therefore, even if the object value is already of type tslibs.Timestamp, it will be needlessly converted one more time. This issue has large performance, when working with large dataframes, which contain datet time objects. This issue could be fixed by changing the condition
if isinstance(value, (np.datetime64, datetime)):
to:
if isinstance(value, (np.datetime64, datetime)) and not isinstance(value, tslibs.Timestamp):

@jbrockmendel
Copy link
Member

This issue has large performance

Can you post some profiling results for this?

@ivan-vasilev
Copy link
Author

This issue has large performance

Can you post some profiling results for this?

The following is the profiler output of the execution of my program with the current (non-optimized) version of the maybe_box_datetimelike function. This code is doing some data processing over a large dataframe with timestamps (I cannot include the full code). As you can see, maybe_box_datetimelike takes the majority of the time:

30040320 function calls (29839798 primitive calls) in 80.258 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     57/1    0.001    0.000   80.259   80.259 {built-in method builtins.exec}
        1    0.000    0.000   80.258   80.258 <string>:1(<module>)
        1    0.010    0.010   80.258   80.258 data_replay.py:305(start)
     6462    0.080    0.000   79.543    0.012 data_replay.py:82(__next__)
  13175/6    0.004    0.000   78.931   13.155 {built-in method builtins.next}
        6    0.001    0.000   78.931   13.155 data_replay.py:235(__next__)
   619461    0.362    0.000   59.800    0.000 tz.py:821(utcoffset)
   621231    0.372    0.000   59.628    0.000 tz.py:738(_find_ttinfo)
   621231    1.251    0.000   59.000    0.000 tz.py:808(_resolve_ambiguous_time)
  1242462   51.925    0.000   53.904    0.000 tz.py:1809(_datetime_to_timestamp)
        6    0.000    0.000   50.203    8.367 frame.py:1262(to_dict)
        6    0.828    0.138   50.195    8.366 frame.py:1386(<listcomp>)
  2246466    4.011    0.000   47.117    0.000 common.py:85(maybe_box_datetimelike)

Next is the profiler output of exactly the same code, but with the optimized (according to the proposed fix) maybe_box_datetimelike function. Because the value parameter is not converted to tslibs.Timestamp unnecessarily, maybe_box_datetimelike takes negligible portion of the execution time:

22608254 function calls (22400396 primitive calls) in 34.749 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     57/1    0.001    0.000   34.750   34.750 {built-in method builtins.exec}
        1    0.000    0.000   34.750   34.750 <string>:1(<module>)
        1    0.012    0.012   34.750   34.750 data_replay.py:305(start)
     8240    0.105    0.000   33.267    0.004 data_replay.py:82(__next__)
13974/805    0.005    0.000   32.617    0.041 {built-in method builtins.next}
        6    0.001    0.000   32.614    5.436 data_replay.py:235(__next__)
        6    0.000    0.000   28.921    4.820 data_replay.py:33(__next__)
        1    0.006    0.006   27.374   27.374 ts_util.py:437(__next__)
        1    0.002    0.002   26.650   26.650 ts_util.py:364(add_modes)
        3    0.000    0.000   19.312    6.437 series.py:1188(__setitem__)
        3    0.000    0.000   19.306    6.435 series.py:1191(setitem)
        3    0.005    0.002   19.277    6.426 series.py:1261(_set_with)
        3    0.006    0.002   18.413    6.138 series.py:1303(_set_labels)
        4    0.001    0.000   18.314    4.578 multi.py:2412(get_indexer)
      443    1.048    0.002   17.613    0.040 {pandas._libs.tslibs.conversion.datetime_to_datetime64}
   224359    0.123    0.000   17.590    0.000 tz.py:738(_find_ttinfo)
   222301    0.119    0.000   17.514    0.000 tz.py:821(utcoffset)
   224359    0.409    0.000   17.382    0.000 tz.py:808(_resolve_ambiguous_time)
   448718   15.094    0.000   15.765    0.000 tz.py:1809(_datetime_to_timestamp)
        4    0.021    0.005    9.501    2.375 {method 'get_indexer' of 'pandas._libs.index.BaseMultiIndexCodesEngine' objects}
      476    0.003    0.000    9.150    0.019 cast.py:880(maybe_infer_to_datetimelike)
      431    0.005    0.000    9.137    0.021 cast.py:924(try_datetime)
1912/1865    0.024    0.000    8.965    0.005 base.py:277(__new__)
     1356    0.005    0.000    8.871    0.007 datetimes.py:263(__new__)
       36    0.000    0.000    8.855    0.246 categorical.py:340(__init__)
       10    0.000    0.000    8.853    0.885 multi.py:364(from_arrays)
       10    0.000    0.000    8.852    0.885 categorical.py:2794(_factorize_from_iterables)
       30    0.000    0.000    8.852    0.295 categorical.py:2816(<genexpr>)
       20    0.000    0.000    8.852    0.443 categorical.py:2754(_factorize_from_iterable)
     1370    0.005    0.000    8.815    0.006 datetimes.py:422(_from_sequence)
        3    0.000    0.000    8.808    2.936 multi.py:432(from_tuples)
     1370    0.011    0.000    8.801    0.006 datetimes.py:1784(sequence_to_dt64ns)
       46    0.000    0.000    8.763    0.190 base.py:2957(get_indexer)
     3689    0.003    0.000    8.733    0.002 base.py:5707(ensure_index)
       14    0.000    0.000    8.729    0.623 datetimes.py:1927(objects_to_datetime64ns)
   224359    0.352    0.000    8.448    0.000 tz.py:779(is_ambiguous)
   224359    0.244    0.000    8.402    0.000 tz.py:712(_find_last_transition)
        4    0.000    0.000    4.713    1.178 groupby.py:695(apply)
        4    0.000    0.000    4.712    1.178 groupby.py:741(_python_apply_general)
        3    0.000    0.000    4.692    1.564 ts_util.py:341(compute_mode)
        4    0.000    0.000    4.626    1.157 ops.py:189(apply)
      202    3.993    0.020    3.993    0.020 {pandas._libs.tslib.ints_to_pydatetime}
  1629148    0.149    0.000    3.670    0.000 datetimes.py:683(__iter__)
       19    0.028    0.001    3.410    0.179 generic.py:3524(take)
9961/9843    0.208    0.000    3.405    0.000 {built-in method numpy.array}
4939/4872    0.002    0.000    3.385    0.001 _asarray.py:16(asarray)
        6    0.000    0.000    3.382    0.564 frame.py:1262(to_dict)
        6    0.371    0.062    3.375    0.563 frame.py:1386(<listcomp>)
      511    0.053    0.000    3.341    0.007 multi.py:3112(equals)
       22    0.000    0.000    3.253    0.148 datetimes.py:676(__array__)
       22    0.099    0.004    3.253    0.148 datetimelike.py:396(__array__)
        4    0.000    0.000    3.022    0.755 ts_util.py:22(set_periods)
        4    0.000    0.000    3.003    0.751 frame.py:6737(apply)
        4    0.000    0.000    3.003    0.751 apply.py:144(get_result)
        4    0.000    0.000    3.003    0.751 apply.py:261(apply_standard)
        4    0.003    0.001    2.998    0.750 apply.py:297(apply_series_generator)
      434    0.004    0.000    2.689    0.006 indexing.py:199(__setitem__)
        7    0.000    0.000    2.673    0.382 ops.py:917(_get_sorted_data)
      428    0.004    0.000    2.553    0.006 ts_util.py:41(a)
        6    0.001    0.000    1.942    0.324 ts_util.py:353(_price_mode)
      434    0.003    0.000    1.918    0.004 indexing.py:168(_get_setitem_indexer)
  2246466    0.509    0.000    1.743    0.000 frame.py:1386(<genexpr>)
      863    0.006    0.000    1.681    0.002 indexing.py:1205(_convert_to_indexer)
        1    0.000    0.000    1.631    1.631 dispatcher.py:326(_compile_for_args)
    287/1    0.001    0.000    1.631    1.631 compiler_lock.py:29(_acquire_compile_lock)
     50/1    0.000    0.000    1.631    1.631 dispatcher.py:744(compile)
     12/1    0.000    0.000    1.630    1.630 dispatcher.py:76(compile)
     12/1    0.000    0.000    1.630    1.630 dispatcher.py:83(_compile_cached)
     12/1    0.000    0.000    1.630    1.630 dispatcher.py:98(_compile_core)
     12/1    0.000    0.000    1.630    1.630 compiler.py:502(compile_extra)
      501    0.003    0.000    1.613    0.003 base.py:1185(__iter__)
       52    0.002    0.000    1.599    0.031 multi.py:614(_values)
       52    0.000    0.000    1.597    0.031 multi.py:1340(values)
     12/1    0.000    0.000    1.562    1.562 compiler.py:313(compile_extra)
     12/1    0.000    0.000    1.562    1.562 compiler.py:380(_compile_bytecode)
     13/1    0.000    0.000    1.562    1.562 compiler.py:343(_compile_core)
     13/1    0.001    0.000    1.562    1.562 compiler_machinery.py:305(run)
   236/20    0.003    0.000    1.561    0.078 compiler_machinery.py:263(_runPass)
   708/60    0.001    0.000    1.561    0.026 compiler_machinery.py:267(check)
        1    0.003    0.003    1.545    1.545 ts_util.py:252(__next__)
      428    0.003    0.000    1.524    0.004 indexing.py:247(_convert_tuple)
8341/8240    0.036    0.000    1.470    0.000 events.py:55(__call__)
       25    0.072    0.003    1.469    0.059 {pandas._libs.lib.map_infer}
       21    0.000    0.000    1.469    0.070 datetimelike.py:346(_box_values)
       13    0.000    0.000    1.465    0.113 datetimelike.py:686(astype)
       19    0.000    0.000    1.464    0.077 datetimes.py:706(astype)
       13    0.000    0.000    1.463    0.113 datetimelike.py:516(astype)
   175297    1.247    0.000    1.412    0.000 datetimes.py:603(<lambda>)
  2246466    0.948    0.000    1.405    0.000 common.py:85(maybe_box_datetimelike)

If you wish I can provide a specific code, which can reproduce the issue. Nevertheless, even if you don't see this as a performance issue, I think it still makes sense to fix it, as it is not necessary to convert value to tslibs.Timestamp unnecessarily, since value is already an instance of this class. Looking forward to your response.

@jbrockmendel
Copy link
Member

That's a pretty compelling performance difference, thanks. PR would be welcome.

@jbrockmendel jbrockmendel added the Performance Memory or execution speed performance label Dec 28, 2019
@jreback
Copy link
Contributor

jreback commented Dec 28, 2019

pls show a specific reproducible example that exhibits the perf issue - it will be needed for asvs in any event

ivan-vasilev added a commit to ivan-vasilev/pandas that referenced this issue Dec 28, 2019
ivan-vasilev added a commit to ivan-vasilev/pandas that referenced this issue Dec 28, 2019
ivan-vasilev added a commit to ivan-vasilev/pandas that referenced this issue Dec 28, 2019
ivan-vasilev added a commit to ivan-vasilev/pandas that referenced this issue Dec 29, 2019
@jschendel
Copy link
Member

Is the performance issue here really with maybe_box_datetimelike? Or is the performance of maybe_box_datetimelike just a symptom of the Timestamp constructor being slow when passed a Timestamp?

From some quick adhoc timings it looks like >70% of the time spent in maybe_box_datetimelike is on Timestamp(Timestamp('2020')):

In [1]: import pandas as pd; pd.__version__
Out[1]: '0.26.0.dev0+1469.ge817ffff3'

In [2]: ts = pd.Timestamp('2020')

In [3]: %timeit pd.Timestamp(ts)
856 ns ± 12.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit pd.core.common.maybe_box_datetimelike(ts)
1.18 µs ± 11.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Which looks like can be improved by a simple isinstance check:

In [5]: def maybe_coerce_ts(ts): 
   ...:     if not isinstance(ts, pd.Timestamp): 
   ...:         return pd.Timestamp(ts) 
   ...:     return ts 
   ...:

In [6]: %timeit maybe_coerce_ts(ts)
118 ns ± 0.631 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

So, to me, it looks like the fix should go in the Timestamp constructor itself.

@jbrockmendel
Copy link
Member

So, to me, it looks like the fix should go in the Timestamp constructor itself.

This seems reasonable. We would have to check that there aren't any other arguments passed, not sure what kind of hit that would entail. Might pick up some ground by not needing to create a new object.

@jschendel
Copy link
Member

We would have to check that there aren't any other arguments passed

Ah, yeah, that makes it a little more convoluted. I'm fine with making both suggested changes, to the constructor and maybe_box_datetimelike, since we probably won't be able to get all the perf improvements of the isinstance check in maybe_box_datetimelike pushed back to the constructor. I'll create a separate issue for the constructor.

@ivan-vasilev
Copy link
Author

ivan-vasilev commented Dec 29, 2019

I'm currently trying to write the ASV performance test for this pull request. If I understand correctly, the test should be located under pandas/asv_bench/benchmarks/. Are there any other requirements about this benchmark? E.g. should I place it an existing benchmark, or alternatively, should I create a new file? Thanks in advance.

@jbrockmendel
Copy link
Member

E.g. should I place it an existing benchmark, or alternatively, should I create a new file? Thanks in advance.

Depends a little bit on what you're profiling. If youre profiling only the Timestamp constructor, then it goes in benchmarks/tslibs/timestamp.py::TimestampConstruction. If you're profiling maybe_box_datetimelike, maybe inference.py? You've got some discretion in this case.

@jreback
Copy link
Contributor

jreback commented Jan 24, 2020

this is subsumed by #30676 (though if you still think there is a perf benefit after that certainly ping)

@jreback jreback closed this as completed Jan 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants