Selecting datetimes within a certain range (days) without having to specify the start and end dates precisely (hours or higher) #243

Closed
chuaxr opened this issue Dec 2, 2017 · 8 comments


@chuaxr

chuaxr commented Dec 2, 2017

I generally compute statistics over the last ten days of output. Ideally, I would just need to specify up to the precision of the day when creating the datetime object for the start and end dates. However, when dealing with hourly output, I needed to specify up to the hour of the starting date:

default_start_date=datetime.datetime(2001, 7, 20, 1),

I started dealing with high frequency (5 min) output, which meant that specifying a default start date based on the hourly data would omit some of the data from the high frequency run. For some reason, there are some rounding errors which make it impossible to specify the start time precisely with datetime (which stops at microsecond precision). (See error message below, although specifying from datetime.datetime(2001,6,30,1,0,0) to datetime.datetime(2001,7,10,0,0,0) actually works.)

Is it possible to select values within a range, rather than having to specify the start and end dates precisely?

Traceback (most recent call last):
  File "/nbhome/xrc/anaconda2/envs/py361/lib/python3.6/site-packages/aospy/automate.py", line 253, in _compute_or_skip_on_error
    return calc.compute(**compute_kwargs)
  File "/nbhome/xrc/anaconda2/envs/py361/lib/python3.6/site-packages/aospy/calc.py", line 617, in compute
    self.end_date),
  File "/nbhome/xrc/anaconda2/envs/py361/lib/python3.6/site-packages/aospy/calc.py", line 434, in _get_all_data
    for n, var in enumerate(self.variables)]
  File "/nbhome/xrc/anaconda2/envs/py361/lib/python3.6/site-packages/aospy/calc.py", line 434, in <listcomp>
    for n, var in enumerate(self.variables)]
  File "/nbhome/xrc/anaconda2/envs/py361/lib/python3.6/site-packages/aospy/calc.py", line 386, in _get_input_data
    **self.data_loader_attrs)
  File "/nbhome/xrc/anaconda2/envs/py361/lib/python3.6/site-packages/aospy/data_loader.py", line 273, in load_variable
    np.datetime64(end_date_xarray)).load()
  File "/nbhome/xrc/anaconda2/envs/py361/lib/python3.6/site-packages/aospy/utils/times.py", line 451, in sel_time
    _assert_has_data_for_time(da, start_date, end_date)
  File "/nbhome/xrc/anaconda2/envs/py361/lib/python3.6/site-packages/aospy/utils/times.py", line 422, in _assert_has_data_for_time
    da_start, da_end)
AssertionError: Data does not exist for requested time range: 2001-06-30T00:05:00.990857 to 2001-07-10T00:00:00.000000; found data from time range: 2001-06-30T00:05:00.990857216 to 2001-07-10T00:00:00.660471808.
@spencerkclark
Collaborator

spencerkclark commented Dec 2, 2017

@chuaxr right now this is not possible in aospy. Ideally, you'd be able to use "partial datetime string indexing" here. This would mean that for all of your runs you could specify the strings '2001-06-30' and '2001-07-10' as time bounds for all output frequencies; see here for some examples for how this works in pandas.

But the way date ranges are currently handled in aospy is limited by upstream issues. In addition to #208, one of my goals for the end of this month, leading into early next year, is to finally return to my work on pydata/xarray#1252. With that in xarray, we'd be able to simplify a lot of internal logic within aospy and expose really convenient features like partial datetime string indexing to users.
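
For reference, a minimal sketch of what partial datetime string indexing looks like in pandas. The series and its date range are made up for illustration; only the two day-precision strings come from the comment above.

```python
import pandas as pd

# Hypothetical 5-minute series that extends a day past each requested bound.
times = pd.date_range("2001-06-29", "2001-07-11", freq="5min")
series = pd.Series(range(len(times)), index=times)

# Day-precision strings select from the start of the first day
# through the end of the last day, whatever the sampling frequency.
subset = series.loc["2001-06-30":"2001-07-10"]
first, last = subset.index[0], subset.index[-1]
print(first, last)
```

The same two strings would work unchanged for hourly output, which is what makes this style of indexing attractive here.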

@chuaxr
Author

chuaxr commented Dec 3, 2017

Thanks for the quick reply @spencerkclark. In the meantime, is there a workaround that I could use to specify the start time for the 5 minute file? In this case the file exactly contains my desired time. There seems to be an option to do that in utils/times.py which has to do with a raw_data attribute, although it's not obvious how I could switch it on.

@spencerkclark
Collaborator

My apologies, I think I may have misunderstood your issue. This is quite a severe bind! Outside of changing the source code of aospy, the only way I can think of addressing this would be to change the first value in the time series to one that you can actually represent in a datetime with microsecond precision. But that would be messy. See below for the details regarding a cleaner fix (or an alternative quick fix) within the source code.

Is it possible to select values within a range, rather than having to specify the start and end dates precisely?

Yes, fundamentally aospy does select data based on time ranges. However, a while back (as part of #90) we introduced a strict check to make sure that all data for a requested time range was present in a data set before doing a calculation (the rationale for this is described in #72). Partial datetime string indexing, while nice, wouldn't help you here.

I'm pretty sure the issue stems from the decoding of the datetimes from your WRF run. The underlying raw time series (encoded as a series of floats representing some unit of time since a certain date) must not be encoded to enough precision to be precisely decoded into datetimes to nanosecond precision.
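
The mismatch can be reproduced with the timestamps from the error message. This is only a sketch of the comparison, not the aospy code itself: a `datetime.datetime` stops at microseconds, so the requested start falls a few hundred nanoseconds before the first decoded time.

```python
import datetime
import numpy as np

# First and last decoded times, copied from the error message (nanosecond precision).
da_start = np.datetime64("2001-06-30T00:05:00.990857216", "ns")
da_end = np.datetime64("2001-07-10T00:00:00.660471808", "ns")

# The closest values a datetime.datetime can express (microsecond precision).
requested_start = np.datetime64(datetime.datetime(2001, 6, 30, 0, 5, 0, 990857))
requested_end = np.datetime64(datetime.datetime(2001, 7, 10, 0, 0, 0))

start_ok = bool(requested_start >= da_start)  # False: 216 ns too early
end_ok = bool(requested_end <= da_end)        # True
print(start_ok, end_ok)
```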

It is clear that having the strict check advised by #72 is not always desirable. I think it would be beneficial to have this as an option in each DataLoader (by default the strict check would be on) where one could relax it to allow for edge cases like this. This would essentially just be a flag to say whether utils.times._assert_has_data_for_time is called within utils.times.sel_time.

If you're in need of a really quick fix, you could just comment out line 451 in utils.times.sel_time, but this would eliminate the strict check for all your calculations (but I'm pretty sure it wouldn't change any of your results).

@spencerkclark
Collaborator

spencerkclark commented Dec 3, 2017

Perhaps a better fix would be to add a tolerance to the strict check. That is, change:

range_exists = start_date >= da_start and end_date <= da_end

in utils.times._assert_has_data_for_time to:

range_exists = start_date >= (da_start - tol) and end_date <= (da_end + tol)

where tol is a np.timedelta64 object with some default value (say one second) and could be modified on a DataLoader-by-DataLoader basis if desired. This sticks within the spirit of #72, but allows for some wiggle room (which is needed in the case of this issue).
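
A standalone sketch of the tolerant check, using the timestamps from the error message. The helper name `range_exists_with_tol` and its signature are hypothetical and are not part of aospy; only the comparison itself comes from the proposal above.

```python
import numpy as np

def range_exists_with_tol(start_date, end_date, da_start, da_end,
                          tol=np.timedelta64(1, "s")):
    """Hypothetical helper: widen the data's bounds by tol before comparing."""
    return bool(start_date >= (da_start - tol) and end_date <= (da_end + tol))

# Bounds found in the data, copied from the error message.
da_start = np.datetime64("2001-06-30T00:05:00.990857216", "ns")
da_end = np.datetime64("2001-07-10T00:00:00.660471808", "ns")

# Requested bounds, truncated to microsecond precision.
start_date = np.datetime64("2001-06-30T00:05:00.990857", "us")
end_date = np.datetime64("2001-07-10T00:00:00.000000", "us")

strict = range_exists_with_tol(start_date, end_date, da_start, da_end,
                               tol=np.timedelta64(0, "s"))
relaxed = range_exists_with_tol(start_date, end_date, da_start, da_end)
print(strict, relaxed)
```

With a zero tolerance the check fails as in the traceback; with the one-second default it passes.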

@chuaxr
Author

chuaxr commented Dec 3, 2017

Thanks for the tips. Your suggestion (see below) did work, and I also realized that casting the time coordinate to float in the preprocessing step also addressed the issue.

tol = np.timedelta64(1,'m')
range_exists = start_date >= (da_start - tol) and end_date <= (da_end + tol)

Although I am now confused about whether it is possible to specify more than one date range for a Run with two different output frequencies. Ideally, I would like the partial datetime indexing solution. I also would prefer the ability to specify different ranges that depend on the time frequency (e.g. start at 5 minutes for the 5min output and start at 1 hour for the 1hr output) than manually toggling the date range for each calculation (which is what I am currently doing). Could that be specified in the data loader?

@spencerkclark
Collaborator

spencerkclark commented Dec 3, 2017

We'll need to think carefully about how partial string indexing would fit in with this check, but that is something that I think would be a huge benefit for a number of use cases (including this one). That said, that's a ways down the road.

As a quick fix, I think you should just be able to increase the tolerance (maybe to something like a day). Then you should be able to specify a range of datetime(2001, 6, 30) to datetime(2001, 7, 10) for all of your output frequencies and aospy shouldn't complain.
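
A rough illustration of why a one-day tolerance would let a single day-precision range pass for both output frequencies. The 5-minute bounds are taken from the error message; the hourly bounds are an assumption based on the range reported to work earlier in the thread.

```python
import datetime
import numpy as np

tol = np.timedelta64(1, "D")
start_date = np.datetime64(datetime.datetime(2001, 6, 30))
end_date = np.datetime64(datetime.datetime(2001, 7, 10))

# (first time, last time) per output frequency; the hourly values are assumed.
data_bounds = {
    "5min": (np.datetime64("2001-06-30T00:05:00.990857216"),
             np.datetime64("2001-07-10T00:00:00.660471808")),
    "1hr": (np.datetime64("2001-06-30T01:00:00"),
            np.datetime64("2001-07-10T00:00:00")),
}

# Apply the tolerant comparison to each frequency's bounds.
passes = {
    freq: bool(start_date >= (da_start - tol) and end_date <= (da_end + tol))
    for freq, (da_start, da_end) in data_bounds.items()
}
print(passes)
```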

@spencerkclark
Collaborator

Although I am now confused about whether it is possible to specify more than one date range for a Run with two different output frequencies.

Just to be clear, this check has nothing to do with what time range actually gets selected in aospy, so even if we removed it entirely it would not preclude you from specifying multiple date ranges in your main script (if you needed it). It just prevents you from trying to use a date range that is asking for data that doesn't exist in the data set.

@spencerkclark
Collaborator

Although I am now confused about whether it is possible to specify more than one date range for a Run with two different output frequencies.

@chuaxr reading this again I'm a little unsure if I interpreted things properly. Could you be a little more specific about what you mean here? What do you mean by "specify more than one date range for a Run with two different output frequencies?" Obviously you can only have one default date range per Run, so I'm assuming you mean in the main script? Or are you saying you'd like to be able to specify a single default date range in the Run that worked for both output frequencies? Could you post a code snippet of the constructor for the Run you are referring to? Thanks!
