
update SRML data to strip trailing nans #572

Merged
merged 4 commits on Sep 15, 2020

Conversation

lboeman
Member

@lboeman commented Sep 15, 2020

  • Closes #543 (SRML data pull needs adjustment).
  • I am familiar with the contributing guidelines.
  • [ ] Tests added.
  • Updates entries to docs/source/api.rst for API changes.
  • Adds descriptions to appropriate "what's new" file in docs/source/whatsnew for all changes. Includes link to the GitHub Issue with :issue:`num` or this Pull Request with :pull:`num`. Includes contributor name and/or GitHub username (link with :ghuser:`user`).
  • New code is fully documented. Includes numpydoc compliant docstrings, examples, and comments where necessary.
  • Maintainer: Appropriate GitHub Labels and Milestone are assigned to the Pull Request and linked Issue.

Gaps in SRML data have been caused by posting NaN values when posting reference data. These NaNs exist because SRML files are prefilled through the given month with NaN values during the day and 0s at night. Posting the NaN values that occur between the last measurement and the end of the period we want to update causes the /observations/{uuid}/values/latest endpoint of the arbiter API to return the last timestamp, filled with a NaN. The reference data update script then uses that latest time as the start of the range to update. A simple call to dropna is insufficient because NaNs are also used to represent truly missing data.

This updates the solarforecastarbiter.io.reference_observations.srml.fetch function to slice data from start to end, and then slice again up to the last valid index. This should handle dropping the trailing NaNs that cause the gaps in data, as can be seen at the Hermiston, OR site.

This should handle sites with a lag in data availability of less than 24 hours, which seems to be the most common case. One caveat is that the update must run before sunset so as not to catch the prefilled nighttime 0s; I'm open to any other ideas on how to detect the true last valid value.
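A minimal sketch of the trimming logic, using hypothetical data in place of a parsed SRML file (the real change lives in solarforecastarbiter.io.reference_observations.srml.fetch):

```python
import numpy as np
import pandas as pd

# Hypothetical SRML-style series: real measurements, one truly missing
# value mid-period, then NaNs prefilled through the rest of the period.
index = pd.date_range('2020-09-15 10:00', periods=6, freq='5min',
                      tz='Etc/GMT+8')
data = pd.Series([1.0, 2.0, np.nan, 4.0, np.nan, np.nan], index=index)

start, end = index[0], index[-1]

# Slice to the requested period, then truncate at the last valid index.
# Only the trailing NaNs are dropped; the interior NaN, which represents
# truly missing data, is preserved (a plain dropna would remove it too).
sliced = data.loc[start:end]
trimmed = sliced.loc[:sliced.last_valid_index()]
print(trimmed)  # ends at 10:15, with the NaN at 10:10 intact
```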

@lboeman added the bug label on Sep 15, 2020
@lboeman added this to the 1.0 rc3 milestone on Sep 15, 2020
Member

@wholmgren left a comment
Seems like we're fighting against the SRML convention. Wouldn't life be easier if we committed to refilling the entire month with each data fetch? We're already downloading the data.

docs/source/whatsnew/1.0.0rc3.rst (outdated, resolved)
solarforecastarbiter/io/reference_observations/srml.py (outdated, resolved)
@alorenzo175
Contributor

I guess one problem I see is that .last_valid_index() may also exclude valid missing data observations. Although if we keep the current cronjob of looking from now - 10 days to now instead of using the latest value from the API, it shouldn't be a problem in practice and this would make the /latest endpoint serviceable for other purposes.

So I guess keep this like it is, and we'll also keep the cronjob with the set start time instead of relying on /latest
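To illustrate that edge case with a toy series (my construction, not code from this PR): if the period genuinely ends in missing observations, last_valid_index() cannot distinguish them from SRML's prefill, so they are trimmed as well:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan],
              index=pd.date_range('2020-09-15', periods=3, freq='1min'))
# Both trailing NaNs are dropped, whether they are SRML prefill or
# genuinely missing observations we would have wanted to post.
print(s.loc[:s.last_valid_index()])  # only the 00:00 point remains
```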

@lboeman
Member Author

lboeman commented Sep 15, 2020

Seems like we're fighting against the SRML convention. Wouldn't life be easier if we committed to refilling the entire month with each data fetch? We're already downloading the data.

Yes, that does seem easier, but because the common update function slices the data anyway, I tried this approach to maintain the functionality of the /latest endpoint.
It seems like @alorenzo175's fix of adjusting the cron job and not utilizing the /latest endpoint in the update is roughly in line with your suggestion of just reposting data; it ensures that we don't have these gaps and will eventually include any valid missing data points.

@@ -154,7 +162,16 @@ def fetch(api, site, start, end):
     # adjust power from watts to megawatts
     for column in power_columns:
         all_period_data[column] = all_period_data[column] / 1000000
-    all_period_data = all_period_data[var_columns]
+    all_period_data = all_period_data.loc[start:end, var_columns]
Contributor
Is this the ValueError @wholmgren mentioned (or a KeyError)? Either way, instead of raising, I would prefer catching and logging, like line 154, to avoid problems with the data ingestion job.

Member

that's all I meant

Member Author

@wholmgren I accepted the suggestion without double checking; the ValueError is not raised, but a warning is logged and an empty DataFrame is returned. So I've removed the Raises section I had added to the docstring.

@alorenzo175 I think the only expected exception here would be the TypeError raised above if start/end don't have the same timezone. I don't believe there are any exceptions that would crop up from cron run to cron run and cause a crash. Is there a potential issue you see here?
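For reference, a toy example (my construction, not code from this PR) of how that kind of TypeError can surface when the slice bounds and the data index disagree on timezone awareness:

```python
import pandas as pd

s = pd.Series([1.0, 2.0],
              index=pd.date_range('2020-09-15', periods=2, freq='1min',
                                  tz='Etc/GMT+8'))
# Slicing a tz-aware index with a tz-naive bound raises a TypeError
# ("Cannot compare tz-naive and tz-aware timestamps") in pandas.
s.loc[pd.Timestamp('2020-09-15'):]
```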

Labels: bug
Projects: none yet
Development: successfully merging this pull request may close issue #543 (SRML data pull needs adjustment)
3 participants