-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
update SRML data to strip trailing nans #572
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like we're fighting against the SRML convention. Wouldn't life be easier if we committed to refilling the entire month with each data fetch? We're already downloading the data.
I guess one problem I see is that So I guess keep this like it is, and we'll also keep the cronjob with the set start time instead of relying on /latest |
Yes, that does seem easier but because the common update function slices data anyway, I tried to do this to maintain functionality of the |
Co-authored-by: Will Holmgren <william.holmgren@gmail.com>
fb17286
to
fb40558
Compare
@@ -154,7 +162,16 @@ def fetch(api, site, start, end): | |||
# adjust power from watts to megawatts | |||
for column in power_columns: | |||
all_period_data[column] = all_period_data[column] / 1000000 | |||
all_period_data = all_period_data[var_columns] | |||
all_period_data = all_period_data.loc[start:end, var_columns] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the ValueError @wholmgren mentioned (or KeyError)? Either way, instead of raising, I would prefer catching and logging like line 154 to avoid problems with data ingestion job
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's all I meant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wholmgren I accepted the comment without double checking, the ValueError
is not raised, but a warning is logged and an empty data frame is returned. So I've removed the Raises docstring I added.
@alorenzo175 I think the only expected Exception here would be the TypeError
if start/end don't have the same timezone above. I don't believe there are any exceptions that would crop up here cron run to cron run and cause a crash. Is there a potential issue you see here?
docs/source/api.rst
for API changes.docs/source/whatsnew
for all changes. Includes link to the GitHub Issue with:issue:`num`
or this Pull Request with:pull:`num`
. Includes contributor name and/or GitHub username (link with:ghuser:`user`
).Gaps in SRML data have been caused by posting NaN when posting reference data. These NaN values exist because SRML files are prefilled through the given month with NaN values during the day and 0s at night. Posting the NaN values that occur between the last measurement and the end of the period we want to update causes the
/observations/{uuid}/values/latest
endpoint of the arbiter API to return the last timestamp filled with a NaN. The reference data update script then uses as the latest time as the start of the range to update. A simple call todropna
is insufficient because NaNs are also used to represent truly missing data.This updates the
solarforecastarbiter.io.reference_observations.srml.fetch
function to slice data from start to end, and then slice again until the last valid index. This should handle dropping the trailing NaNs that cause the gaps in data as can be seen at the Hermiston OR site .This should handle sites with a lag in data availability of less than 24 hours, which seems to be the most common case. One caveat is that it must run before sunset so as to not catch the prefilled nighttime 0s, I'm open to any other ideas on how to detect the true last valid value.