
using np.unique to de-duplicate historical ensembles shuffles them #338

Open
mathause opened this issue Nov 19, 2023 · 2 comments

Comments

@mathause
Member

In separate_hist_future the historical ensembles are de-duplicated using np.unique. This shuffles the ensemble members "randomly", because np.unique sorts its output (see the sketch below the list)!

TLDR: This messes up the order of the residual variability and underestimates its influence.


  1. This is relevant for train_gv, as the target (tas) and one of the predictors (gv_novolc_tas_s, the residual variability) are split. The target and the residual variability then end up in a random order, leading to a wrong result.
  2. The "smooth" predictors (tas, tas2, hfds) are averaged over the ensemble members (i.e. they are the same for all ensemble members) - so for them it does not matter.
  3. This seems to have a small influence on the params of the forced response, but gives the residual variability more predictive power (unsurprisingly). I cannot check this thoroughly, but will add the example where I found it.
  4. separate_hist_future is also used in train_gt, but there it does not matter, because the hist simulations are then averaged over.
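
For illustration, a minimal sketch of the effect (the member labels below are made up for the example, they are not the actual data):

import numpy as np

# hypothetical ensemble-member labels, in their original (run) order
members = np.array(["r10i1p1f1", "r2i1p1f1", "r1i1p1f1", "r2i1p1f1"])

# np.unique de-duplicates but sorts lexically, so the de-duplicated
# members no longer line up with the original run order
np.unique(members)
# array(['r10i1p1f1', 'r1i1p1f1', 'r2i1p1f1'], dtype='<U9')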

The story is: I replicated a workflow with several ensemble members outside of "legacy" MESMER and the params of the "local forced response" did not line up. It took way too long comparing and trying to re-order the targs and preds of my and Lea's versions. Only when I was able to rule out any error on my side did it click.

This was a pain to figure out - I know that np.unique sorts its output, but I never thought about its implications when looking at separate_hist_future.

cc @leabeusch (FYI) and @znicholls who might find this interesting.

@mathause mathause added the bug Something isn't working label Nov 19, 2023
@mathause
Member Author

And to complete this, an example:

import numpy as np

arr = np.array([1, 0, 1, 5, 2])

# np.unique de-duplicates but also sorts, losing the original order
np.unique(arr)
# array([0, 1, 2, 5])

And a potential bugfix:

# de-duplicate but keep the order of first appearance
__, idx = np.unique(arr, return_index=True)
arr[np.sort(idx)]
# array([1, 0, 5, 2])
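
Wrapped into a small helper, this could look roughly as follows (just a sketch, the function name is made up and not part of MESMER):

import numpy as np

def unique_keep_order(arr):
    # de-duplicate while preserving the order of first appearance
    # (in contrast to np.unique, which sorts)
    __, idx = np.unique(arr, return_index=True)
    return arr[np.sort(idx)]

unique_keep_order(np.array([1, 0, 1, 5, 2]))
# array([1, 0, 5, 2])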

@znicholls
Collaborator

Nice find! I can fully imagine that it was a huge pain to work all this out.

I replicated a workflow with several ensemble members outside of "legacy" MESMER and the params of the "local forced response" did not line up

Are you building MESMER again elsewhere? We've started doing these sorts of re-writes more and more in other stuff we maintain and have found them incredibly helpful. I'd be interested to hear what the plan is here (but no pressure, you have lots of things happening).

@mathause mathause mentioned this issue Jan 3, 2024
@veni-vidi-vici-dormivi veni-vidi-vici-dormivi added this to the v1.0.0 milestone Jun 21, 2024
@veni-vidi-vici-dormivi veni-vidi-vici-dormivi removed the bug Something isn't working label Aug 22, 2024