Postprocessing step np.nan_to_num #39

Open
cruyffturn opened this issue Jan 4, 2024 · 4 comments

@cruyffturn

Hello, thank you for providing us with this package. In the postprocessing step of acs.ACSIncome, np.nan_to_num is applied with the second argument -1. The second positional argument of np.nan_to_num is [copy](https://numpy.org/doc/stable/reference/generated/numpy.nan_to_num.html), which is a boolean. Does -1 imply copy=False, or was the goal to replace missing values with -1?

@N-Masi commented Jan 7, 2024

-1 implies copy=True (you can test this with assert bool(-1)). copy=True is the intended behavior for the postprocessing by the authors, as you can see at folktables.py:101. This leads me to believe it is coincidental that -1 as the second arg happens to produce another intended behavior (copying), and that the authors meant for NaNs to be replaced by -1, which would be achieved with lambda x: np.nan_to_num(x, nan=-1) (this also copies by default).
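A minimal sketch illustrating the difference (the array here is made up purely for illustration):

```python
import numpy as np

x = np.array([1.0, np.nan, 3.0])

# The second positional argument is `copy`, so -1 is interpreted as copy=True
# and NaNs get the default fill value 0.0.
print(np.nan_to_num(x, -1))      # [1. 0. 3.]

# Passing -1 via the `nan` keyword replaces NaNs with -1 (also copies by default).
print(np.nan_to_num(x, nan=-1))  # [ 1. -1.  3.]
```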

The same postprocess (lambda x: np.nan_to_num(x, -1)) is applied to all BasicProblems defined in acs.py, which changes the results of using the package. When you change the postprocessing of ACSEmployment to nan=-1, the equality of opportunity violation reported in the first README example for Alabama changes from 0.0871 to 0.0789.
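A rough sketch of how one might reproduce that comparison by sidestepping the built-in postprocess and filling NaNs with -1 directly. It assumes ACSEmployment.features, ACSEmployment.target, and ACSEmployment.group are accessible attributes and that ESR == 1 encodes the positive label, matching the package's target_transform; adapt as needed:

```python
import numpy as np
from folktables import ACSDataSource, ACSEmployment

data_source = ACSDataSource(survey_year='2018', horizon='1-Year', survey='person')
acs_data = data_source.get_data(states=["AL"], download=True)

# Extract the ACSEmployment columns and fill NaNs with -1 ourselves, instead of
# relying on the package's postprocess (which currently fills them with 0).
raw = acs_data[ACSEmployment.features].to_numpy(dtype=float)
features = np.nan_to_num(raw, nan=-1)
label = (acs_data[ACSEmployment.target] == 1).to_numpy()  # assumed positive-label encoding
group = acs_data[ACSEmployment.group].to_numpy()
```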

Because the values of ACS variables are largely categorical, I think there should maybe be more discussion around what to do with NaN values rather than just replacing them with -1. Perhaps entries with NaN values for some variable should be thrown out? If the answer is to just let the user decide, then a discussion of this question and of alternative postprocessing functions with their implications is definitely warranted in the README.
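For example, a minimal sketch of the "throw out rows with NaNs" alternative, filtering before handing the data to the problem definition (this again assumes ACSEmployment.features is accessible; it is illustrative, not a recommendation):

```python
from folktables import ACSDataSource, ACSEmployment

data_source = ACSDataSource(survey_year='2018', horizon='1-Year', survey='person')
acs_data = data_source.get_data(states=["AL"], download=True)

# Drop every row with a missing value in any ACSEmployment feature column,
# then convert as usual; the postprocess has no NaNs left to fill.
complete_rows = acs_data.dropna(subset=ACSEmployment.features)
features, label, group = ACSEmployment.df_to_numpy(complete_rows)
```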

@mrtzh (Member) commented May 1, 2024

Right, I just stumbled over this while reviewing #41.

If I had to guess what happened, I'd say that the way we called nan_to_num was unintended. It has the effect that NaNs were replaced with 0 and not with -1.

I'll have to think more about the best solution for dealing with NaNs. Given that this package has been used for quite some time, my inclination might be to continue to replace NaNs with 0 to ensure replicability of various results that might depend on it.

@AndreFCruz (Member)

Personally I think it'd make sense to keep the previous unintended behavior (replace with 0s instead of -1s) to maintain reproducibility of results 😅

One potential issue with replacing NaNs with 0s is that 0 is a valid value that is present in multiple columns.

I looked into it, and only ACSHealthInsurance seems to have this problem in a significant part of the dataset (17% NaNs and 10% 0s are mapped to the same 0 representation). Other tasks have plenty of NaNs but no 0s.

See the last cell of this notebook for a more detailed investigation of which tasks and columns are affected.
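For readers without access to the notebook, a rough sketch of the kind of per-column check involved, using ACSHealthInsurance and Alabama as an example (the feature list is assumed to be exposed as .features):

```python
import pandas as pd
from folktables import ACSDataSource, ACSHealthInsurance

data_source = ACSDataSource(survey_year='2018', horizon='1-Year', survey='person')
acs_data = data_source.get_data(states=["AL"], download=True)

cols = acs_data[ACSHealthInsurance.features]

# Fraction of entries that are NaN vs. literal 0 in each feature column;
# any column where both are nonzero conflates the two after nan_to_num.
summary = pd.DataFrame({
    "frac_nan": cols.isna().mean(),
    "frac_zero": cols.eq(0).mean(),
})
print(summary[(summary["frac_nan"] > 0) & (summary["frac_zero"] > 0)])
```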

Since only one task seems to be affected by this, I suggest keeping the previous behavior and adding a warning in the package repository. Also, the implicit NaN-to-0 mapping should perhaps be made explicit with nan_to_num(x, copy=True, nan=0.0) instead of hiding behind the default value of nan_to_num.
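Concretely, the explicit version of the current behavior could look like this (a sketch, not a committed change):

```python
import numpy as np

def postprocess(x):
    # Make the current NaN -> 0 mapping explicit instead of relying on
    # nan_to_num's default fill value.
    return np.nan_to_num(x, copy=True, nan=0.0)
```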

@ericvd-ucb

FWIW, the teachability of this is awesome. If we could find some results that changed due to the treatment of the NaNs, that would be great material for a teaching lesson, e.g. on data processing decisions affecting results.
