Postprocessing step np.nan_to_num #39

Open
cruyffturn opened this issue Jan 4, 2024 · 4 comments

@cruyffturn

Hello, thank you for providing us with this package. In the postprocessing step of acs.ACSIncome, np.nan_to_num is applied with the second argument -1. The second positional argument of np.nan_to_num is [copy](https://numpy.org/doc/stable/reference/generated/numpy.nan_to_num.html), which is a boolean. Does -1 imply copy=False, or was the goal to replace missing values with -1?

@N-Masi commented Jan 7, 2024

-1 implies copy=True (you can test this with assert bool(-1)). copy=True is the intended behavior for the postprocessing by the authors, as you can see at folktables.py:101. This leads me to believe it is coincidental that -1 as the second arg happens to produce another intended behavior (copying), and that the authors meant for NaNs to be replaced by -1, which would be achieved with lambda x: np.nan_to_num(x, nan=-1) (this also copies by default).
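A minimal sketch illustrating the difference (the array here is made up purely for illustration):

```python
import numpy as np

x = np.array([1.0, np.nan, 3.0])

# The second positional argument is `copy`, so -1 is interpreted as copy=True
# and NaNs get the default fill value 0.0.
print(np.nan_to_num(x, -1))      # [1. 0. 3.]

# Passing -1 via the `nan` keyword replaces NaNs with -1 (also copies by default).
print(np.nan_to_num(x, nan=-1))  # [ 1. -1.  3.]
```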

The same postprocess (lambda x: np.nan_to_num(x, -1)) is applied to all BasicProblems defined in acs.py, which changes the results of using the package. When you change the postprocessing of ACSEmployment to nan=-1, the equality of opportunity violation reported in the first README example for Alabama changes from 0.0871 to 0.0789.
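A rough sketch of how one might reproduce that comparison by sidestepping the built-in postprocess and filling NaNs with -1 directly. It assumes ACSEmployment.features, ACSEmployment.target, and ACSEmployment.group are accessible attributes and that ESR == 1 encodes the positive label, matching the package's target_transform; adapt as needed:

```python
import numpy as np
from folktables import ACSDataSource, ACSEmployment

data_source = ACSDataSource(survey_year='2018', horizon='1-Year', survey='person')
acs_data = data_source.get_data(states=["AL"], download=True)

# Extract the ACSEmployment columns and fill NaNs with -1 ourselves, instead of
# relying on the package's postprocess (which currently fills them with 0).
raw = acs_data[ACSEmployment.features].to_numpy(dtype=float)
features = np.nan_to_num(raw, nan=-1)
label = (acs_data[ACSEmployment.target] == 1).to_numpy()  # assumed positive-label encoding
group = acs_data[ACSEmployment.group].to_numpy()
```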

Because the values of ACS variables are largely categorical, I think there should maybe be more discussion around what to do with NaN values rather than just replacing them with -1. Perhaps entries with NaN values for some variable should be thrown out? If the answer is to just let the user decide, then a discussion of this question and of alternative postprocessing functions with their implications is definitely warranted in the README.
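For example, a minimal sketch of the "throw out rows with NaNs" alternative, filtering before handing the data to the problem definition (this again assumes ACSEmployment.features is accessible; it is illustrative, not a recommendation):

```python
from folktables import ACSDataSource, ACSEmployment

data_source = ACSDataSource(survey_year='2018', horizon='1-Year', survey='person')
acs_data = data_source.get_data(states=["AL"], download=True)

# Drop every row with a missing value in any ACSEmployment feature column,
# then convert as usual; the postprocess has no NaNs left to fill.
complete_rows = acs_data.dropna(subset=ACSEmployment.features)
features, label, group = ACSEmployment.df_to_numpy(complete_rows)
```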

@mrtzh (Member) commented May 1, 2024

Right, I just stumbled over this while reviewing #41.

If I had to guess what happened, I'd say that the way we called nan_to_num was unintended. It has the effect that NaNs were replaced with 0 and not with -1.

I'll have to think more about the best solution for dealing with NaNs. Given that this package has been used for quite some time, my inclination might be to continue to replace NaNs with 0 to ensure replicability of various results that might depend on it.

@AndreFCruz (Member)

Personally I think it'd make sense to keep the previous unintended behavior (replace with 0s instead of -1s) to maintain reproducibility of results 😅

One potential issue with replacing NaNs with 0s is that 0 is a valid value that is present in multiple columns.

I looked into it, and only ACSHealthInsurance seems to have this problem in a significant part of the dataset (17% NaNs and 10% 0s are mapped to the same 0 representation). Other tasks have plenty of NaNs but no 0s.

See the last cell of this notebook for a more detailed investigation of which tasks and columns are affected.
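For readers without access to the notebook, a rough sketch of the kind of per-column check involved, using ACSHealthInsurance and Alabama as an example (the feature list is assumed to be exposed as .features):

```python
import pandas as pd
from folktables import ACSDataSource, ACSHealthInsurance

data_source = ACSDataSource(survey_year='2018', horizon='1-Year', survey='person')
acs_data = data_source.get_data(states=["AL"], download=True)

cols = acs_data[ACSHealthInsurance.features]

# Fraction of entries that are NaN vs. literal 0 in each feature column;
# any column where both are nonzero conflates the two after nan_to_num.
summary = pd.DataFrame({
    "frac_nan": cols.isna().mean(),
    "frac_zero": cols.eq(0).mean(),
})
print(summary[(summary["frac_nan"] > 0) & (summary["frac_zero"] > 0)])
```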

Since only one task seems to be affected by this, I suggest keeping the previous behavior and adding a warning in the package repository. Also, the implicit NaN-to-0 mapping should perhaps be made explicit with nan_to_num(x, copy=True, nan=0.0) instead of hiding behind the default value of nan_to_num.
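Concretely, the explicit version of the current behavior could look like this (a sketch, not a committed change):

```python
import numpy as np

def postprocess(x):
    # Make the current NaN -> 0 mapping explicit instead of relying on
    # nan_to_num's default fill value.
    return np.nan_to_num(x, copy=True, nan=0.0)
```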

@ericvd-ucb

FWIW, the teachability of this is awesome. If we could find some results that changed due to the treatment of the NaNs, that would be great material for a teaching lesson, e.g. on data processing decisions affecting results.
