Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: standardize fill_value behavior across the API #15533

Open
ResidentMario opened this issue Feb 28, 2017 · 16 comments
Open

ENH: standardize fill_value behavior across the API #15533

ResidentMario opened this issue Feb 28, 2017 · 16 comments
Labels
API - Consistency Internal Consistency of API/Behavior Dtype Conversions Unexpected or buggy dtype conversions Enhancement Error Reporting Incorrect or improved errors from pandas Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate

Comments

@ResidentMario
Copy link
Contributor

ResidentMario commented Feb 28, 2017

Problem

In the PR for #15486, I found that type validation for the fill_value parameters strewn across a large number of pandas API methods is done ad-hoc. This results in a wide variety of possible accepted inputs. I think it would be good to standardize this so that all of these methods use the same behavior, the one currently used by fillna.

Implementation Details

Partially the point of providing a fill_value is to avoid having to do a slow-down type conversion otherwise (using .fillna().astype()). However, specifying other formats is nevertheless a useful convenience to have. Implementation would roughly be:

Before executing the rest of the method body, check whether or not the fill_value is valid (using a centralized maybe_fill method). If it is not, throw a ValueError. If it is, check whether or not incorporating the fill_value would result in an upcast in the column dtype. If it would not, follow a code path where the column never gets type-converted. If it would, follow that same code path, then do something like a filla operation at the end before returning.

Target Implementation

The same as what fillna currently does. Which follows.

Invalid:

  • categorical fill for a category not in the categories will raise a ValueError.
  • sparse matrices refuse upcasting.
  • Passing an object or list or other non-coercable "thing" as a fill.

Valid, upcast:

  • int fill will promote bool dtypes to int.
  • float fill will promote int and bool dtypes to float (this is what happens with np.nan already).
  • object (str) fill would promote lesser dtypes to object.
  • int, float, and bool fill to a datetime dtype will be treated as a UNIX-like timestamp and promoted to datetime.
  • object fill will promote datetime dtype to object.

Valid, no-cast:

  • Everything else.

Current Implementation

...is ad-hoc. The following are the methods which currently provide a fill_value input, as well as where they deviate from the model above.

  • Series.combine, DataFrame.combine, Series.to_sparse: These are unique usages of fill_value which aren't compatible with the rest of them.

  • Series.unstack, DataFrame.unstack: any fill_value is allowed. You can pass an object if you'd like, or even another DataFrame (yo dawg...).

  • DataFrame.align: Any fill_value is allowed.

  • DataFrame.reindex_axis: Lists and dicts are allowed, objects are not.

  • DataFrame.asfreq, Series.asfreq: any fill_value is allowed.

  • pd.pivot_table: ...

  • Series.add, DataFrame.add: ...

  • Series.subtract, DataFrame.substract: ...

  • Probably others, there's a lot of these.

@jreback
Copy link
Contributor

jreback commented Feb 28, 2017

this is kind of like _maybe_promote, see here: https://github.com/pandas-dev/pandas/blob/master/pandas/types/cast.py#L230, though in this case its a validator, but same idea.

a lot of the validation is prob occuring now but at a lower level and with no consistency of error messages. A lot of the routines expect certain types for filling, IOW filling floats needs a compatible float/int (or would raise).

So a friendly high level check would be nice. The hard part about this issue is not the code changes, but the tests :>

Also collecting these tests into a standard place would be fine as well (this is tricky because we like to keep the with the types, e.g. in pandas/tests/series, but for example routines in pandas/tests/missing would be nice as well.

@jreback jreback added Dtype Conversions Unexpected or buggy dtype conversions Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Difficulty Intermediate labels Feb 28, 2017
@jreback jreback added this to the Next Major Release milestone Feb 28, 2017
@jreback jreback added the Error Reporting Incorrect or improved errors from pandas label Feb 28, 2017
@ResidentMario
Copy link
Contributor Author

I'm hopeful I can figure out how to implement. Why are the tests the hard part? I assume you mean figuring out where to put them, which does sound challenging...might collect them in a separate file instead, just for the meantime.

@ResidentMario
Copy link
Contributor Author

ResidentMario commented Mar 4, 2017

_maybe_upcast doesn't have guards against upcasting "weird" stuff. So for example the following is legal:

_maybe_upcast(np.array([np]))

When a fill_value parameter is passed, _maybe_upcast is the first and only validation step that parameter has to go through. So since the above is legal, so is whatever garbage you pass it, e.g. Series.shift(fill_value=<class 'garbage_type'>).

In other cases (e.g. fillna) there is additional validation that prevents this from happening.

Should _maybe_upcast (continue to) allow this behavior? This is deep in the internals, so I suspect not touching it would be best, but it does seem like an odd thing to allow, to me.

It wouldn't be too hard to add a separate check to prevent this sort of input from reaching _maybe_upcast at all.

@ResidentMario
Copy link
Contributor Author

(sorry about the close/open, fat-fingered the wrong button there)

@jreback
Copy link
Contributor

jreback commented Mar 4, 2017

you might be able to add a check here
though i suspect have a _maybe_cast_fill which does validation might be easier

internal routines just have implicit (or better yet explicit guarantees that are in the docstring)

@ResidentMario
Copy link
Contributor Author

ResidentMario commented Mar 5, 2017

FYI, this is legal in fillna right now:

pd.Series([1, 2, np.nan]).fillna(lambda f: f)

Which is counter-factual w.r.t a (separate) TypeError statement in the method body:

    if isinstance(value, (list, tuple)):
        raise TypeError('"value" parameter must be a scalar or dict, but '
                        'you passed a "{0}"'.format(type(value).__name__))

(to fix this you could do if isinstance(value, (list, tuple)) or callable(value))

@jreback
Copy link
Contributor

jreback commented Mar 5, 2017

yeah you prob have to check inclusion rather than exclusion

e.g. is_scalar, is_dict_like, is_list_like

@ResidentMario
Copy link
Contributor Author

Yeah. Funny little bug with that:

 >>> import import pandas.core.common as com
 >>> com.is_string_dtype(type)
 True

@ResidentMario
Copy link
Contributor Author

In com.is_string_type:

dtype.kind in ('O', 'S', 'U')

The type checker naively assumes that if you passed it an object, it must have been a string!

@jreback
Copy link
Contributor

jreback commented Mar 5, 2017

is_string_dtype is not strict. It really can't be w/o a lot of code inference (which is not cheap). You can certainly add a comment to it if you'd like. If you really need inference then you can do lib.infer_type which IS strict. (but again is not free, though not too bad as it short-circuits).

@jreback
Copy link
Contributor

jreback commented Mar 5, 2017

@ResidentMario note that imports from pandas.core.common are almost all deprecated, use pandas.types.common

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Mar 9, 2017

@ResidentMario Is the description at the top of this issue still up to date with how you are trying to implement things in #15587 ?

@ResidentMario
Copy link
Contributor Author

Hmm. This list is incomplete, and I think there's been a couple of changes there:

  • Period is O dtype, the implementation there uses object rules for Period columns because of that. (probably just need to investigate this further?)
  • I'm being a bit more strict with only allowing datetime types to datetime64[ns] columns, not numerical types (so no int, float, etc.).

A big question right now is whether or not in the case of a DataFrame we want to validate column-by-column or consolidate the dtype (probably into object) and use the rules for filling that instead.

@jreback
Copy link
Contributor

jreback commented Mar 9, 2017

Period is O dtype, the implementation there uses object rules for Period columns because of that. (probably just need to investigate this further?)

yes this is a special case atm, you you can simply use is_period_arraylike on an object column to check, and if true, then restrict the fill value.

I'm being a bit more strict with only allowing datetime types to datetime64[ns] columns, not numerical types (so no int, float, etc.).

yes

@jreback
Copy link
Contributor

jreback commented Mar 9, 2017

A big question right now is whether or not in the case of a DataFrame we want to validate column-by-column or consolidate the dtype (probably into object) and use the rules for filling that instead.

I think a reasonable way to do this is to:

  • add a errors='ignore'|'raise'|'force' kw to .fillna* routines
  • if errors='ignore' (default), then allow filling of a column only if the types match (IOW don't fill datetimes with ints, just skip them)
  • if errors='raise' then raise on anything that is not compat with the filling
  • if errors='force' then coerce the columns as needed (even to object).

this would give nice behavior by default of filling things that can take that value and providing error checking otherwise (with an option for force filling, but that's user selected).

The current situation is effectively errors='force'.

In [2]: df = DataFrame({'A':[1,2,3],'B':pd.date_range('20130101',periods=3)})

In [3]: df
Out[3]: 
   A          B
0  1 2013-01-01
1  2 2013-01-02
2  3 2013-01-03

In [4]: df.iloc[1] = np.nan

In [5]: df
Out[5]: 
     A          B
0  1.0 2013-01-01
1  NaN        NaT
2  3.0 2013-01-03

In [6]: df.fillna(0)
Out[6]: 
     A          B
0  1.0 2013-01-01
1  0.0 1970-01-01
2  3.0 2013-01-03

In [7]: df.fillna(pd.Timestamp('20130110'))
Out[7]: 
                     A          B
0                    1 2013-01-01
1  2013-01-10 00:00:00 2013-01-10
2                    3 2013-01-03

I suppose we could also make the default errors='raise', though not back-compat . This would be more obvious. errors='ignore' is more convenient.

@ResidentMario
Copy link
Contributor Author

ResidentMario commented Mar 10, 2017

Ok so then:

  1. New PR implementing an errors param for fillna via a new validator func (BUG: fillna('') does not replace NaT #11953).
  2. PR implementing that validator func in the various fill_value routines with the default validator func behavior (ENH: standardize fill_value behavior across the API #15533; PR#15587).
  3. PR adding the errors param to shift (ENH: fill_value argument for shift #15486; PR#15527).

I suggest also adding a new pd.set_option param for letting the user pick their error coercion mode if they want.

@jbrockmendel jbrockmendel added the API - Consistency Internal Consistency of API/Behavior label Dec 19, 2019
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API - Consistency Internal Consistency of API/Behavior Dtype Conversions Unexpected or buggy dtype conversions Enhancement Error Reporting Incorrect or improved errors from pandas Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants