
API: distinguish NA vs NaN in floating dtypes #32265

Open
jorisvandenbossche opened this issue Feb 26, 2020 · 121 comments
Labels
API Design · Enhancement · Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) · NA - MaskedArrays (related to pd.NA and nullable extension arrays) · PDEP missing values (issues that would be addressed by the Ice Cream Agreement from the Aug 2023 sprint)

Comments

@jorisvandenbossche
Member

jorisvandenbossche commented Feb 26, 2020

Context: in the original pd.NA proposal (#28095), the topic of pd.NA vs np.nan was raised several times. It also came up in the recent pandas-dev mailing list discussion on pandas 2.0 (both in the context of np.nan for float and pd.NaT for datetime-like dtypes).

With the introduction of pd.NA, and if we want consistent "NA behaviour" across dtypes at some point in the future, I think there are two options for float dtypes:

  • Keep using np.nan as we do now, but change its behaviour (e.g. in comparison ops) to match pd.NA
  • Start using pd.NA in float dtypes

Personally, I think the first one is not really an option. Keeping it as np.nan, but deviating from numpy's behaviour feels like a non-starter to me. And it would also give a discrepancy between the vectorized behaviour in pandas containers vs the scalar behaviour of np.nan.
For the second option, there are still multiple ways this could be implemented (a single array that still uses np.nan as the missing value sentinel but converts it to pd.NA towards the user, versus a masked approach like we do for the nullable integers). But in this issue, I would like to focus on the user-facing behaviour we want: Do we want to have both np.nan and pd.NA, or only allow pd.NA? Should np.nan still be considered as "missing" or should that be optional? What to do on conversion from/to numpy? (The answer to some of those questions will also determine which of the two possible implementations is preferable.)


Actual discussion items: assume we are going to add floating dtypes that use pd.NA as missing value indicator. Then the following question comes up:

If I have a Series[float64] could it contain both np.nan and pd.NA, and these signify different things?

So yes, it is technically possible to have both np.nan and pd.NA with different behaviour (np.nan as "normal", unmasked value in the actual data, pd.NA tracked in the mask). But we also need to decide if we want this.

This was touched upon a bit in the original issue, but not really further discussed. Quoting a few things from the original thread in #28095:

[@Dr-Irv in comment] I think it is important to distinguish between NA meaning "data missing" versus NaN meaning "not a number" / "bad computational result".

vs

[@datapythonista in comment] I think NaN and NaT, when present, should be copied to the mask, and then we can forget about them (for what I understand values with True in the NA mask won't be ever used).

So I think those two describe nicely the two options we have on the question: do we want both pd.NA and np.nan in a float dtype, with them signifying different things? -> 1) Yes, we can have both, versus 2) No, towards the user we only have pd.NA and "disallow" NaN (or interpret / convert any NaN on input to NA).

A reason to have both is that they can signify different things (another reason is that most other data tools do this as well, I will put some comparisons in a separate post).
That reasoning was given by @Dr-Irv in #28095 (comment): there are times when I get NaN as a result of a computation, which indicates that I did something numerically wrong, versus NaN meaning "missing data". So should there be separate markers - one to mean "missing value" and the other to mean "bad computational result" (typically 0/0) ?

A dummy example showing how both can occur:

>>>  pd.Series([0, 1, 2]) / pd.Series([0, 1, pd.NA])
0    NaN
1    1.0
2   <NA>
dtype: float64

The NaN is introduced by the computation, the NA is propagated from the input data (although note that in an arithmetic operation like this, NaN would also propagate).

So, yes, it is possible and potentially desirable to allow both pd.NA and np.nan in floating dtypes. But, it also brings up several questions / complexities. Foremost, should NaN still be considered as missing? Meaning, should it be seen as missing in functions like isna/notna/dropna/fillna ? Or should that be an option? Should NaN still be considered as missing (and thus skipped) in reducing operations (that have a skipna keyword, like sum, mean, etc)?

Personally, I think we will need to keep NaN as missing, at least initially. But that will also introduce inconsistencies: although NaN would be seen as missing in the methods mentioned above, in arithmetic / comparison / scalar ops it would behave as NaN and not as NA (so e.g. comparison gives False instead of propagating). It also means that in the missing-related methods, we will need to check for both NaN in the values and the mask (which can also have performance implications).
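
As an illustration of that combined check (a minimal sketch, not pandas internals; values and mask are assumed names for the raw float data and the NA mask):

import numpy as np

def isna_with_nan(values, mask):
    # treat both masked positions (pd.NA) and raw NaN values as missing
    return mask | np.isnan(values)

isna_with_nan(np.array([0.0, 1.0, np.nan]), np.array([True, False, False]))
# array([ True, False,  True])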


Some other various considerations:

  • Having both pd.NA and NaN (np.nan) might actually be more confusing for users.

  • If we want a consistent indicator and behavior for missing values across dtypes, I think we need a separate concept from NaN for float dtypes (i.e. pd.NA). Changing the behavior of NaN when inside a pandas container seems like a non-starter (the behavior of NaN is well defined in IEEE 754, and it would also deviate from the underlying numpy array)

  • How do we handle compatibility with numpy?
    The solution that we have come up with (for now) for the other nullable dtypes is to convert to object dtype by default, and have an explicit to_numpy(.., na_value=np.nan) conversion (see the sketch after this list).
    But given how np.nan is in practice used in the whole pydata ecosystem as a missing value indicator, this might be annoying.

    For conversion to numpy, see also some relevant discussion in API: how to handle NA in conversion to numpy arrays #30038

  • What about conversion / inference on input?
    E.g. creating a Series from a float numpy array with NaNs (pd.Series(np.array([0.1, np.nan]))): do we convert NaNs to NA automatically by default?
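
For reference, a small sketch of the explicit conversion mentioned in the numpy-compatibility item above, using the existing nullable Int64 dtype (current behaviour, shown only to illustrate the API):

import numpy as np
import pandas as pd

s = pd.Series([1, 2, pd.NA], dtype="Int64")
s.to_numpy()                                  # default: object array -> array([1, 2, <NA>], dtype=object)
s.to_numpy(dtype="float64", na_value=np.nan)  # explicit: array([ 1.,  2., nan])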

cc @pandas-dev/pandas-core @Dr-Irv @dsaxton

@jorisvandenbossche added the Missing-data, API Design and NA - MaskedArrays labels on Feb 26, 2020
@jorisvandenbossche
Member Author

How do other tools / languages deal with this?

Julia has both as separate concepts:

julia> arr = [1.0, missing, NaN]
3-element Array{Union{Missing, Float64},1}:
   1.0     
    missing
 NaN       

julia> ismissing.(arr)
3-element BitArray{1}:
 false
  true
 false

julia> isnan.(arr)
3-element Array{Union{Missing, Bool},1}:
 false       
      missing
  true       

R also has both, but will treat NaN as missing in is.na(..):

> v <- c(1.0, NA, NaN)
> v
[1]   1  NA NaN
> is.na(v)
[1] FALSE  TRUE  TRUE
> is.nan(v)
[1] FALSE FALSE  TRUE

Here, the "skipna" na.rm keyword also skips NaN (na.rm docs: "logical. Should missing values (including NaN) be removed?"):

> sum(v)
[1] NA
> sum(v, na.rm=TRUE)
[1] 1

Apache Arrow also has both (NaN can be a float value, while it tracks missing values in a mask). It doesn't yet have many computational tools, but e.g. the sum function skips missing values by default while propagating NaN (like numpy's sum does for float NaN).
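
A quick illustration with pyarrow (a sketch assuming a recent version where pyarrow.compute is available):

import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array([1.0, None, float("nan")])  # one null, one NaN
arr.null_count                             # 1 -- only the None counts as missing
pc.sum(arr)                                # NaN: nulls are skipped, NaN propagates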

I think SQL also has both, but didn't yet check in more detail how it handles NaN in missing-like operations.

@toobaz
Member

toobaz commented Feb 26, 2020

I still don't know the semantics of pd.NA well enough to judge in detail, but I am skeptical about whether users benefit from two distinct concepts. If as a user I divide 0 by 0, it's perfectly fine for me to consider the result as "missing". Even more so because, when done in non-vectorized Python, it raises an error rather than returning some "not a number" placeholder. I suspect the other languages (e.g. at least R) have semantics which are driven more by implementation than by user experience. And I would definitely have a hard time suggesting "natural" ways in which the propagation of pd.NA and np.nan should differ.

So ideally pd.NA and np.nan should be the same to users. If, as I understand, this is not possible given how pd.NA was designed and the compatibility we want to (rightfully) keep with numpy, I think the discrepancies should be limited as much as possible.

@toobaz
Member

toobaz commented Feb 26, 2020

Just to provide an example: I want to compute average hourly wages from two variables: monthly hours worked and monthly salary. If for a given worker I have 0 and 0, in my average I will want to disregard this observation precisely as if it was a missing value. In this and many other cases, missing observations are the result of float operations.
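
A hypothetical illustration of that scenario with plain numpy-backed Series (the numbers are made up):

import pandas as pd

hours = pd.Series([160, 150, 0])           # monthly hours worked; the last worker did not work
salary = pd.Series([4800.0, 3000.0, 0.0])  # monthly salary
salary / hours                             # 30.0, 20.0, NaN -- the 0/0 plays the role of a missing value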

@TomAugspurger
Contributor

Keeping it as np.nan, but deviating from numpy's behaviour feels like a non-starter to me.

Agreed.

do we want both pd.NA and np.nan in a float dtype and have them signify different things?

My initial preference is for not having both. I think that having both will be confusing for users (and harder to maintain).

@jreback
Contributor

jreback commented Feb 26, 2020 via email

@Dr-Irv
Contributor

Dr-Irv commented Feb 26, 2020

Just to provide an example: I want to compute average hourly wages from two variables: monthly hours worked and monthly salary. If for a given worker I have 0 and 0, in my average I will want to disregard this observation precisely as if it was a missing value. In this and many other cases, missing observations are the result of float operations.

On the other hand, such a calculation could indicate something wrong in the data that you need to identify and fix. I've had cases where the source data (or some other calculation) I did produced a NaN, which pandas treats as missing, and the true source of the problem was either back in the source data (e.g., that data should not have been missing) or a bug elsewhere in my code. So in these cases, where the NaN was introduced due to a bug in the source data or in my code, my later calculations were perfectly happy because to pandas, the NaN meant "missing". Finding this kind of bug is non-trivial.

I think we should support np.nan and pd.NA. To me, the complexity is in a few places:

  1. The transition for users so they know that np.nan won't mean "missing" in the future needs to be carefully thought out. Maybe we consider a global option to control this behavior?
  2. Going back and forth between pandas and numpy (and maybe other libraries). If we eventually have np.nan and pd.NA mean "Not a number" and "missing", respectively, and numpy (or another library) treats np.nan as "missing", do we automate the conversions (both going from pandas to numpy/other and ingesting from numpy/other into pandas)?

We currently also have this inconsistent (IMHO) behavior which relates to (2) above:

>>> s=pd.Series([1,2,pd.NA], dtype="Int64")
>>> s
0       1
1       2
2    <NA>
dtype: Int64
>>> s.to_numpy()
array([1, 2, <NA>], dtype=object)
>>> s
0       1
1       2
2    <NA>
dtype: Int64
>>> s.astype(float).to_numpy()
array([ 1.,  2., nan])

@toobaz
Member

toobaz commented Feb 26, 2020

On the other hand, such a calculation could indicate something wrong in the data that you need to identify and fix.

Definitely. To me, this is precisely the role of pd.NA - or anything denoting missing data. If you take a monthly average of something that didn't happen in a given month, it is missing, not a sort of strange floating number. Notice I'm not claiming the two concepts are the same, but just that there is no clear-cut distinction, and even less some natural one for users.

to pandas, the NaN meant "missing"

Sure. And I think we have all the required machinery to behave as the user desires on missing data (mainly, the skipna argument).

@Dr-Irv
Contributor

Dr-Irv commented Feb 26, 2020

On the other hand, such a calculation could indicate something wrong in the data that you need to identify and fix.

Definitely. To me, this is precisely the role of pd.NA - or anything denoting missing data. If you take a monthly average of something that didn't happen in a given month, it is missing, not a sort of strange floating number. Notice I'm not claiming the two concepts are the same, but just that there is no clear-cut distinction, and even less some natural one for users.

When I said "such a calculation could indicate something wrong in the data that you need to identify and fix.", the thing that could be wrong in the data might not be missing data. It could be that some combination of values occurred that were not supposed to happen.

There are just two use cases here. One is where the data truly has missing data, like your example of the monthly average. The second is where all the data is there, but some calculation you did creates a NaN unexpectedly, and that indicates a different kind of bug.

to pandas, the NaN meant "missing"

Sure. And I think we have all the required machinery to behave as the user desires on missing data (mainly, the skipna argument).

Yes, but skipna=True is the default everywhere, so your solution would mean that you have to always use skipna=False to detect those kinds of errors.

@toobaz
Member

toobaz commented Feb 26, 2020

One is where the data truly has missing data, like your example of the monthly average. The second is where all the data is there, but some calculation you did creates a NaN unexpectedly, and that indicates a different kind of bug.

My point is precisely that in my example missing data causes a 0 / 0. But it really originates from missing data. Could 0/0 result in pd.NA? Well, we would deviate not just from numpy, but also from a large number of cases in which 0/0 does not originate from missing data.

@toobaz
Member

toobaz commented Feb 26, 2020

Yes, but skipna=True is the default everywhere, so your solution would mean that you have to always use skipna=False to detect those kinds of errors.

This is true. But... are there new usability insights compared to those we had back in 2017?

@Dr-Irv
Contributor

Dr-Irv commented Feb 26, 2020

My point is precisely that in my example missing data causes a 0 / 0. But it really originates from missing data. Could 0/0 result in pd.NA? Well, we would deviate not just from numpy, but also from a large number of cases in which 0/0 does not originate from missing data.

That's why I think having np.nan representing "bad calculation" and pd.NA represent "missing" is the preferred behavior. But I'm one voice among many.

@shoyer
Member

shoyer commented Feb 26, 2020

That's why I think having np.nan representing "bad calculation" and pd.NA represent "missing" is the preferred behavior. But I'm one voice among many.

+1 for consistency with other computational tools.

On the subject of automatic conversion into NumPy arrays, returning an object dtype array seems consistent but could be a very poor user experience. Object arrays are really slow, and break many/most functions that expect numeric NumPy arrays. Float dtype with auto-conversion from NA -> NaN would probably be preferred by users.

@dsaxton
Member

dsaxton commented Feb 26, 2020

I think using NA even for missing floats makes a lot of sense. In my opinion the same argument that NaN is semantically misleading for missing strings applies equally well to numeric data types.

It also seems that trying to support both NaN and NA might be too complex and could be a significant source of confusion (I would think warnings / errors are the way to deal with bad computations, rather than a special value indicating "you shouldn't have done this"). And if we're being pedantic, NaN doesn't tell you whether you're dealing with 0 / 0 or log(-1), so it's technically still NA. :)
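
For what it's worth, numpy itself already offers a way to surface invalid operations as errors rather than silent NaNs (a sketch of that approach at the numpy level, independent of what pandas decides to do):

import numpy as np

with np.errstate(invalid="raise"):
    np.array([0.0]) / np.array([0.0])  # raises FloatingPointError instead of quietly returning NaN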

@jbrockmendel
Member

And if we're being pedantic NaN doesn't tell you whether you're dealing with 0 / 0 or log(-1), so it's technically still NA.

I propose that from now on we use a branch of log with a branch cut along the positive imaginary axis, avoiding this problem entirely.

@jorisvandenbossche
Member Author

jorisvandenbossche commented Feb 29, 2020

Thanks all for the discussion!

[Pietro] And definitely I would have a hard time suggesting "natural" ways in which the propagation of pd.NA and np.nan should differ.

I think there is, or at least, we now have one: for the new pd.NA, we decided that it propagates in comparisons, while np.nan gives False in comparisons (based on numpy behaviour, based on floating spec). Whether this is "natural" I don't know, but I think it is somewhat logical to do.

[Jeff] this introduces enormous mental complexity; now I have 2 missing values?

Note that it's not necessarily "2 missing values", but rather a "missing value" and a "not a number". Of course, current users are used to seeing NaN as a missing value. For them, there will of course be some initial confusion when NaN is no longer treated as a missing value. And this is certainly an aspect not to underestimate.

[Irv] Maybe we consider a global option to control this behavior?

There is already one for infinity (which is actually very similar to NaN, see more below): pd.options.mode.use_inf_as_na (default False). We could have a similar one for NaN (or a combined one).
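
For reference, a sketch of how that existing option behaves (in pandas versions that still ship it; it has since been deprecated):

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.inf])
s.isna()                                      # [False, False]
with pd.option_context("mode.use_inf_as_na", True):
    s.isna()                                  # [False, True] -- inf now counts as missing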

[Stephan] +1 for consistency with other computational tools.

Yes, I agree it would be nice to follow numpy for those cases that numpy handles (which is things that result in NaN, like 0/0). Having different behaviour for pd.NA is fine I think (like the different propagation in comparison ops), since numpy doesn't have that concept (so we can't really "deviate" from numpy).


From talking with @TomAugspurger and looking at examples, I somewhat convinced myself that making the distinction makes sense (not sure if it convinced @TomAugspurger also, though ;), and there are still a lot of practical concerns)
Consider the following example:

>>> s = pd.Series([0, 1, 2]) / pd.Series([0, 0, pd.NA], dtype="Int64")  
>>> s   
0    NaN
1    inf
2    NaN
dtype: float64

>>> s.isna()
0     True
1    False
2     True
dtype: bool

The above is the current behaviour (where the original NA from the Int64 dtype also gives NaN in float, but with a potential new float dtype, the third value would be NA instead of NaN).
So here, 0 / 0 gives NaN, which is considered missing, while 1 / 0 gives inf, which is not considered missing. Is there a good reason for that difference? And did we in practice get much complaints or have we seen much user confusion about 1 / 0 resulting in inf and not being regarded as missing?

Based on that, I think the following (hypothetical) behaviour actually makes sense:

>>> s = pd.Series([0, 1, 2]) / pd.Series([0, 0, pd.NA], dtype="Int64")  
>>> s   
0     NaN
1     inf
2    <NA>
dtype: float64

>>> s.isna()
0    False
1    False
2     True
dtype: bool

As long as we ensure when creating a new "nullable float" series, that missing values (NA) are used and not NaN (unless the user explicitly asks for that), I think most users won't often run into having a NaN, or not that much more often than Inf (which already has the "non-missing" behaviour).

@jorisvandenbossche
Member Author

jorisvandenbossche commented Feb 29, 2020

On the subject of automatic conversion into NumPy arrays, return an object dtype array seems consistent but could be a very poor user experience. Object arrays are really slow, and break many/most functions that expect numeric NumPy arrays. Float dtype with auto-conversion from NA -> NaN would probably be preferred by users.

@shoyer I agree the object dtype is poor user experience. I think we opted (for now) for object dtype, since this is kind of the most "conservative" option: it at least "preserves the information", although in such a mostly useless way that it's up to the user to decide how to convert it properly.
But indeed in most cases, users will then probably need to do .to_numpy(float, na_value=np.nan) (e.g. that's what scikit-learn will need to do). And if that is what most users will need, shouldn't it just be the default? I find this a hard one ... (as, on the other hand, it's also not nice that the default array you get from np.asarray(..) has quite different behaviour for the NaNs compared to the original NAs).

Another hard topic, in case we no longer see np.nan as missing in a new nullable float dtype, will be how to treat nans in numpy arrays.
For example, what should pd.isna(np.array([np.nan], dtype=float)) do? What should pd.Series(np.array([np.nan]), dtype=<nullable float>) do?
For the conversion from numpy array to Series, I think the default should be to convert NaNs to NA (since most people will have their missing values as NaN in numpy arrays, and so want it as NA in pandas). But if we do that, it would be strange that pd.isna would not return True for np.nan in a numpy array. But if returning True in that case, that would then conflict with returning False for np.nan if being in a nullable Series ...

@toobaz
Member

toobaz commented Feb 29, 2020

I think there is, or at least, we now have one: for the new pd.NA, we decided that it propagates in comparisons, while np.nan gives False in comparisons (based on numpy behaviour, based on floating spec). Whether this is "natural" I don't know, but I think it is somewhat logical to do.

My opinion is that the new pd.NA behaves in this respect in a more "natural" way than the floating spec - at least in a context in which users work with several different dtypes. Hence I respect the decision to deviate. I just would limit the deviation as much as possible. To be honest (but that's maybe another discussion, and I didn't think much about the consequences) I would be tempted to completely eliminate np.nan from floats (replacing it with pd.NA), to solve this discrepancy (even at the cost of deviating from numpy).

Consider the following example:

Actually, your example reinforces my opinion on not making the distinction (where possible).

So here, 0 / 0 gives NaN, which is considered missing, while 1 / 0 gives inf, which is not considered missing. Is there a good reason for that difference?

In [2]: pd.Series([-1, 0, 1]) / pd.Series([0, 0, 0])                                                                                                                                                                                                                                                                                                                       
Out[2]: 
0   -inf
1    NaN
2    inf
dtype: float64

1 / 0 gives inf: this clearly suggests some limit (approaching 0 from the right); -1 / 0 gives -inf: same story; 0 / 0 gives NaN. Why? Clearly because depending on how you converge to 0 in the numerator, you could have 0, inf, or any finite number. So this NaN really talks about missing information, not about some "magic" or "unrepresentable" floating point number. The same holds for np.inf + np.inf vs. np.inf - np.inf. Compare to np.log([-1, 1]), which produces NaN not because any information is missing, but because the result is not representable as a real number.

What I mean is: in the floating specs, NaN already denotes two different cases: of missing information, and of "unfeasible [within real numbers]" operation (together with any combinations of those - in particular when you propagate NaNs).

I know we all have in mind the distinction "I find missing values in my data" vs. "I produce missing values while handling the data". But this is a dangerous distinction to shape API, because the data which is an input for someone was an input for someone else. Are we really saying that if I do 0/0 it is "not a number", while if my data provider does exactly the same thing before passing me the data it is "missing data" that should behave differently?! What if at some step of my data pipeline... I am my data provider? Should we make np.NaN persist as pd.NA any time we save data to disk?!

@jorisvandenbossche
Member Author

jorisvandenbossche commented Feb 29, 2020

[about the NaN as result from 0/0] So this NaN really talks about missing information, not about some "magic" or "unrepresentable" floating point number.

Sorry @toobaz, I don't understand your reasoning here (or just disagree, that's also possible). 0 and 0 are clearly both non-missing values in your data, so for me this is "clearly" not a case of missing information, but rather of an unrepresentable floating point number. 0 and 0 can both be perfectly valid values in both series; it's only their combination and the specific operation that makes them invalid.

Also, you then say that np.log([-1]) gives a NaN not because of missing information. So would you then propose to have 0/0 become pd.NA but keep np.log(-1) as resulting in np.nan?

Are we really saying that if I do 0/0 it is "not a number", while if my data provider does exactly the same thing before passing me the data it is "missing data" that should behave differently?! What if at some step of my data pipeline... I am my data provider? Should we make np.NaN persist as pd.NA any time we save data to disk?!

That's indeed a problem if this is roundtripping through numpy, in which case we can't make the distinction (eg if you receive the data from someone else as a numpy array).
For several file formats, though, we will be able to make the distinction. For example binary formats like parquet support both, and in principle also in csv we could support the distinction (although this would not be backwards compatible).

@toobaz
Member

toobaz commented Feb 29, 2020

0 and 0 are clearly both non-missing values in your data, so for me this "clearly" is not a case of talking missing information, but rather an unrepresentable floating point number.

Why do 1/0 and 0/0 - both of which, strictly speaking, have no answer (even outside reals) - lead to different results? The only explanation I can see is that you can imagine 1/0 as a limit tending to infinity, while in the case of 0/0 you really have no clue. That "no clue" for me means "missing information", not "error". If the problem was "arithmetic error", you'd have "1/0 = error" as well.

Now, I'm not saying I can read in the mind of whoever wrote the standard, or that I particularly agree with this choice, but this really reminds me (together with my example above about monthly averages, which is maybe more practical) that the difference between "missing" and "invalid" is very subtle, so much so that our intuition about what is missing or not seems already different from that which is present in the IEEE standard.

Also, you then say that np.log([-1]) gives a NaN not because of missing information. So would you then propose to have 0/0 become pd.NA but keep np.log(-1) as resulting in np.nan?

... I'm taking this as something we would consider if we distinguish the two concepts. And since it's everything but obvious (to me at least), I consider this as an argument for not distinguishing the two concepts.

That's indeed a problem if this is roundtripping through numpy, in which case we can't make the distinction (eg if you receive the data from someone else as a numpy array).

I was actually not making a point of "we are constrained by implementation", but really of "what should we conceptually do?". Do we want np.NaN as different from pd.NA because it helps us identify code errors we might want to solve? OK, then once we give up fixing the error in code (for instance because the 0/0 legitimately comes from an average on no observations) we should replace it with pd.NA. Creating np.NaN might be perfectly fine, but distributing it (on pd.NA-aware formats) would be akin to a programming mistake. We are really talking about the result of elementary operations which would (have to) become very context-dependent.

Anyway, if my arguments so far are not seen as convincing, I propose another approach: let us try to define which pandas operations currently producing np.NaN should start to produce pd.NA if we wanted to distinguish the two.

For instance: if data['employee'] is a categorical including empty categories, what should data.groupby('employee')['pay'].mean() return for such categories? pd.NA by default, I guess: there is no data...

What should data.groupby('worker')['pay'].sum() / data.groupby('worker').size() return in those cases? It's a 0/0, so np.NaN.

But these are really the same mathematical operation.

OK, so maybe we would solve the inconsistency if data.groupby('worker')['pay'].sum() already returned pd.NA for such categories. And in general - for consistency - for sums of empty lists. But we already have Series.sum(min_count=) which has the opposite default behavior, and for very good reasons: the sum of an empty list often has nothing to do with missing data. After a parallel processing operation, how much time did a CPU spend processing tasks if it happened to not process any? Simple: 0. There's no missing data whatsoever.
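
A sketch of that existing behaviour (current pandas, numpy-backed dtype):

import pandas as pd

empty = pd.Series([], dtype="float64")
empty.sum()             # 0.0 -- the default: an empty sum is simply 0
empty.sum(min_count=1)  # nan -- opting in to "not enough data" being treated as missing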

@dsaxton
Member

dsaxton commented Mar 1, 2020

I think what @toobaz is saying is that 0 / 0 truly is indeterminate (if we think of it as the solution to 0x = 0, then it's essentially any number, which isn't too different from the meaning of NA). The log(-1) case is maybe less obvious, but I think you could still defend the choice to represent this as NA (assuming you're not raising an error or using complex numbers) by saying that you're returning "no answer" to these types of queries (and that way keep the meaning as missing data).

I guess I'm still unsure what would be the actual utility of having another value to represent "bad data" when you already have NA for null values? If you're expecting to see a number and don't (because you've taken 0 / 0 for example), how much more helpful is it to see NaN instead of NA?

To me this doesn't seem worth the potential confusion of always having to code around two null values (it's not even obvious whether we should treat NaN as missing under this new interpretation; if the answer is no, then do we now have to check for two things in places where otherwise we would just ask if something is NA?), and having to remember that they each behave differently. Using only NA would also seemingly make it easier to translate from numpy to pandas (np.nan is always pd.NA, rather than sometimes pd.NA and other times np.nan depending on context).

(A bit of a tangent from this thread, but reading about infinity above made me wonder if this could also be a useful value to have in other non-float dtypes, for instance infinite Int64 or Datetime values?)

@shoyer
Member

shoyer commented Mar 1, 2020

I am coming around to the idea that distinguishing between NaN and NA may not be worth the trouble. I think it would be pretty reasonable to both:

  1. Always use NA instead of NaN for floating point values in pandas. This would change semantics for comparisons, but otherwise would be equivalent. It would not be possible to put NaN in a float array.
  2. Transparently convert NaN -> NA and NA -> NaN when going back and forth with NumPy arrays. This would go a long ways for compatibility the existing ecosystem (e.g., xarray and scikit-learn). I really don't think anyone wants object dtype arrays, and NaN is close enough for libraries built on top of NumPy.
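
A sketch of what that round-trip could look like from the user's side, using today's nullable Float64 API purely for illustration (the exact defaults are what is being proposed here, not something settled):

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan], dtype="Float64")  # NaN on input becomes pd.NA
s.to_numpy(dtype="float64", na_value=np.nan)   # and comes back out as NaN: array([ 1., nan])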

@toobaz
Member

toobaz commented Mar 1, 2020

I totally agree with @shoyer 's proposal.

It would be nice to leave a way for users to force keeping np.NaNs as such (in order to keep the old comparisons semantics, and maybe even to avoid the conversions performance hit?), but it might be far from trivial, and hence not worth the effort.

@TomAugspurger
Contributor

I'm probably fine with transparently converting NA to NaN in asarray for float dtypes. I'm less sure for integer, since that goes against our general rule of not being lossy.

@toobaz
Member

toobaz commented Mar 1, 2020

I agree. Without pd.NA, pandas users sooner or later were going to get accustomed to ints with missing values magically becoming floats, but that won't be true any more.

(Ideally, we would want a numpy masked array, but I guess asarray can't return that)
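
For illustration, the difference being referred to (current behaviour):

import pandas as pd

pd.Series([1, 2, None])                 # NumPy-backed: upcast to float64, missing value shown as NaN
pd.Series([1, 2, None], dtype="Int64")  # nullable dtype: stays integer, missing value is pd.NA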

@jorisvandenbossche
Member Author

[Pietro, about deciding whether an operation should better return NaN or NA] And since it's everything but obvious (to me at least), I consider this as an argument for not distinguishing the two concepts.

I agree it is not obvious what is fundamentally "best". But, if we don't have good arguments either way, that could also be a reason to just follow the standard and what numpy does.

I propose another approach: let us try to define which pandas operations currently producing np.NaN should start to produce pd.NA if we wanted to distinguish the two.

In theory, I think there can be a clear cut: we could produce NaN whenever an operation with numpy produces a NaN, and we produce NAs whenever it is a pandas concept such as alignment or skipna=False that produces NAs.
Now, in practice, there might be corner cases, though. Like the unobserved categories you mentioned (which can be seen as missing (-> NA) or as length 0 (-> mean would give NaN)). mean([]) might be such a corner case in general. Those corner cases are certainly good reasons to not make the distinction.

I think what @toobaz is saying is that 0 / 0 truly is indeterminate (if we think of it as the solution to 0x = 0, then it's essentially any number, which isn't too different from the meaning of NA).

OK, that I understand!

I guess I'm still unsure what would be the actual utility of having another value to represent "bad data" when you already have NA for null values?

Apart from the (possible) utility for users to be able to represent both (which is of course a trade-off with the added complexity for users of having both), there are also other clear advantages of having both NaN and NA, I think:

  • It is (mostly / more) consistent with R, Julia, SQL, Arrow, ... (basically any other data system I am somewhat familiar with myself)
  • It is easier to implement and possibly more performant / more able to share code with the masked integers.
    (e.g. we don't need to check if NaNs are produced in certain operations to ensure we convert them to NA)

This last item of course gets us into the implementation question (which I actually wanted to avoid initially). But assuming we go with:

Always use NA instead of NaN for floating point values in pandas. This would change semantics for comparisons, but otherwise would be equivalent. It would not be possible to put NaN in a float array.

would people still use NaN as a sentinel for NA, or use a mask and ensure all NaN values in the values are also marked in the mask?
The advantages of using NaN as a sentinel are that we don't need to check for NaN being produced or inserted (as the NaN will be interpreted as NA anyway) and that conversion to numpy is easier. The advantages of a mask are that we can more easily share code with the other masked extension arrays (although with a mask property that is dynamically calculated, we can probably still share a lot) and that it keeps open the potential of zero-copy conversion with Arrow.
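
A minimal sketch of the two storage strategies being weighed here (illustrative only, not pandas internals):

import numpy as np

# (a) NaN as the NA sentinel: the mask is derived on demand from the values
values_a = np.array([1.0, np.nan, 2.0])
mask_a = np.isnan(values_a)              # positions reported as pd.NA

# (b) explicit mask, as in the nullable Int64/boolean arrays
values_b = np.array([1.0, 0.0, np.nan])  # the NaN here stays a real value, not missing
mask_b = np.array([False, True, False])  # only the second element is pd.NA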

@jorisvandenbossche
Member Author

I agree that for the conversion to numpy (or at least __array__), we can probably use floats with NaNs.

@TomAugspurger
Contributor

FWIW, @kkraus14 noted that cuDF supports both NaN and NA in their float columns, and their users are generally happy with the ability to have both.

@jorisvandenbossche
Member Author

We could, yes, but that would be a huge breaking change if the default for nan_as_null would be False (but so that depends a lot on which default we would choose here). At the moment anyone using NaN assumes it to be missing-value, so I think (at least on the short term), we either need to auto-convert to NA or treat NaN as missing as well.

@jbrockmendel
Member

jbrockmendel commented Aug 9, 2023

but that would be a huge breaking change if the default for nan_as_null would be False

Yes. So supposing that the default is True.

Update Not having the default be True seems like it would break a ton of existing code for users using np.float64 dtype, which is the vast majority.

@jorisvandenbossche
Member Author

Update Not having the default be True seems like it would break a ton of existing code for users using np.float64 dtype, which is the vast majority.

To be clear, I was only talking about the nullable Float64Dtype, not np.float64 (so it wouldn't break those existing users, i.e. just like the current implementation in main, which already has this default).
In my mind this discussion is only about the former (since np.float64 only has NaN), but of course potential inconsistencies between both types can be an argument for certain choices.

@Dr-Irv
Contributor

Dr-Irv commented Aug 15, 2023

I had a discussion with @jorisvandenbossche today and I would really like to have more discussion on Option 3 above, but with a small twist that I'll call Option 4.

Option 4: distinguish NaN and NA, and don't treat NaN as missing (by default) except upon conversion, and add helper methods

We distinguish NaN and NA everywhere (also in methods that deal with missing values where ONLY NA is considered missing), but let NaNs only get introduced by mathematical operations, and not by construction (except for inputs/file formats that already can distinguish both, of course). We treat np.nan the same way that np.inf is treated today.

In the above examples:

  • construction: by default we consider NaN as NA, so it returns a Series with [1., NA, NA]
  • setitem: by default, we consider NaN as NA, so we convert np.nan to pd.NA in setitem operations
  • mathematical operations: NaN results from invalid mathematical operations, so it returns a Series with [nan, nan]
  • methods that handle missing values distinguish NA and NaN, and thus don't treat NaN as missing by default
  • methods that skip missing values (skipna=True) only skip NAs by default

We add helper functions:

  • pd.isnpnan() to test if np.nan is present (which could ONLY occur after a mathematical operation)
  • Series.convert_npnan() that converts any NaN values to pd.NA

When converting from a float array that has np.nan and pd.NA inside to numpy, we leave np.nan alone and convert pd.NA to np.nan. When converting numpy to pandas, we convert np.nan to pd.NA.

The ONLY way that np.nan ever appears in a Series is due to a mathematical operation, and only for Float64Dtype and float[pyarrow]

When someone has a Series that has a mix of np.nan and pd.NA values, np.nan is treated just like np.inf.

Users who are using Float64Dtype or float[pyarrow] can only get np.nan due to mathematical operations. Any constructors or setitem operations using np.nan are converted to pd.NA, which mirrors current behavior.
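
As a rough user-space sketch of what the proposed helpers could do on today's nullable Float64 dtype (the names isnpnan / convert_npnan come from the proposal; the implementation below is only an assumed illustration, not an agreed API):

import numpy as np
import pandas as pd

def isnpnan(ser):
    # replace pd.NA with 0.0 so that only genuine float NaNs remain NaN
    return pd.Series(np.isnan(ser.to_numpy(dtype="float64", na_value=0.0)), index=ser.index)

def convert_npnan(ser):
    # turn genuine NaNs into pd.NA, leaving existing NA values untouched
    return ser.mask(isnpnan(ser), pd.NA)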

@shoyer
Member

shoyer commented Aug 15, 2023

I don't have strong feelings about how pandas should handle NaN, but I would note that NaN is a floating point number thing, not a NumPy thing. So if pandas has NaN specific APIs, they shouldn't refer to something like "npnan".

@MarcoGorelli
Member

Hey @Dr-Irv - just to check, is there any difference between what you're suggesting and the status quo? (other than the extra helper methods)

@Dr-Irv
Contributor

Dr-Irv commented Aug 25, 2023

Hey @Dr-Irv - just to check, is there any difference between what you're suggesting and the status quo? (other than the extra helper methods)

Interesting. I just did some tests, and almost everything is there. Except this:

When converting from a float array that has np.nan and pd.NA inside to numpy, we leave np.nan alone and convert pd.NA to np.nan.

E.g.:

>>> s=pd.Series([0,1,pd.NA], dtype="Int64")
>>> t=s/s
>>> t
0     NaN
1     1.0
2    <NA>
dtype: Float64
>>> t.to_numpy()
array([nan, 1.0, <NA>], dtype=object)
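
For comparison, the explicit conversion already gives a plain float array (using the t from the snippet above); the open question is whether something like this should be the default:

>>> t.to_numpy(dtype="float64", na_value=np.nan)
array([nan,  1., nan])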

@daviskirk

Also, pyarrow dtypes (float/double) handle nan differently than the numpy dtypes:

>>> s = pd.Series([0.0, 1.0, None], dtype="double[pyarrow]") / 0.0
>>> s
0     NaN
1     inf
2    <NA>
dtype: double[pyarrow]
>>> pd.isna(s)
0    False
1    False
2     True
dtype: bool

so NaN is NOT considered a na/null.

But when I do the same with numpy floats it IS considered to be na/null.

>>> s = pd.Series([0.0, 1.0, None], dtype="float64") / 0.0
>>> s
0    NaN
1    inf
2    NaN
dtype: float64
>>> pd.isna(s)
0     True
1    False
2     True
dtype: bool

I guess this might also be a bug?

@a-reich

a-reich commented Dec 9, 2023

Hello, I am still interested in this issue and see a maintainer @jbrockmendel added the label to it and several similar ones for “Issues that would be addressed by the Ice Cream Agreement from the Aug 2023 sprint”. But that plan isn’t mentioned (by this excellent name or other) elsewhere in the repo - could someone briefly state what was agreed to?

@soerenwolfers

soerenwolfers commented Jan 12, 2024

#56836 was just closed with a reference to this issue. Pandas is inconsistently mixing up NaN and NA entries in a column of a data type that was specifically designed to add nullability over the still existing non nullable float data types. That's an obvious bug from the user perspective, and nothing is being done about it because of a four year discussion that's going nowhere.

People that don't want to make a distinction between "invalid operation" (NaN) and "missing data" (NA) can use non-nullable floats; people that do can use nullable floats. Wasn't that the whole point of nullability?

Legacy concerns put aside, does anybody disagree with this? (@TomAugspurger @toobaz @MarcoGorelli I think you made comments before that expressed disagreement)

If not, could we reopen #56836 and consider it the first step as part of the de-legacy-ation plan here?

(In fact, I have the feeling this ticket here was originally about whether NaN and NA should be allowed to coexist, but now everybody seems to have come around to that, so the ticket has turned more into a discussion about steps towards allowing it and making the rest of the system sane, like changing isnull(np.nan). Maybe this ticket should therefore be closed and a new one should be opened that's about concrete steps towards a better future, which I believe everybody agrees what it should look like? This might prevent endlessly having to rediscuss the same thing / having this ticket being alive be mistaken for a sign of disagreement.)

As an outsider, this discussion looks like pandas has been stuck for four years just because of an accidental prefix match between two unrelated identifiers.
If IEEE had called it INVALID instead of NaN, would we be having this discussion? SQL, polars, any language with optional data types, they all agree that nullability is not the same as IEEE-754-NaN, and I think it's time pandas catch up.

@Dr-Irv
Contributor

Dr-Irv commented Jan 12, 2024

@soerenwolfers in the proposals listed at #32265 (comment) and #32265 (comment) , we are proposing to convert np.nan to pd.NA in constructors (with an option that allows np.nan to be preserved).

As an outsider, this discussion looks like pandas has been stuck for four years just because of an accidental prefix match between two unrelated identifiers.

It's more that when pandas was created 15+ years ago, it was based on numpy, and np.nan was then chosen to represent missing values within pandas. The proposals linked in this comment help us provide a transition path for the pandas community that has treated np.nan as meaning "missing value" for a long time. I'm in agreement that they should be separated - it is the transition path that is the real challenge here.

@soerenwolfers

@Dr-Irv Great, thanks for the clarification.

FWIW, my vote is on Option 2. IMO Options 3 and 4 might be useful transition steps, but would be serious reason to not use pandas if kept.

@avm19

avm19 commented Jun 4, 2024

While this discussion is still stuck (or ongoing if you like), there are some things that could be done regardless. For example, #55787 can be re-opened and addressed: whether or not pd.NA and np.nan are semantically different, there must be no difference between s.isna() and s.map(pd.isna). As a user, I find that this is the only option compatible with sanity.

@soerenwolfers

@avm19 depending on which way the issue here is resolved eventually, and depending on which way you propose the functions in the issue you linked should be equalized, the changes that you propose might have to be rolled back later. I guess that's something that the developers want to avoid. The only thing that's worse than internal inconsistency is unnecessary inconsistency between versions.

@glaucouri

Another unexpected behavior is that NaN sometimes loses its 'invalid' meaning and becomes a valid number.

This is very inconsistent:

((a:=pd.Series([0],dtype='Float32'))/a).astype('Int32')
# 0    -2147483648
# dtype: Int32

@WillAyd
Member

WillAyd commented Jul 23, 2024

After reviewing this discussion here and in PDEP-16, I gather the general consensus is that there is value in distinguishing these, but there is a lot of concern around the implementation (and rightfully so, given the history of pandas).

With that being the case, maybe we can just concretely start by adding the nan_is_null keyword to .fillna, .isna, and .hasna with the default value of True that I think @jbrockmendel and @jorisvandenbossche landed on? That maintains backwards compatibility and has prior art in pyarrow (save the fact that pyarrow defaults to False).
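
The pyarrow prior art being referred to, as a sketch (assuming a reasonably recent pyarrow, where the keyword lives on pyarrow.compute.is_null):

import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array([1.0, None, float("nan")])
pc.is_null(arr)                     # [False, True, False] -- NaN is not null by default
pc.is_null(arr, nan_is_null=True)   # [False, True, True]  -- opt in to treating NaN as null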

Right now the pd.FloatXXDtype() data types become practically unusable the moment a NaN value is introduced, which can happen very easily. By at least giving users the option to .fillna on those types (or filter them out) they could continue to use the extension types, without casting back to NumPy. Right now I think that is the biggest weakness in our extension type system that prevents you from being able to use it generically.

I think starting with a smaller scope to just those few methods is helpful; trying to solve constructors is going to open a can of worms that will deter any progress, and I think more generally should be solved through PDEP-16 anyway (or maybe the follow up to it that focuses on helping to better distinguish these values).

@a-reich

a-reich commented Jul 23, 2024

I upvote starting with something that can be improved short-term vs needing to first reach consensus on a new holistic design.

@vkhodygo

@WillAyd

That maintains backwards compatibility and has a prior art in pyarrow (save the fact that pyarrow defaults to False).

Considering the fact that pandas employs the pyarrow engine, it should have the same defaults to avoid even more confusion. Breaking backwards compatibility is not a novel thing; one can't just keep legacy code forever. Besides, pandas devs do this all the time, don't you?

@WillAyd
Member

WillAyd commented Sep 27, 2024

For sure, but our history does make things complicated. Unfortunately, for over a decade pandas users have been commonly doing:

ser.iloc[0] = np.nan

to assign what they think is a "missing value". So we can't just immediately change that to literally mean NaN without some type of transition plan.

There is a larger discussion to that point in PDEP-0016 that you might want to chime in on and follow

#58988 (comment)
