pd.NA in object dtype #32931

Open
simonjayhawkins opened this issue Mar 23, 2020 · 4 comments
Labels
Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate NA - MaskedArrays Related to pd.NA and nullable extension arrays Needs Discussion Requires discussion from core team before further action

Comments

@simonjayhawkins
Member

simonjayhawkins commented Mar 23, 2020

extract from #32075 (comment)

If we want to handle pd.NA in object dtype better, we will need to start using masks as well, and not rely on numpy behaviour.

For example, this is also wrong:

In [9]: pd.Series([1, pd.NA], dtype=object) >= 1
Out[9]: 
0     True
1    False
dtype: bool
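
For comparison, here is a minimal sketch of what mask-aware behaviour could look like. It assumes two things not stated in the issue: that casting to the nullable `Int64` dtype is acceptable, and that a hand-written helper (`ge_with_mask`, a hypothetical name) is a fair stand-in for the mask idea. It is not how pandas implements object-dtype comparisons.

```python
import numpy as np
import pandas as pd

s_obj = pd.Series([1, pd.NA], dtype=object)

# Option 1: cast to a nullable dtype; its comparisons propagate pd.NA.
s_int = s_obj.astype("Int64")
nullable_result = s_int >= 1


# Option 2: hypothetical helper that tracks a mask by hand for object data.
def ge_with_mask(values: pd.Series, other) -> pd.Series:
    mask = values.isna().to_numpy()          # True where the value is missing
    out = np.zeros(len(values), dtype=bool)  # placeholder results
    # Compare only the non-missing entries; masked slots are never evaluated.
    out[~mask] = [v >= other for v in values[~mask]]
    # BooleanArray pairs the raw results with the mask, so missing stays <NA>.
    return pd.Series(pd.arrays.BooleanArray(out, mask), index=values.index)


masked_result = ge_with_mask(s_obj, 1)
print(nullable_result)
print(masked_result)
```

Both results use the `boolean` dtype, so the second element stays missing instead of being reported as `False`.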

Related issues:

@simonjayhawkins simonjayhawkins added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Needs Discussion Requires discussion from core team before further action NA - MaskedArrays Related to pd.NA and nullable extension arrays labels Mar 23, 2020
@ianbtr

ianbtr commented Apr 4, 2020

It looks like pd.NA is dropped entirely when concatenating two dataframes with object dtype, and this is very similar to #33065:

In[4]: df1 = pd.DataFrame(np.full((1, 1), pd.NA))
In[5]: df2 = pd.DataFrame(df1)
In[6]: df1
Out[6]: 
      0
0  <NA>
In[7]: df2
Out[7]: 
      0
0  <NA>
In[8]: pd.concat([df1, df2])
Out[8]: 
     0
0  NaN
0  NaN

This seems to be due to _get_empty_dtype_and_na and JoinUnit.get_reindexed_values in concat.py. The former selects np.nan for anything of dtype 'object'. In this case (and probably many others), 'object' implicitly has a null-value of np.nan.

If the ObjectBlock contained a mask for pd.NA, then JoinUnit.get_reindexed_values could apply it as necessary, and several other functions could use it as well.

Would this work?
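
Until object dtype carries such a mask, one possible workaround is to keep the data in a nullable extension dtype before concatenating. The sketch below assumes the column can be represented as `Int64`, which the comment above does not state. It shows a way around the problem, not a fix for the object-dtype path.

```python
import pandas as pd

# Assumption: the column can hold nullable integers instead of generic objects.
df1 = pd.DataFrame({0: pd.array([pd.NA], dtype="Int64")})
df2 = df1.copy()

# Concatenating two Int64 columns keeps the extension dtype and its mask,
# so the missing values remain pd.NA rather than becoming np.nan.
combined = pd.concat([df1, df2])
print(combined)
print(combined.dtypes)
```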

@eddy-geek

Not quite related -- I am surprised that

  • mix of int and pd.NA defaults to object dtype
  • convert_dtypes leaves it as object

Are these working as designed, already being ironed out, or should I open issues?

df = pd.DataFrame([7,8,9,pd.NA])
print(df)
print('auto        ', df.dtypes)
print('auto convert', df.convert_dtypes().dtypes)

dfi = pd.DataFrame([7,8,9,pd.NA], dtype='Int64')
print(dfi)
print('Int64        ', dfi.dtypes)
print('Int64 convert', dfi.convert_dtypes().dtypes)

print('pandas', pd.__version__)

gives:

      0
0     7
1     8
2     9
3  <NA>
auto         0    object
dtype: object
auto convert 0    object
dtype: object

      0
0     7
1     8
2     9
3  <NA>
Int64         0    Int64
dtype: object
Int64 convert 0    Int64
dtype: object
pandas 1.0.3

@simonjayhawkins
Member Author

  • mix of int and pd.NA defaults to object dtype

since pd.NA is experimental, changing the constructor to default to the best possible dtypes using dtypes supporting pd.NA seems reasonable.

  • convert_dtypes leaves it as object

from https://pandas.pydata.org/docs/dev/reference/api/pandas.DataFrame.convert_dtypes.html

Convert columns to best possible dtypes using dtypes supporting pd.NA.

so again this seems to be a reasonable expectation.

however, from https://pandas.pydata.org/docs/dev/user_guide/missing_data.html?highlight=convert_dtypes#conversion

If you have a DataFrame or Series using traditional types that have missing data
represented using np.nan, there are convenience methods
:meth:`~Series.convert_dtypes` in Series and :meth:`~DataFrame.convert_dtypes`
in DataFrame that can convert data to use the newer dtypes for integers, strings and
booleans

so it may be that the convert_dtypes docstring should also be more explicit that the conversion applies to np.nan
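
The documented np.nan case can be illustrated with a small sketch. The column name `a` and the explicit `astype("Int64")` cast are assumptions added for illustration, not something proposed in this thread. The object-dtype result is left unasserted because it may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Traditional float column with np.nan, the case the user guide describes.
df_nan = pd.DataFrame({"a": [7.0, 8.0, 9.0, np.nan]})
converted = df_nan.convert_dtypes()
print(df_nan.dtypes)     # float64 before conversion
print(converted.dtypes)  # integer-valued floats become a nullable integer dtype

# For object data that already holds pd.NA, an explicit cast avoids relying
# on inference (behaviour of convert_dtypes here may vary by pandas version).
explicit = pd.Series([7, 8, 9, pd.NA], dtype=object).astype("Int64")
print(explicit.dtype)
```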

or should I open issues?

These two issues could be discussed and addressed independently, so if you could report them as two separate issues, that would be great.

@mroeschke mroeschke added the Bug label Apr 28, 2020
@MichaelTiemannOSC
Contributor

I just hit this bug:

import pandas as pd
xx = pd.DataFrame([[pd.NA]], columns=[2015], index=pd.Index(['S1'], name='metric'))
yy = pd.DataFrame([[pd.NA]], columns=[2015], index=pd.Index(['S2'], name='metric'))
print(pd.concat([xx,yy]))

Expected

       2015
metric     
S1      <NA>
S2      <NA>

but got:

       2015
metric     
S1      NaN
S2      NaN

Maybe we need a dtype Object which is like object except it doesn't cast pd.NA to np.nan? <<<---Half-hearted joke.
