pd.NA in object dtype #32931

simonjayhawkins · 2020-03-23T15:23:14Z

extract from #32075 (comment)

If we want to handle pd.NA in object dtype better, we will need to start using masks as well, and not rely on numpy behaviour.

For example, this comparison also gives the wrong result:

In [9]: pd.Series([1, pd.NA], dtype=object) >= 1
Out[9]: 
0     True
1    False
dtype: bool
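
For comparison, here is a minimal sketch (not part of the original report) of how the same comparison behaves for the pd.NA scalar and for the nullable Int64 dtype, followed by a mask-based version for an object array. The helper name masked_ge is hypothetical and only illustrates the "use masks" idea; it is not pandas' implementation.

```python
import numpy as np
import pandas as pd

# Scalar comparison: pd.NA propagates instead of evaluating to False.
scalar_result = pd.NA >= 1

# Nullable integer dtype: the comparison is mask-based and returns a
# "boolean" Series whose second element is <NA>.
masked_result = pd.Series([1, pd.NA], dtype="Int64") >= 1


def masked_ge(values, other):
    """Hypothetical helper: `values >= other` on an object array, keeping pd.NA."""
    values = np.asarray(values, dtype=object)
    # True wherever the element is pd.NA (identity check, so np.nan is not matched).
    mask = np.array([v is pd.NA for v in values], dtype=bool)
    result = np.zeros(len(values), dtype=bool)
    # Only the non-NA elements are actually compared.
    result[~mask] = [bool(v >= other) for v in values[~mask]]
    return pd.Series(pd.arrays.BooleanArray(result, mask))


object_result = masked_ge([1, pd.NA], 1)

print(scalar_result)
print(masked_result)
print(object_result)
```

With a mask, the object-dtype case can return the same kind of result as the Int64 case instead of silently turning the missing value into False.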

Related issues:

Series.replace fails to replace value #32075 (replace not working)
BUG: truncated repr with pd.NA in object dtype column shows "NaN" #33065 (wrong truncated repr)
BUG: pd.NA acts differently when inside/outside a series/dataframe with object dtype #33066 (wrong comparison operations)
BUG: Cannot index into DataFrame with Nullable Integer as index #34497 (indexing with duplicates)


ianbtr · 2020-04-04T14:25:25Z

It looks like pd.NA is dropped entirely when concatenating two dataframes with object dtype, and this is very similar to #33065:

In[4]: df1 = pd.DataFrame(np.full((1, 1), pd.NA))
In[5]: df2 = pd.DataFrame(df1)
In[6]: df1
Out[6]: 
      0
0  <NA>
In[7]: df2
Out[7]: 
      0
0  <NA>
In[8]: pd.concat([df1, df2])
Out[8]: 
     0
0  NaN
0  NaN

This seems to be due to _get_empty_dtype_and_na and JoinUnit.get_reindexed_values in concat.py. The former selects np.nan for anything of dtype 'object'. In this case (and probably many others), 'object' implicitly has a null-value of np.nan.

If the ObjectBlock contained a mask for pd.NA, then JoinUnit.get_reindexed_values could apply it as necessary, and several other functions could use it as well.

Would this work?
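
As a rough user-level illustration of the mask idea (not the internal ObjectBlock change proposed above), the sketch below records where pd.NA occurs before concatenating and writes it back afterwards. The helpers na_mask and concat_keep_na are hypothetical names, and the code assumes all frames share the same columns in the same order.

```python
import numpy as np
import pandas as pd


def na_mask(df):
    """Boolean array marking the cells that hold pd.NA (np.nan is not matched)."""
    values = df.to_numpy(dtype=object)
    flags = [[v is pd.NA for v in row] for row in values]
    return np.array(flags, dtype=bool).reshape(values.shape)


def concat_keep_na(frames):
    """Hypothetical wrapper: row-wise pd.concat that restores pd.NA afterwards.

    Assumes every frame has the same columns in the same order.
    """
    mask = np.concatenate([na_mask(f) for f in frames], axis=0)
    combined = pd.concat(frames)
    values = combined.to_numpy(dtype=object, copy=True)
    values[mask] = pd.NA  # re-apply the recorded mask
    return pd.DataFrame(values, index=combined.index, columns=combined.columns)


df1 = pd.DataFrame(np.full((1, 1), pd.NA))
df2 = pd.DataFrame(df1)
restored = concat_keep_na([df1, df2])
print(restored)
```

The result stays object dtype, but the missing cells are pd.NA again rather than np.nan.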

eddy-geek · 2020-04-16T18:50:45Z

Not quite related -- I am surprised that:

- mix of int and pd.NA defaults to object dtype
- convert_dtypes leaves it as object

Are these working as designed, already being ironed out, or should I open issues?

import pandas as pd

df = pd.DataFrame([7,8,9,pd.NA])
print(df)
print('auto        ', df.dtypes)
print('auto convert', df.convert_dtypes().dtypes)

dfi = pd.DataFrame([7,8,9,pd.NA], dtype='Int64')
print(dfi)
print('Int64        ', dfi.dtypes)
print('Int64 convert', dfi.convert_dtypes().dtypes)

print('pandas', pd.__version__)

gives:

      0
0     7
1     8
2     9
3  <NA>
auto         0    object
dtype: object
auto convert 0    object
dtype: object

      0
0     7
1     8
2     9
3  <NA>
Int64         0    Int64
dtype: object
Int64 convert 0    Int64
dtype: object
pandas 1.0.3
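
Until the inference question is settled, one way to sidestep it is an explicit cast. This is a small sketch of that workaround, not a statement about what the constructor or convert_dtypes should do:

```python
import pandas as pd

# Inferred as object dtype in the report above.
df = pd.DataFrame([7, 8, 9, pd.NA])

# Explicitly cast to the nullable integer dtype; pd.NA is kept as the
# missing-value marker.
df_int = df.astype("Int64")

print(df_int)
print(df_int.dtypes)
```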

simonjayhawkins · 2020-04-17T10:50:47Z

> mix of int and pd.NA defaults to object dtype

Since pd.NA is experimental, changing the constructor to default to the best possible dtypes using dtypes supporting pd.NA seems reasonable.

> convert_dtypes leaves it as object

From https://pandas.pydata.org/docs/dev/reference/api/pandas.DataFrame.convert_dtypes.html:

> Convert columns to best possible dtypes using dtypes supporting pd.NA.

So again this seems to be a reasonable expectation.

However, from https://pandas.pydata.org/docs/dev/user_guide/missing_data.html?highlight=convert_dtypes#conversion:

> If you have a DataFrame or Series using traditional types that have missing data
> represented using np.nan, there are convenience methods
> Series.convert_dtypes in Series and DataFrame.convert_dtypes
> in DataFrame that can convert data to use the newer dtypes for integers, strings and
> booleans

So it may be that the convert_dtypes docstring should also be more explicit that the conversion applies to np.nan.

> or should I open issues?

These two issues could be discussed and addressed independently, so if you could report them as two separate issues, that would be great.
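
To make the user-guide wording concrete, here is a short sketch of the np.nan case that convert_dtypes is documented to handle: a float64 Series whose non-missing values are all whole numbers.

```python
import numpy as np
import pandas as pd

# Traditional representation: missing data stored as np.nan forces float64.
s = pd.Series([7, 8, 9, np.nan])

# convert_dtypes moves this to the nullable integer dtype, with the
# missing entry represented by pd.NA.
converted = s.convert_dtypes()

print(s.dtype, "->", converted.dtype)
print(converted)
```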

MichaelTiemannOSC · 2023-07-21T23:04:05Z

I just hit this bug:

import pandas as pd
xx = pd.DataFrame([[pd.NA]], columns=[2015], index=pd.Index(['S1'], name='metric'))
yy = pd.DataFrame([[pd.NA]], columns=[2015], index=pd.Index(['S2'], name='metric'))
print(pd.concat([xx,yy]))

Expected

       2015
metric     
S1      <NA>
S2      <NA>

but got:

       2015
metric     
S1      NaN
S2      NaN

Maybe we need a dtype Object which is like object except it doesn't cast pd.NA to np.nan? <<<---Half-hearted joke.
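
A possible workaround, sketched under the assumption that the column is meant to hold numeric data: cast the frames to a masked dtype before concatenating, so that the missing value is tracked by the dtype's own mask rather than by object dtype. The choice of "Int64" here is only illustrative.

```python
import pandas as pd

xx = pd.DataFrame([[pd.NA]], columns=[2015], index=pd.Index(["S1"], name="metric"))
yy = pd.DataFrame([[pd.NA]], columns=[2015], index=pd.Index(["S2"], name="metric"))

# Assumption: a nullable numeric dtype suits this column. With a masked
# dtype on both sides, concat has no reason to fall back to np.nan.
result = pd.concat([xx.astype("Int64"), yy.astype("Int64")])

print(result)
print(result.dtypes)
```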

Labels: Missing-data, Needs Discussion, NA - MaskedArrays (added by simonjayhawkins, Mar 23, 2020); Bug (added by mroeschke, Apr 28, 2020)

Issues and pull requests that mention this issue:

Series.replace fails to replace value #32075 (mentioned by simonjayhawkins, Mar 23, 2020; closed)
BUG: pd.NA acts differently when inside/outside a series/dataframe with object dtype #33066 (mentioned by jorisvandenbossche, Mar 27, 2020; open)
BUG: mix of int and pd.NA defaults to object dtype #33662 (mentioned by eddy-geek, Apr 19, 2020; open)
BUG: Cannot index into DataFrame with Nullable Integer as index #34497 (mentioned by jorisvandenbossche, May 31, 2020; closed)
[SPARK-31920][PYTHON] Fix pandas conversion using Arrow with __arrow_array__ columns apache/spark#28743 (mentioned by moskvax, Jun 10, 2020; closed)
BUG: Unexpected behaviour comparison dataframes with None values #34975 (mentioned by dsaxton, Jun 25, 2020; closed)
ENH: Allow opting in to new dtypes on I/O routines via keyword to I/O routines #29752 (mentioned by simonjayhawkins, Aug 6, 2020; closed)
set_index can not cope with pandas dtypes dask/dask#6671 (mentioned by TomAugspurger, Sep 25, 2020; closed)
tm.assert_index_equal broken with pd.NA and np.NaT sentinel #31884 (mentioned by jorisvandenbossche, Nov 3, 2020; closed)
API: bool(pd.NA) #38224 (mentioned by jorisvandenbossche, Dec 2, 2020; open)
BUG: Errors caused by DataFrame.all(..., skipna=False, ...) in rows without na values. #41079 (mentioned by mzeitlin11, Apr 21, 2021; open)
BUG: negating pd.Na and None #42862 (mentioned by mzeitlin11, Aug 3, 2021; open)
BUG: .value_counts converts pd.NA to np.nan #42851 (mentioned by phofl, Aug 14, 2021; open)
ROADMAP: Consistent missing value handling with new NA scalar #28095 (mentioned by jbrockmendel, Dec 18, 2021; open)
BUG: Row-wise comparison between two series always evaluates to all False when one series contains pd.NA #45599 (mentioned by jbrockmendel, Jan 25, 2022; closed)
BUG: Inconsistent NaN casting to float64 #46985 (mentioned by simonjayhawkins, May 16, 2022; closed)
BUG: Summing a series of bools doesn't return a number #48325 (mentioned by mzeitlin11, Sep 1, 2022; closed)
CI: PY311 failures #50124 (mentioned by jorisvandenbossche, Dec 11, 2022; closed)
