BUG: Inconsistent NaN casting to float64 #46985

Closed
3 tasks done
dvreed77 opened this issue May 10, 2022 · 1 comment
Labels
API Design Bug Duplicate Report Duplicate issue or pull request Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate NA - MaskedArrays Related to pd.NA and nullable extension arrays

Comments

@dvreed77

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df1 = pd.DataFrame({
    'id': [0, 1, 2],
    'null_ints': pd.Series([pd.NA] * 3, dtype="Int64"),
})

df2 = pd.DataFrame({
    'id': [0, 1, 2],
    'null_ints': pd.Series([pd.NA] * 3),
})

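# Int64 (masked) column: the cast succeeds and pd.NA becomes NaN
df1['null_ints'] = df1['null_ints'].astype("float64")
# object column: the cast raises TypeError
df2['null_ints'] = df2['null_ints'].astype("float64")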

Issue Description

Pandas inconsistently casts pd.NA values to NaN when casting from Int64 vs object. The latter raises a TypeError, but the former successfully converts pd.NA values to NaN.

Expected Behavior

Either these both fail, or they both succeed

Installed Versions

INSTALLED VERSIONS

commit : 04e01a1
python : 3.8.12.final.0
python-bits : 64
OS : Darwin
OS-release : 21.0.1
Version : Darwin Kernel Version 21.0.1: Tue Sep 14 20:56:24 PDT 2021; root:xnu-8019.30.61~4/RELEASE_ARM64_T6000
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 1.5.0.dev0+769.g04e01a1de0
numpy : 1.22.3
pytz : 2022.1
dateutil : 2.8.2
pip : 21.1.1
setuptools : 56.0.0
Cython : 0.29.28
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.3.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

@dvreed77 dvreed77 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels May 10, 2022
@simonjayhawkins
Member

Thanks @dvreed77 for the report.

Pandas inconsistently casts pd.NA values to NaN when casting from Int64 vs object. The latter raises a TypeError, but the former successfully converts pd.NA values to NaN.

pd.NA is a missing value scalar.

The object dtype column holds these objects explicitly (object dtype can hold any object, so it holds pd.NA if you choose to). So with .astype("float64") you are requesting an explicit cast of pd.NA to float64 dtype, and this fails with TypeError: float() argument must be a string or a number, not 'NAType', which seems reasonable as a numpy float64 dtype cannot hold missing values. pd.NA is experimental.
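A minimal sketch of the two casts side by side. The helper name try_cast is hypothetical and only exists for this illustration; it reports the exception type instead of letting the cast fail, so both outcomes can be inspected in one run. The exact error message may differ between pandas and Python versions.

```python
import pandas as pd


def try_cast(series, dtype="float64"):
    """Return the cast Series, or the exception type name if the cast raises.

    Hypothetical helper for illustration only; not part of pandas.
    """
    try:
        return series.astype(dtype)
    except (TypeError, ValueError) as exc:
        return type(exc).__name__


# object dtype: the pd.NA objects themselves are stored in the column
obj_series = pd.Series([pd.NA] * 3, dtype=object)
# Int64 dtype: missing values are tracked by a mask, not stored as pd.NA
masked_series = pd.Series([pd.NA] * 3, dtype="Int64")

obj_result = try_cast(obj_series)
masked_result = try_cast(masked_series)

print(obj_result)
print(masked_result)
```

On the versions discussed in this issue, the object cast is the one that fails, while the Int64 cast returns a float64 column of NaN.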

The Int64 dtype has its own missing value representation (a mask). The pd.NA object is used as the scalar representation of a missing value in the Int64 column, but the column does not store the missing values as pd.NA, so the cast is not explicitly casting any pd.NA values. I think the Int64 dtype is still experimental and representing missing values as np.nan on .astype("float64") is a design choice...

# coerce
if is_float_dtype(dtype):
    # In astype, we consider dtype=float to also mean na_value=np.nan
    na_value = np.nan

so it may be that Int64 dtype with missing values should also raise, as a regular numpy array does not hold missing values and the experimental EAs are intended to overcome the issues arising from pandas' legacy use of np.nan to represent missing values.
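To make the mask-based representation concrete, here is a small sketch using only public API. It uses an array with one missing entry (rather than the all-missing column from the report) so the mask is easier to read. Passing na_value explicitly to to_numpy states the same choice that astype("float64") makes implicitly in the snippet above.

```python
import numpy as np
import pandas as pd

# A nullable integer array with one missing entry
arr = pd.array([1, None, 3], dtype="Int64")

# The missing entry is exposed as the pd.NA scalar on access
missing_scalar = arr[1]

# isna() reflects the array's mask: True where a value is missing
mask = arr.isna()

# Explicit conversion: state which float value stands in for missing entries
as_float = arr.to_numpy(dtype="float64", na_value=np.nan)

print(mask)
print(as_float)
```

Spelling out na_value keeps the conversion to NaN visible at the call site instead of relying on the behaviour of astype.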

There are other issues and ongoing discussion regarding this, e.g. #32931 and #32265, so I am closing this as a duplicate.

@simonjayhawkins simonjayhawkins added API Design Duplicate Report Duplicate issue or pull request NA - MaskedArrays Related to pd.NA and nullable extension arrays Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate and removed Needs Triage Issue that has not been reviewed by a pandas team member labels May 16, 2022