BUG: Adding Series to empty DataFrame can reset dtype to float64 #42099

Closed
3 tasks done
JBGreisman opened this issue Jun 18, 2021 · 2 comments · Fixed by #42166
Labels: Bug; Indexing (related to indexing on series/frames, not to indexes themselves); Regression (functionality that used to work in a prior pandas version)
@JBGreisman
Contributor

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas (1.3.0rc1).

  • (optional) I have confirmed this bug exists on the master branch of pandas


Code Sample, a copy-pastable example

import pandas as pd
data = pd.array([0, 1, 2, 3], dtype="Int32")
df = expected = pd.DataFrame({"data": pd.Series(data)})
result = pd.DataFrame(index=df.index)
result.loc[df.index, "data"] = df["data"]

print(df["data"].dtype)     # prints: Int32
print(result["data"].dtype) # prints: float64 <--

Problem description

This behavior seems unexpected to me: the provided dtype should be preserved, not coerced to the default dtype for an empty Series. It occurs for the nullable integer dtypes as well as for Float32/Float64.
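To see which dtypes are affected on a given installation, the minimal example can be wrapped in a loop. This is only a sketch: the helper name `loc_expansion_dtype` and the particular list of dtypes are assumptions made for illustration, and the printed result depends on the pandas version, so no expected output is shown.

```python
import pandas as pd


def loc_expansion_dtype(dtype_name):
    """Hypothetical helper: repeat the minimal example for one dtype.

    Returns the dtype (as a string) of the column created via .loc.
    """
    data = pd.array([0, 1, 2, 3], dtype=dtype_name)
    df = pd.DataFrame({"data": pd.Series(data)})
    result = pd.DataFrame(index=df.index)
    # Setting with expansion: "data" does not exist in result yet.
    result.loc[df.index, "data"] = df["data"]
    return str(result["data"].dtype)


# Nullable integer and floating dtypes mentioned in the report.
dtype_names = ["Int8", "Int32", "Int64", "UInt16", "Float32", "Float64"]
report = {name: loc_expansion_dtype(name) for name in dtype_names}

for source, resulting in report.items():
    print(f"{source:>8} -> {resulting}")
```

A row where the two names differ indicates that the dtype was not preserved for that case.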

I came across this while implementing an ExtensionDtype that ended up failing BaseSetitemTests.test_setitem_with_expansion_dataframe_column:

def test_setitem_with_expansion_dataframe_column(self, data, full_indexer):
    # https://github.com/pandas-dev/pandas/issues/32395
    df = expected = pd.DataFrame({"data": pd.Series(data)})
    result = pd.DataFrame(index=df.index)
    key = full_indexer(df)
    result.loc[key, "data"] = df["data"]
    self.assert_frame_equal(result, expected)

Interestingly, in the tests for IntegerArray and FloatingArray, the test data includes missing values, which do not result in the coercion to float64:

import pandas as pd
data = pd.array([0, pd.NA, 2, 3], dtype="Int32")
df = expected = pd.DataFrame({"data": pd.Series(data)})
result = pd.DataFrame(index=df.index)
result.loc[df.index, "data"] = df["data"]

print(df["data"].dtype)     # prints: Int32
print(result["data"].dtype) # prints: Int32 <--

My expectation was that the dtype should be preserved in such cases, with or without missing values.

Expected Output

I would expect the dtype of the pd.Series being added to result to be preserved. In the case of the minimal example, result["data"] should be Int32Dtype.

print(df["data"].dtype)     # prints: Int32
print(result["data"].dtype) # prints: Int32 <--
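For comparison, assigning the whole column with plain `[]` indexing instead of `.loc` keeps the extension dtype. This is a sketch of a possible way around the coercion, not something proposed in this thread; it assumes the Series index matches the index of `result`, as it does in the minimal example.

```python
import pandas as pd

data = pd.array([0, 1, 2, 3], dtype="Int32")
df = pd.DataFrame({"data": pd.Series(data)})
result = pd.DataFrame(index=df.index)

# Whole-column assignment inserts the Series as a new column,
# so no float64 placeholder column is created first.
result["data"] = df["data"]

print(df["data"].dtype)      # prints: Int32
print(result["data"].dtype)  # prints: Int32
```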

Output of pd.show_versions()

This was generated from the latest release candidate, but it appears to also occur on the master branch (1.4.0.dev0+56.g648eb40abc)

INSTALLED VERSIONS

commit : 2dd9e9b
python : 3.8.5.final.0
python-bits : 64
OS : Darwin
OS-release : 17.7.0
Version : Darwin Kernel Version 17.7.0: Fri Oct 30 13:34:27 PDT 2020; root:xnu-4570.71.82.8~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.0rc1
numpy : 1.20.3
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.2
setuptools : 49.6.0.post20210108
Cython : None
pytest : 6.2.1
hypothesis : None
sphinx : 3.3.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.24.1
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 2021.05.0
fastparquet : None
gcsfs : None
matplotlib : 3.4.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.3
sqlalchemy : None
tables : None
tabulate : 0.8.7
xarray : None
xlrd : None
xlwt : None
numba : None

JBGreisman added the Bug and Needs Triage labels Jun 18, 2021
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Jun 18, 2021
@simonjayhawkins
Member

Thanks @JBGreisman for the report.

Expected Output

I would expect that the dtype of the pd.Series being added to result would be preserved, in the case of the minimal example, result["data"] should be Int32Dtype.

That was indeed the output in pandas 1.2.4 and earlier.

first bad commit: [527c789] API/BUG: always try to operate inplace when setting with loc/iloc[foo, bar] (#39163) cc @jbrockmendel

simonjayhawkins added the Indexing and Regression labels and removed the Needs Triage label Jun 18, 2021
simonjayhawkins added this to the 1.3 milestone Jun 18, 2021
@jbrockmendel
Member

Based on a quick look:

In Loc._setitem_with_indexer L1664 we add a new column with self.obj[key] = infer_fill_value(value). Here value is df["data"] from the OP. infer_fill_value(value) gives np.nan, so the inserted column is float64. Then later we come back and set the values into the new column. Because df["data"] can be written into a float64 column losslessly, it is done inplace.

Solution: make infer_fill_value smarter about preserving dtype.
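The idea of a dtype-preserving fill value can be sketched outside of pandas internals. The helper below is hypothetical (`placeholder_column_like` is not a pandas function and is not the change made in the linked fix); it only illustrates building an all-missing placeholder column that carries the dtype of the value about to be set, rather than a float64 column of np.nan.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionDtype


def placeholder_column_like(value, index):
    """Hypothetical helper: all-missing column with the dtype of `value`.

    For extension dtypes the dtype's own NA value is used, so the
    placeholder keeps e.g. Int32 instead of falling back to float64.
    """
    dtype = getattr(value, "dtype", None)
    if isinstance(dtype, ExtensionDtype):
        return pd.Series([dtype.na_value] * len(index), index=index, dtype=dtype)
    # Fallback for anything else: a float64 column filled with NaN.
    return pd.Series(np.nan, index=index, dtype="float64")


value = pd.Series(pd.array([0, 1, 2, 3], dtype="Int32"))
placeholder = placeholder_column_like(value, value.index)

print(placeholder.dtype)         # prints: Int32
print(placeholder.isna().all())  # prints: True
```

Setting the real values into such a placeholder would not need a detour through float64, which is the point of the suggestion above.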
