BUG: Adding Series to empty DataFrame can reset dtype to float64 #42099

JBGreisman · 2021-06-18T02:44:56Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas (1.3.0rc1).
(optional) I have confirmed this bug exists on the master branch of pandas

Code Sample, a copy-pastable example

import pandas as pd
data = pd.array([0, 1, 2, 3], dtype="Int32")
df = expected = pd.DataFrame({"data": pd.Series(data)})
result = pd.DataFrame(index=df.index)
result.loc[df.index, "data"] = df["data"]

print(df["data"].dtype)     # prints: Int32
print(result["data"].dtype) # prints: float64 <--

Problem description

This behavior seems unexpected to me: the provided dtype should be preserved, not coerced to the default dtype for an empty Series. This occurs for the nullable integer dtypes as well as Float32/Float64.

I came across this when trying to implement an ExtensionDtype that ended up failing on BaseSetitemTests.test_setitem_with_expansion_dataframe_column:

pandas/pandas/tests/extension/base/setitem.py

Lines 335 to 343 in 648eb40

    
    def test_setitem_with_expansion_dataframe_column(self, data, full_indexer):
        # https://github.com/pandas-dev/pandas/issues/32395
        df = expected = pd.DataFrame({"data": pd.Series(data)})
        result = pd.DataFrame(index=df.index)
        key = full_indexer(df)
        result.loc[key, "data"] = df["data"]
        self.assert_frame_equal(result, expected)

Interestingly, in the tests for IntegerArray and FloatingArray, the test data includes missing values (pd.NA), which do not result in the coercion to float64:

import pandas as pd
data = pd.array([0, pd.NA, 2, 3], dtype="Int32")
df = expected = pd.DataFrame({"data": pd.Series(data)})
result = pd.DataFrame(index=df.index)
result.loc[df.index, "data"] = df["data"]

print(df["data"].dtype)     # prints: Int32
print(result["data"].dtype) # prints: Int32 <--

I expected the dtype to be preserved in both cases, with or without missing values.

Expected Output

I would expect the dtype of the pd.Series being added to result to be preserved. In the minimal example, result["data"] should be Int32Dtype.

print(df["data"].dtype)     # prints: Int32
print(result["data"].dtype) # prints: Int32 <--

Output of `pd.show_versions()`

This was generated from the latest release candidate, but the bug also appears to occur on the master branch (1.4.0.dev0+56.g648eb40abc).

INSTALLED VERSIONS

commit : 2dd9e9b
python : 3.8.5.final.0
python-bits : 64
OS : Darwin
OS-release : 17.7.0
Version : Darwin Kernel Version 17.7.0: Fri Oct 30 13:34:27 PDT 2020; root:xnu-4570.71.82.8~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.0rc1
numpy : 1.20.3
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.2
setuptools : 49.6.0.post20210108
Cython : None
pytest : 6.2.1
hypothesis : None
sphinx : 3.3.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.24.1
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 2021.05.0
fastparquet : None
gcsfs : None
matplotlib : 3.4.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.3
sqlalchemy : None
tables : None
tabulate : 0.8.7
xarray : None
xlrd : None
xlwt : None
numba : None

simonjayhawkins · 2021-06-18T11:46:33Z

Thanks @JBGreisman for the report.

Expected Output

I would expect the dtype of the pd.Series being added to result to be preserved. In the minimal example, result["data"] should be Int32Dtype.

That was indeed the output in pandas 1.2.4 and earlier.

first bad commit: [527c789] API/BUG: always try to operate inplace when setting with loc/iloc[foo, bar] (#39163) cc @jbrockmendel

jbrockmendel · 2021-06-19T00:07:37Z

Based on a quick look:

In Loc._setitem_with_indexer L1664 we add a new column with self.obj[key] = infer_fill_value(value). Here value is df["data"] from the OP. infer_fill_value(value) gives np.nan, so the inserted column is float64. Then later we come back and set the values into the new column. Because df["data"] can be written into a float64 column losslessly, it is done inplace.

Solution: make infer_fill_value smarter about preserving dtype.
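
The two-step mechanism described above can be sketched with a toy model in plain Python. This is not pandas code: `Column`, `naive_fill_dtype`, `dtype_aware_fill_dtype`, `can_write_in_place`, and `setitem_with_expansion` are hypothetical names invented for illustration, and `None` stands in for a missing value. The model also assumes that a missing value is what prevents the lossless in-place write into a float64 column, which would be consistent with the dtype surviving in the second example of the report but is not confirmed in this thread.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Column:
    """Toy column: a dtype label plus values (None marks a missing value)."""
    dtype: str
    data: List


def naive_fill_dtype(values: Column) -> str:
    # Hypothetical stand-in for the reported behavior: the fill value is
    # NaN, so the newly inserted column is float64 whatever `values` is.
    return "float64"


def dtype_aware_fill_dtype(values: Column) -> str:
    # Hypothetical stand-in for the proposed fix: keep the incoming dtype.
    return values.dtype


def can_write_in_place(target_dtype: str, values: Column) -> bool:
    # Assumption: integers fit into float64 losslessly unless one is missing.
    if target_dtype == values.dtype:
        return True
    if target_dtype == "float64":
        return all(v is not None for v in values.data)
    return False


def setitem_with_expansion(
    values: Column, fill_dtype: Callable[[Column], str]
) -> Column:
    # Step 1: insert a new, empty column whose dtype comes from the fill value.
    new_col = Column(fill_dtype(values), [None] * len(values.data))
    # Step 2: write the values; if they fit, the column keeps its own dtype.
    if can_write_in_place(new_col.dtype, values):
        if new_col.dtype == "float64":
            new_col.data = [float(v) for v in values.data]
        else:
            new_col.data = list(values.data)
        return new_col
    # Otherwise the column is replaced and takes the dtype of the values.
    return Column(values.dtype, list(values.data))


no_missing = Column("Int32", [0, 1, 2, 3])
with_missing = Column("Int32", [0, None, 2, 3])

print(setitem_with_expansion(no_missing, naive_fill_dtype).dtype)        # float64
print(setitem_with_expansion(with_missing, naive_fill_dtype).dtype)      # Int32
print(setitem_with_expansion(no_missing, dtype_aware_fill_dtype).dtype)  # Int32
```

In this model the dtype is lost only when the placeholder column is float64 and the incoming values happen to fit into it, which is why choosing the placeholder dtype from the incoming values removes the problem.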

JBGreisman added the Bug and Needs Triage labels Jun 18, 2021

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Jun 18, 2021

code sample for pandas-dev#42099

ec017f9

simonjayhawkins added the Indexing and Regression labels and removed the Needs Triage label Jun 18, 2021

simonjayhawkins added this to the 1.3 milestone Jun 18, 2021

jbrockmendel mentioned this issue Jun 21, 2021

REGR: preserve Int32 dtype on setitem #42166

Merged

4 tasks

jreback closed this as completed in #42166 Jun 21, 2021

LouisLU9911 mentioned this issue Nov 30, 2022

BUG: Cannot use .loc to set a ndarray as the value of an empty dataframe #49972

Open

3 tasks
