pd.DataFrame.replace regression causes dtype to remain object #26632

Closed
Kjili opened this issue Jun 3, 2019 · 8 comments · Fixed by #29317
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions
Comments

@Kjili

Kjili commented Jun 3, 2019

Code Sample, a copy-pastable example if possible

import pandas as pd

# broken after pandas 0.23.4 when the dict has two keys but the frame contains only "a"
def return_replace(initial):
	return initial.replace({"a": 1.0, "b": 0.0})

# working
def return_replace_just_one(initial):
	return initial.replace({"a": 1.0})

# the following should all be float64
print("problem:", return_replace(pd.DataFrame(["a"])).dtypes[0])
print("works:", return_replace(pd.DataFrame(["b"])).dtypes[0])
print("works:", return_replace_just_one(pd.DataFrame(["a"])).dtypes[0])
print("works:", return_replace(pd.DataFrame(["a", "b"])).dtypes[0])

Problem description

The behaviour shown above is inconsistent and hard to spot. In my case, it broke one of my tests due to mismatching types (as I was expecting a float64).
The problem seems to involve a regression when upgrading from pandas 0.23.4 to any later version (tested with 0.24.0, 0.24.1 and 0.24.2, all of which have the same issue).
Restoring the old behaviour, where the dtype is also converted in the failing first case above (i.e. converted whenever possible), would be more consistent and would not require a manual type conversion.
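
Until then, the dtype can be restored manually after the call. A minimal sketch, assuming a hypothetical helper named `replace_and_infer` (not part of pandas) built on `DataFrame.infer_objects`, which soft-converts object columns to a more specific dtype where possible:

```python
import pandas as pd


def replace_and_infer(frame, mapping):
    """Hypothetical helper: replace values, then re-infer object column dtypes."""
    # infer_objects() only touches object-dtype columns; a column that
    # still contains strings after the replacement stays object.
    return frame.replace(mapping).infer_objects()


# dtype=object is spelled out so the starting point is unambiguous.
df = pd.DataFrame(["a"], dtype=object)
result = replace_and_infer(df, {"a": 1.0, "b": 0.0})
print(result.dtypes.iloc[0])
```

This is a workaround rather than a fix: it adds one extra pass over the object columns, but it does not depend on which dtype `replace` itself returns.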

This is probably related to #23305.

Expected Output

problem: float64
works: float64
works: float64
works: float64

Output of pd.show_versions()

Note: The below is for the working version of pandas!

INSTALLED VERSIONS

commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 5.1.6-arch1-1-ARCH
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: de_DE.UTF-8

pandas: 0.23.4
pytest: 4.4.2
pip: 19.1.1
setuptools: 41.0.1
Cython: None
numpy: 1.16.4
scipy: 1.3.0
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.1.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jschendel
Member

Please provide a reproducible example with sample data: https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

@jschendel jschendel added the Needs Info Clarification about behavior needed to assess issue label Jun 3, 2019
@Kjili
Author

Kjili commented Jun 4, 2019

The above shows the inconsistency of the behaviour and should be reproducible (works in my Python console). For a true minimal example of the problematic part only, you can copy&paste the following:

import pandas as pd
df = pd.DataFrame(["a"])
print(df.replace({"a": 1.0, "b": 0.0}).dtypes[0])

This should print float64.

@TomAugspurger
Contributor

I suspect this is the same root cause as #26632.

@qtux

qtux commented Aug 26, 2019

This bug is still present as of version 0.25.0. Bisecting with Kjili's minimal working example revealed commit 720d263 to be the one that introduced this bug.

@TomAugspurger
Contributor

Thanks for bisecting.

cc @peterpanmj.

@peterpanmj
Contributor

I am looking into it.

@TomAugspurger
Contributor

TomAugspurger commented Aug 30, 2019 via email

@peterpanmj
Contributor

This is indeed a bug.

In [1]: import pandas as pd
   ...: df = pd.DataFrame(["a"])
   ...: print(df.replace({"a": 1.0, "b": 0.0}).dtypes[0])
object
In [2]: import pandas as pd
   ...: df = pd.DataFrame(["a"])
   ...: print(df.replace({"a": 1.0}).dtypes[0])
float64

When there is a second replacer in the dict, the results are different. I've come up with a solution to it in my PR.
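
To check whether a given installation is affected, the two calls can be compared side by side. A small sketch; the function name `replace_dtypes` is made up for illustration, and no particular output is assumed here since it depends on the installed pandas version:

```python
import pandas as pd


def replace_dtypes():
    """Report the resulting dtype for a one-key and a two-key replacement dict."""
    # Start from an explicit object column holding only "a".
    df = pd.DataFrame(["a"], dtype=object)
    return {
        "one key": str(df.replace({"a": 1.0}).dtypes.iloc[0]),
        "two keys": str(df.replace({"a": 1.0, "b": 0.0}).dtypes.iloc[0]),
    }


print(pd.__version__, replace_dtypes())
```

If the two entries differ, the installation shows the inconsistency described in this issue.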

@jreback jreback added Bug Dtype Conversions Unexpected or buggy dtype conversions and removed Needs Info Clarification about behavior needed to assess issue labels Nov 2, 2019
@jreback jreback added this to the 1.0 milestone Nov 2, 2019