Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: fix combine_first converting timestamp to int #35514

Closed
wants to merge 11 commits into from

Conversation

nixphix
Copy link
Contributor

@nixphix nixphix commented Aug 2, 2020

This fix introduced two regression, but it appears like the fix only made the API consistent. Previously the failing regressions where inconsistent say df1.combine_first(df2) would not return the same result as df2.combine_first(df1) for the failing cases, more on these in the code comments.

Let me know if there is a better way to handle this.

@nixphix nixphix marked this pull request as ready for review August 2, 2020 06:54
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you post an example of what the change to a user would look like (e.g. what the 'regression' )is

pandas/tests/frame/methods/test_combine_first.py Outdated Show resolved Hide resolved
pandas/tests/frame/methods/test_combine_first.py Outdated Show resolved Hide resolved
pandas/tests/frame/methods/test_combine_first.py Outdated Show resolved Hide resolved
@nixphix
Copy link
Contributor Author

nixphix commented Aug 6, 2020

can you post an example of what the change to a user would look like (e.g. what the 'regression' )is

@jreback master is inconsistent when the other and self df's are swapped this is not captured by current test cases

# master
import pandas as pd
import numpy as np

df1 = pd.DataFrame([[np.nan, 3.0, True], [-4.6, np.nan, True], [np.nan, 7.0, False]])
df2 = pd.DataFrame([[-42.6, np.nan, True], [-5.0, 1.6, False]], index=[1, 2])

expected = pd.Series([True, True, False], name=2)
result1 = df1.combine_first(df2)[2]
result2 = df2.combine_first(df1)[2]

print(expected.dtype, result1.dtype, result2.dtype)
>>>bool bool object
# master
import pandas as pd

df1 = pd.DataFrame({"a": [0, 1, 3, 5]}, dtype="int64")
df2 = pd.DataFrame({"a": [1, 4]}, dtype="int64")

res1 = df1.combine_first(df2)
res2 = df2.combine_first(df1)

print(res1["a"].dtype, res2["a"].dtype)
>>>int64 float64

the current fix makes the function consistent but does make the result different from expected, in the first case now both ways we get a series of type object and the second case returns float64 either way.

# fix
import pandas as pd
import numpy as np

df1 = pd.DataFrame([[np.nan, 3.0, True], [-4.6, np.nan, True], [np.nan, 7.0, False]])
df2 = pd.DataFrame([[-42.6, np.nan, True], [-5.0, 1.6, False]], index=[1, 2])

expected = pd.Series([True, True, False], name=2)
result1 = df1.combine_first(df2)[2]
result2 = df2.combine_first(df1)[2]

print(expected.dtype, result1.dtype, result2.dtype)
>>>bool object object
# fix
import pandas as pd

df1 = pd.DataFrame({"a": [0, 1, 3, 5]}, dtype="int64")
df2 = pd.DataFrame({"a": [1, 4]}, dtype="int64")

res1 = df1.combine_first(df2)
res2 = df2.combine_first(df1)

print(res1["a"].dtype, res2["a"].dtype)
>>>float64 float64

let me know your thought on this.

@jbrockmendel
Copy link
Member

@nixphix this looks pretty good. can you merge master and we'll see if the CI passes

@nixphix
Copy link
Contributor Author

nixphix commented Oct 4, 2020

@jreback I have made the code changes but the ci is failing on an unrelated test case test_2d_to_1d_assignment_raises. Believe this is caused by breaking changes in dependency or the issue is in the master.

@nixphix nixphix requested a review from jreback October 6, 2020 00:23
@jreback jreback added this to the 1.2 milestone Oct 6, 2020
@jreback jreback added Dtype Conversions Unexpected or buggy dtype conversions Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate and removed Needs Review labels Oct 6, 2020
@nixphix nixphix requested a review from jreback October 7, 2020 17:41
@github-actions
Copy link
Contributor

github-actions bot commented Nov 7, 2020

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Nov 7, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. if you can parameterize those tests and ping on green.

result = df1.combine_first(df2)[2]
expected = Series([True, True, False], name=2)
tm.assert_series_equal(result, expected)
expected1 = pd.Series([True, True, False], name=2, dtype=object)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you parameterize this test (e.g. put the df1 and df2 in the parameter along with the expected)

@@ -339,9 +348,14 @@ def test_combine_first_int(self):
df1 = pd.DataFrame({"a": [0, 1, 3, 5]}, dtype="int64")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you parameterize this as well

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nixphix lgtm, can you merge master and address the test comments, ping on green.

res = df1.combine_first(df2)
tm.assert_frame_equal(res, df1)
assert res["a"].dtype == "int64"
exp1 = pd.DataFrame({"a": [0, 1, 3, 5]}, dtype="float64")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and make a seaparte test

@jreback jreback removed this from the 1.2 milestone Nov 18, 2020
@jreback
Copy link
Contributor

jreback commented Nov 25, 2020

this PR is prob ok, just needs a rebase and updating for comments.

],
)
def test_combine_first_timestamp_bug(val1, val2, nulls_fixture):

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment here "# GH#35514"

@arw2019
Copy link
Member

arw2019 commented Nov 29, 2020

Closing in favor of #38145

@arw2019 arw2019 closed this Nov 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

combine_first returns unexpected results for timestamp dataframes
5 participants