Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: fix corner case of lib.infer_dtype #23422

Merged
merged 10 commits into from
Nov 4, 2018

Conversation

h-vetinari
Copy link
Contributor

Working towards #23167 needs fixing of the type inference for some corner cases (especially not inferring an object column full of NaNs to be 'floating').

@pep8speaks
Copy link

Hello @h-vetinari! Thanks for submitting the PR.

@codecov
Copy link

codecov bot commented Oct 30, 2018

Codecov Report

Merging #23422 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #23422   +/-   ##
=======================================
  Coverage   92.23%   92.23%           
=======================================
  Files         161      161           
  Lines       51197    51197           
=======================================
  Hits        47220    47220           
  Misses       3977     3977
Flag Coverage Δ
#multiple 90.61% <ø> (ø) ⬆️
#single 42.27% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d78bd7a...19cf2dd. Read the comment docs.

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Oct 30, 2018

Annoyingly, io.sql.SQLTable.create_table needs to have the dtype returned even for empty columns, otherwise the test for empty tables fails. I had been able to fix the pytables failures that appeared in the CI, but I'm now reverting that (breaking) change.

I still think it's a good idea, but fixing that - depending on the use case - would probably need another keyword that determines whether to prioritize 'empty' over the dtype or vice versa.

In any case, what remains of this PR is fixing a corner case that is more clearly wrong (returning 'floating' for an all-NA object column if skipna=True).

h-vetinari added a commit to h-vetinari/pandas that referenced this pull request Oct 30, 2018
@gfyoung gfyoung added Dtype Conversions Unexpected or buggy dtype conversions Internals Related to non-user accessible pandas implementation labels Oct 30, 2018
@@ -244,6 +244,8 @@ Backwards incompatible API changes

- A newly constructed empty :class:`DataFrame` with integer as the ``dtype`` will now only be cast to ``float64`` if ``index`` is specified (:issue:`22858`)
- :meth:`Series.str.cat` will now raise if `others` is a `set` (:issue:`23009`)
- The method `pandas._libs.lib.infer_dtype` now returns `'empty'` rather than (sometimes) the dtype of the array,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this bug fix API-facing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO no, but I was being careful here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not a public function, remove

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@@ -1171,6 +1172,9 @@ def infer_dtype(object value, bint skipna=False):
values = construct_1d_object_array_from_listlike(value)

values = getattr(values, 'values', values)
if skipna:
values = values[~isnaobj(values)]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can incorporate this skipna logic into the for-loop below. Perhaps have an indicator to tell us whether we have seen an element in the values array that is non-null (when skipna is True).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, that's not directly possible (nor performant), because the line directly below (with _try_infer_map) will return prematurely as soon as it can grab hold of a dtype.

@h-vetinari
Copy link
Contributor Author

@gfyoung
Could you please restart the failed travis job. It was a hypothesis timeout, nothing else.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@h-vetinari you need to respond on the issue
this is as expected behavior

@h-vetinari
Copy link
Contributor Author

@jreback
There are two aspects to #23421:

  1. whether lib.infer_dtype(pd.Series([])) returns 'floating' vs. 'empty'
  2. whether lib.infer_dtype(pd.Series([np.nan, np.nan], dtype=object), skipna=True) returns 'floating' vs. 'empty'

I've reverted the fix for 1. (after the SQL errors I mentioned above), but kept 2. This is also not about the issue you linked #17261, but about an oversight of #17066. All-NA object column + skipna=True should clearly not return 'floating' (and this is causing errors in #23167).

])
def test_object_empty(self, dtype, skipna, expected):
# GH 23421
arr = pd.Series([np.nan, np.nan], dtype=dtype)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also test with None passed in (for object dtype)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@@ -244,6 +244,8 @@ Backwards incompatible API changes

- A newly constructed empty :class:`DataFrame` with integer as the ``dtype`` will now only be cast to ``float64`` if ``index`` is specified (:issue:`22858`)
- :meth:`Series.str.cat` will now raise if `others` is a `set` (:issue:`23009`)
- The method `pandas._libs.lib.infer_dtype` now returns `'empty'` rather than (sometimes) the dtype of the array,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not a public function, remove

@@ -1171,6 +1172,9 @@ def infer_dtype(object value, bint skipna=False):
values = construct_1d_object_array_from_listlike(value)

values = getattr(values, 'values', values)
if skipna:
values = values[~isnaobj(values)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a python and not a cimport, why are you not using checknull?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checknull only returns a single bint, and not an array. I would have liked to cimport isnaobj, but that didn't work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this isn array, ok, then add isnaobj to missing.pxd and make it a cpdef. then you can cimport it. (and you need to type return value)

pandas/tests/dtypes/test_inference.py Show resolved Hide resolved
Copy link
Contributor Author

@h-vetinari h-vetinari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for review

@@ -244,6 +244,8 @@ Backwards incompatible API changes

- A newly constructed empty :class:`DataFrame` with integer as the ``dtype`` will now only be cast to ``float64`` if ``index`` is specified (:issue:`22858`)
- :meth:`Series.str.cat` will now raise if `others` is a `set` (:issue:`23009`)
- The method `pandas._libs.lib.infer_dtype` now returns `'empty'` rather than (sometimes) the dtype of the array,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@@ -1171,6 +1172,9 @@ def infer_dtype(object value, bint skipna=False):
values = construct_1d_object_array_from_listlike(value)

values = getattr(values, 'values', values)
if skipna:
values = values[~isnaobj(values)]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checknull only returns a single bint, and not an array. I would have liked to cimport isnaobj, but that didn't work.

pandas/tests/dtypes/test_inference.py Show resolved Hide resolved
])
def test_object_empty(self, dtype, skipna, expected):
# GH 23421
arr = pd.Series([np.nan, np.nan], dtype=dtype)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@@ -1171,6 +1172,9 @@ def infer_dtype(object value, bint skipna=False):
values = construct_1d_object_array_from_listlike(value)

values = getattr(values, 'values', values)
if skipna:
values = values[~isnaobj(values)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this isn array, ok, then add isnaobj to missing.pxd and make it a cpdef. then you can cimport it. (and you need to type return value)

@jreback jreback added this to the 0.24.0 milestone Nov 1, 2018
h-vetinari added a commit to h-vetinari/pandas that referenced this pull request Nov 1, 2018
@h-vetinari
Copy link
Contributor Author

@jreback
All green.

@jreback
Copy link
Contributor

jreback commented Nov 2, 2018

can you rebase

h-vetinari added a commit to h-vetinari/pandas that referenced this pull request Nov 2, 2018
@@ -1,8 +1,14 @@
# -*- coding: utf-8 -*-

from numpy cimport ndarray, uint8_t

from tslibs.nattype cimport is_null_datetimelike
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you added this back in rebase. pls remove

from numpy cimport ndarray, uint8_t

from tslibs.nattype cimport is_null_datetimelike

cpdef bint checknull(object val)
cpdef bint checknull_old(object val)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no extra blank lines

@jreback jreback merged commit aaaac86 into pandas-dev:master Nov 4, 2018
@jreback
Copy link
Contributor

jreback commented Nov 4, 2018

thanks

@h-vetinari h-vetinari deleted the infer_empty branch November 5, 2018 00:14
JustinZhengBC pushed a commit to JustinZhengBC/pandas that referenced this pull request Nov 14, 2018
tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request Nov 19, 2018
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Internals Related to non-user accessible pandas implementation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API: fix corner cases of lib.infer_dtype
4 participants