Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix empty Data frames to JSON round-trippable back to data frames #21318

Merged
merged 15 commits into from
Jun 8, 2018

Conversation

pyryjook
Copy link
Contributor

@pyryjook pyryjook commented Jun 4, 2018

[x] closes #21287
[x] tests added / passed
[x]passes git diff upstream/master -u -- "*.py" | flake8 --diff
[x]whatsnew entry

Fixes the bug occurring when empty DF, previously saved to JSON-file, is read from JSON back to DF.

@WillAyd
Copy link
Member

WillAyd commented Jun 4, 2018

Can you add a test to cover this?

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need test and whatsnew

@WillAyd WillAyd added the IO JSON read_json, to_json, json_normalize label Jun 4, 2018
@pyryjook
Copy link
Contributor Author

pyryjook commented Jun 4, 2018

Sure, I'll add those both!

@codecov
Copy link

codecov bot commented Jun 4, 2018

Codecov Report

Merging #21318 into master will decrease coverage by 0.03%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21318      +/-   ##
==========================================
- Coverage   91.89%   91.85%   -0.04%     
==========================================
  Files         153      153              
  Lines       49596    49570      -26     
==========================================
- Hits        45576    45533      -43     
- Misses       4020     4037      +17
Flag Coverage Δ
#multiple 90.25% <100%> (-0.04%) ⬇️
#single 41.86% <0%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/io/json/table_schema.py 98.29% <100%> (ø) ⬆️
pandas/plotting/_core.py 82.39% <0%> (-1.15%) ⬇️
pandas/core/dtypes/missing.py 91.95% <0%> (-0.58%) ⬇️
pandas/core/dtypes/cast.py 88.06% <0%> (-0.16%) ⬇️
pandas/core/ops.py 96.35% <0%> (-0.06%) ⬇️
pandas/core/series.py 94.12% <0%> (-0.06%) ⬇️
pandas/core/reshape/merge.py 94.25% <0%> (ø) ⬆️
pandas/io/json/normalize.py 96.93% <0%> (+0.06%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8d5032a...8d5f127. Read the comment docs.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments but nothing major - just need to standardize and clean up the code here a bit

@@ -86,6 +87,16 @@ def test_multiindex(self):
assert result == expected


class TestParseSchema(object):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't need a new class here - can you just move this to TestTableOrientReader?

@@ -86,6 +87,16 @@ def test_multiindex(self):
assert result == expected


class TestParseSchema(object):

def test_empty_json_data(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename this to test_empty_frame_roundtrip

# GH21287
df = pd.DataFrame([], columns=['a', 'b', 'c'])
json = df.to_json(None, orient='table')
df = parse_table_schema(json, True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use pd.read_json here instead

@@ -92,7 +92,7 @@ I/O

- Bug in IO methods specifying ``compression='zip'`` which produced uncompressed zip archives (:issue:`17778`, :issue:`21144`)
- Bug in :meth:`DataFrame.to_stata` which prevented exporting DataFrames to buffers and most file-like objects (:issue:`21041`)
-
- Bug in IO JSON methods reading empty JSON schema back to DataFrame caused an error (:issue:`21287`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explicitly reference :func:`read_json` and make sure to qualify that this only applies when ``orient='table'``

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can also update reference to :class:`DataFrame`

df = pd.DataFrame([], columns=['a', 'b', 'c'])
json = df.to_json(None, orient='table')
df = parse_table_schema(json, True)
assert df.empty
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make sure that we've preserved the frame metadata we should use the tm.assert_frame_equal function here. Typically with tests we will:

  1. Create a variable called expected, which is very explicit about what we want (here that's just a copy of df)
  2. Assign to a variable called result, which here would be the result of pd.read_json
  3. Make sure result and expected match using tm.assert_frame_equal

@pyryjook
Copy link
Contributor Author

pyryjook commented Jun 5, 2018

Thanks for the comments! I'll look into this later today.

@pep8speaks
Copy link

pep8speaks commented Jun 5, 2018

Hello @pyryjook! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on June 08, 2018 at 23:39 Hours UTC

expected = df.copy()
out = df.to_json(None, orient='table')
result = pd.read_json(out, orient='table')
tm.assert_frame_equal(expected, result)
Copy link
Contributor Author

@pyryjook pyryjook Jun 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This raises an assertion error:

E       AssertionError: DataFrame.index are different
E
E       DataFrame.index classes are not equivalent
E       [left]:  Index([], dtype='object')
E       [right]: Float64Index([], dtype='float64')

That's something I need to dig deeper. If there is something obvious, that I'm missing, any pointers would be appreciated in such case.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this PR! Beat me to it :)
A bit weak but what do we think of just
pd.testing.assert_frame_equal(expected, actual, check_dtype=False) ?

Otherwise I would guess we have to go down the road of including the dtypes in the JSON representation?

Copy link
Contributor Author

@pyryjook pyryjook Jun 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! And actually it only works if I assert it like this:
pd.testing.assert_frame_equal(expected, result, check_dtype=False, check_index_type=False)
So both check_dtypeand check_index_type have to be set to False in order to get the assertion right.

Thoughts on this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ignoring the dtype difference is not the solution. The point of this format is to persist that metadata.

What I would do is check that the proper type information for the index is being written out (you can use a io.StringIO instance instead of writing to None). If that appears correct then there would be an issue with the reader that is ignoring or casting the type of the index after the fact

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you're right, ignoring the dtype and the index_type will just hide the problem.

Did some initial testings and it seems that on the reading side empty data with data.dtype == 'object' gets coerced to Float64 without any clear reason.

I'll push a commit with fix proposal for comments.

@@ -686,7 +686,7 @@ def _try_convert_data(self, name, data, use_dtypes=True,

result = False

if data.dtype == 'object':
if len(data) and data.dtype == 'object':
Copy link
Contributor Author

@pyryjook pyryjook Jun 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the fix that seems to solve the error.

Any thoughts on this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point of the JSON table schema is that we can be explicit about the types of the columns, so we shouldn't need to infer anything. Any way to avoid a call to this method altogether?

Copy link
Member

@gfyoung gfyoung Jun 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point of the JSON table schema is that we can be explicit about the types of the columns, so we shouldn't need to infer anything.

@WillAyd : I'm confused how this is relevant to the patch. It looks fine to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point is that since the datatypes are explicitly defined in the schema that we shouldn't need any type of inference, which I get the impression this method is doing

Copy link
Member

@gfyoung gfyoung Jun 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point is that since the datatypes are explicitly defined in the schema that we shouldn't need any type of inference, which I get the impression this method is doing

Sure, but I'm still not seeing relevance to this particular PR. The patch is pretty straightforward from what I see.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My personal opinion, I don't think option 2 makes much sense. As a user, I would rather have consistent, though undesirable behavior (and leave the test a bit weak, ignoring dtypes, or perhaps marking as expected failure?) than have empty and non empty data frame behave differently.
If at least non empty dataframe behaved as expected, and there was bad behavior on the corner case of empty ones, i could better see the case for living with inconsistent behavior.

Option 3 is obviously the best choice, though spontaneously seems a bit overkill to me? But, given #21140 I would be glad to take a look this week end and circle back with my best shot.

Copy link
Contributor Author

@pyryjook pyryjook Jun 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I swiftly went thru the past commits related to the method and that way tried to find out the reasoning behind the implementation. First of all it's quite old method (5 years) and it has seen only a few modifications during it's lifetime. On top of that, the functionality of it does seem to be quite complexly tied to multiple use cases.

In that light, knowing its quite subtle look on the matter, it sure looks like quite fundamental task to re-think the purpose (or existence) of that method within this PR.

I completely get the point that the fix with the len() or without would only be a compromise when looking at the whole picture.

Clearly, my opinion of the complexity is affected by the fact that this is my first contribution to this library :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @ludaavics that 2 is the least desirable. I would say if you don't see an apparent solution to number 3 above then create a separate issue about the coercion of object types to float with read_json and table='orient'. After that you can parametrize the test for check_dtypes, xfailing the strict check and placing a reference via a TODO comment to the issue around coercion.

Copy link
Member

@gfyoung gfyoung Jun 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd : Yes, that sounds like a good plan. Thanks for bearing with me. 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, sounds like a plan! I'll make the new issue and the changes to the code accordingly. Thanks guys!

@gfyoung gfyoung added the Bug label Jun 6, 2018
@@ -92,7 +92,7 @@ I/O

- Bug in IO methods specifying ``compression='zip'`` which produced uncompressed zip archives (:issue:`17778`, :issue:`21144`)
- Bug in :meth:`DataFrame.to_stata` which prevented exporting DataFrames to buffers and most file-like objects (:issue:`21041`)
-
- Bug in IO JSON :func:`read_json`reading empty JSON schema with ``orient='table'`` back to :class:DataFrame caused an error (:issue:`21287`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pyryjook : If you could address the merge conflicts, that would be great.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I’ll certainly do that when I’m on my laptop again!

out = df.to_json(orient='table')
result = pd.read_json(out, orient='table')
# TODO: After DF coercion issue (GH 21345) is resolved, tighten type checks
tm.assert_frame_equal(expected, result,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit but can you parametrize this with a "strict_check" parameter whose values can be True and False, with the former being marked as an xfail? You can see an example of this below:

None, "idx", pytest.param("index", marks=pytest.mark.xfail),

The explicit xfail gives more visibility to the issue (I'm being overly cautious here)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I’ll make the change. Have to say that I appreciate your pedantics on these! 😊

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes lgtm. I'm OK with merge if tests pass - thanks!

@pyryjook
Copy link
Contributor Author

pyryjook commented Jun 8, 2018

Great, I just resolved the merge conflicts in the whatsnew file.

@@ -84,4 +85,4 @@ Reshaping

Other

- Tab completion on :class:`Index` in IPython no longer outputs deprecation warnings (:issue:`21125`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this showing up in the diff?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, just noticed the same. Should not be there, sry

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries! Merging branches can surprise you sometimes 😄

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there was a removed newline, fixed

Copy link
Contributor Author

@pyryjook pyryjook Jun 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks! I did not have chance to fix it during the weekend, but great that it was resolved already.

@jreback jreback added this to the 0.23.1 milestone Jun 8, 2018
@jreback jreback merged commit 415012f into pandas-dev:master Jun 8, 2018
@jreback
Copy link
Contributor

jreback commented Jun 8, 2018

thanks @pyryjook

TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this pull request Jun 12, 2018
TomAugspurger pushed a commit that referenced this pull request Jun 12, 2018
david-liu-brattle-1 pushed a commit to david-liu-brattle-1/pandas that referenced this pull request Jun 18, 2018
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO JSON read_json, to_json, json_normalize
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Empty data frames not round-trippable to JSON
7 participants