BUG: Fix empty Data frames to JSON round-trippable back to data frames #21318

pyryjook · 2018-06-04T19:23:28Z

[x] closes #21287
[x] tests added / passed
[x] passes git diff upstream/master -u -- "*.py" | flake8 --diff
[x] whatsnew entry

Fixes the bug that occurs when an empty DataFrame, previously saved to a JSON file, is read from JSON back into a DataFrame.

WillAyd · 2018-06-04T19:24:21Z

Can you add a test to cover this?

WillAyd

Need test and whatsnew

pyryjook · 2018-06-04T19:57:09Z

Sure, I'll add both!

codecov · 2018-06-04T20:34:23Z

Codecov Report

Merging #21318 into master will decrease coverage by 0.03%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #21318      +/-   ##
==========================================
- Coverage   91.89%   91.85%   -0.04%     
==========================================
  Files         153      153              
  Lines       49596    49570      -26     
==========================================
- Hits        45576    45533      -43     
- Misses       4020     4037      +17

Flag	Coverage Δ
#multiple	`90.25% <100%> (-0.04%)`	⬇️
#single	`41.86% <0%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/io/json/table_schema.py	`98.29% <100%> (ø)`	⬆️
pandas/plotting/_core.py	`82.39% <0%> (-1.15%)`	⬇️
pandas/core/dtypes/missing.py	`91.95% <0%> (-0.58%)`	⬇️
pandas/core/dtypes/cast.py	`88.06% <0%> (-0.16%)`	⬇️
pandas/core/ops.py	`96.35% <0%> (-0.06%)`	⬇️
pandas/core/series.py	`94.12% <0%> (-0.06%)`	⬇️
pandas/core/reshape/merge.py	`94.25% <0%> (ø)`	⬆️
pandas/io/json/normalize.py	`96.93% <0%> (+0.06%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8d5032a...8d5f127.

WillAyd

Some comments but nothing major - just need to standardize and clean up the code here a bit

WillAyd · 2018-06-04T20:48:52Z

pandas/tests/io/json/test_json_table_schema.py

@@ -86,6 +87,16 @@ def test_multiindex(self):
        assert result == expected


+class TestParseSchema(object):


Shouldn't need a new class here - can you just move this to TestTableOrientReader?

WillAyd · 2018-06-04T20:49:11Z

pandas/tests/io/json/test_json_table_schema.py

@@ -86,6 +87,16 @@ def test_multiindex(self):
        assert result == expected


+class TestParseSchema(object):
+
+    def test_empty_json_data(self):


Rename this to test_empty_frame_roundtrip

WillAyd · 2018-06-04T20:49:41Z

pandas/tests/io/json/test_json_table_schema.py

+        # GH21287
+        df = pd.DataFrame([], columns=['a', 'b', 'c'])
+        json = df.to_json(None, orient='table')
+        df = parse_table_schema(json, True)


Let's use pd.read_json here instead

WillAyd · 2018-06-04T20:53:49Z

doc/source/whatsnew/v0.23.1.txt

@@ -92,7 +92,7 @@ I/O

 - Bug in IO methods specifying ``compression='zip'`` which produced uncompressed zip archives (:issue:`17778`, :issue:`21144`)
 - Bug in :meth:`DataFrame.to_stata` which prevented exporting DataFrames to buffers and most file-like objects (:issue:`21041`)
-
+- Bug in IO JSON methods reading empty JSON schema back to DataFrame caused an error (:issue:`21287`)


Explicitly reference :func:`read_json` and make sure to qualify that this only applies when ``orient='table'``

Can also update reference to :class:`DataFrame`

WillAyd · 2018-06-04T20:55:16Z

pandas/tests/io/json/test_json_table_schema.py

+        df = pd.DataFrame([], columns=['a', 'b', 'c'])
+        json = df.to_json(None, orient='table')
+        df = parse_table_schema(json, True)
+        assert df.empty


To make sure that we've preserved the frame metadata we should use the tm.assert_frame_equal function here. Typically with tests we will:

Create a variable called expected, which is very explicit about what we want (here that's just a copy of df)

Assign to a variable called result, which here would be the result of pd.read_json

Make sure result and expected match using tm.assert_frame_equal
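The three steps above can be sketched as follows. This is only an illustration of the expected/result pattern, not the test that was merged: it uses a small non-empty numeric frame (made-up values) so that it runs independently of the empty-frame bug under discussion, and it wraps the JSON string in `io.StringIO` because newer pandas versions prefer a file-like object over a literal string.

```python
from io import StringIO

import pandas as pd

# A small numeric frame; the values are illustrative only.
df = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})

# 1. `expected` states explicitly what we want back.
expected = df.copy()

# 2. `result` is what the round trip actually produces.
out = df.to_json(orient="table")
result = pd.read_json(StringIO(out), orient="table")

# 3. Compare values, dtypes, and index in a single call.
pd.testing.assert_frame_equal(result, expected)
```

Keeping `expected` and `result` as separate, clearly named variables makes a failure message easy to read: the left and right sides of the assertion map directly to what was wanted and what was produced.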

pyryjook · 2018-06-05T14:44:17Z

Thanks for the comments! I'll look into this later today.

pep8speaks · 2018-06-05T19:05:09Z

Hello @pyryjook! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on June 08, 2018 at 23:39 Hours UTC

pyryjook · 2018-06-05T19:11:12Z

pandas/tests/io/json/test_json_table_schema.py

+        expected = df.copy()
+        out = df.to_json(None, orient='table')
+        result = pd.read_json(out, orient='table')
+        tm.assert_frame_equal(expected, result)


This raises an assertion error:

E   AssertionError: DataFrame.index are different
E
E   DataFrame.index classes are not equivalent
E   [left]:  Index([], dtype='object')
E   [right]: Float64Index([], dtype='float64')

That's something I need to dig into more deeply. If there is something obvious that I'm missing, any pointers would be appreciated.

Thanks for doing this PR! Beat me to it :)
It's a bit weak, but what do we think of just
pd.testing.assert_frame_equal(expected, actual, check_dtype=False)?

Otherwise I would guess we have to go down the road of including the dtypes in the JSON representation?

Good point! Actually, it only works if I assert it like this:
pd.testing.assert_frame_equal(expected, result, check_dtype=False, check_index_type=False)
So both check_dtype and check_index_type have to be set to False for the assertion to pass.

Thoughts on this?

Ignoring the dtype difference is not the solution. The point of this format is to persist that metadata.

What I would do is check that the proper type information for the index is being written out (you can use an io.StringIO instance instead of writing to None). If that appears correct, then the issue is in the reader, which is ignoring or casting the type of the index after the fact.
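One way to carry out that check is sketched below. It writes the empty frame to an in-memory buffer and prints the field types recorded in the schema; the variable names are illustrative, and the exact type reported for the index may differ between pandas versions, so no particular value is assumed here.

```python
import io
import json

import pandas as pd

df = pd.DataFrame([], columns=["a", "b", "c"])

# Write to an in-memory buffer rather than passing None as the path.
buf = io.StringIO()
df.to_json(buf, orient="table")

payload = json.loads(buf.getvalue())

# The schema lists every field (index first) with its declared type.
field_types = {f["name"]: f["type"] for f in payload["schema"]["fields"]}
print(field_types)

# An empty frame has a full schema but no records.
print(payload["data"])
```

If the types printed for the index and columns look right, the writer is doing its job and attention can move to the reader.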

Yeah, you're right, ignoring the dtype and the index_type will just hide the problem.

I did some initial testing, and it seems that on the reading side empty data with data.dtype == 'object' gets coerced to float64 for no clear reason.

I'll push a commit with a proposed fix for comments.

pyryjook · 2018-06-05T21:09:16Z

pandas/io/json/json.py

@@ -686,7 +686,7 @@ def _try_convert_data(self, name, data, use_dtypes=True,

        result = False

-        if data.dtype == 'object':
+        if len(data) and data.dtype == 'object':


This is the fix that seems to solve the error.

Any thoughts on this?
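A minimal sketch of why the length guard changes the outcome, assuming the reader attempts a float cast on object data (as the surrounding diff context suggests). `maybe_float` is a hypothetical helper written for illustration; it is not the pandas method itself.

```python
import pandas as pd

empty = pd.Series([], dtype="object")
non_empty = pd.Series(["x", "y"], dtype="object")

# With no elements there is nothing that can fail to convert,
# so the cast succeeds and the dtype silently changes.
coerced = empty.astype("float64")
print(coerced.dtype)


def maybe_float(data):
    """Hypothetical helper: try the float cast only on non-empty object data."""
    if len(data) and data.dtype == "object":
        try:
            return data.astype("float64")
        except (TypeError, ValueError):
            pass
    return data


# The guarded version leaves both series as object dtype:
# the empty one is skipped, the non-numeric one fails the cast.
print(maybe_float(empty).dtype)
print(maybe_float(non_empty).dtype)
```

The guard only sidesteps the vacuous cast on empty data; it does not address whether inference should run at all when the schema already declares the types.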

The point of the JSON table schema is that we can be explicit about the types of the columns, so we shouldn't need to infer anything. Any way to avoid a call to this method altogether?

> The point of the JSON table schema is that we can be explicit about the types of the columns, so we shouldn't need to infer anything.

@WillAyd : I'm confused how this is relevant to the patch. It looks fine to me.

My point is that since the datatypes are explicitly defined in the schema that we shouldn't need any type of inference, which I get the impression this method is doing

> My point is that since the datatypes are explicitly defined in the schema that we shouldn't need any type of inference, which I get the impression this method is doing

Sure, but I'm still not seeing relevance to this particular PR. The patch is pretty straightforward from what I see.

In my personal opinion, option 2 doesn't make much sense. As a user, I would rather have consistent, though undesirable, behavior (and leave the test a bit weak, ignoring dtypes, or perhaps marking it as an expected failure?) than have empty and non-empty data frames behave differently.
If non-empty data frames at least behaved as expected, and the bad behavior were limited to the corner case of empty ones, I could better see the case for living with inconsistent behavior.

Option 3 is obviously the best choice, though at first glance it seems a bit overkill to me. But, given #21140, I would be glad to take a look this weekend and circle back with my best shot.

I quickly went through the past commits related to the method to find out the reasoning behind the implementation. First of all, it's quite an old method (5 years) and it has seen only a few modifications during its lifetime. On top of that, its functionality seems to be tied to multiple use cases in fairly complex ways.

In that light, and knowing that this was only a cursory look at the matter, rethinking the purpose (or existence) of that method looks like quite a fundamental task to take on within this PR.

I completely get the point that the fix, with or without the len() check, would only be a compromise when looking at the whole picture.

Clearly, my opinion of the complexity is affected by the fact that this is my first contribution to this library :)

I agree with @ludaavics that 2 is the least desirable. If you don't see an apparent solution to number 3 above, I would say create a separate issue about the coercion of object types to float with read_json and orient='table'. After that you can parametrize the test for check_dtypes, xfailing the strict check and adding a TODO comment that references the coercion issue.

@WillAyd : Yes, that sounds like a good plan. Thanks for bearing with me. 😄

Great, sounds like a plan! I'll make the new issue and the changes to the code accordingly. Thanks guys!

gfyoung · 2018-06-06T16:41:22Z

doc/source/whatsnew/v0.23.1.txt

@@ -92,7 +92,7 @@ I/O

 - Bug in IO methods specifying ``compression='zip'`` which produced uncompressed zip archives (:issue:`17778`, :issue:`21144`)
 - Bug in :meth:`DataFrame.to_stata` which prevented exporting DataFrames to buffers and most file-like objects (:issue:`21041`)
-
+- Bug in IO JSON :func:`read_json`reading empty JSON schema with ``orient='table'`` back to :class:DataFrame caused an error (:issue:`21287`)


@pyryjook : If you could address the merge conflicts, that would be great.

Sure, I’ll certainly do that when I’m on my laptop again!

WillAyd · 2018-06-06T20:04:03Z

pandas/tests/io/json/test_json_table_schema.py

+        out = df.to_json(orient='table')
+        result = pd.read_json(out, orient='table')
+        # TODO: After DF coercion issue (GH 21345) is resolved, tighten type checks
+        tm.assert_frame_equal(expected, result,


Minor nit but can you parametrize this with a "strict_check" parameter whose values can be True and False, with the former being marked as an xfail? You can see an example of this below:

pandas/pandas/tests/io/json/test_json_table_schema.py

Line 498 in 0c65c57

None, "idx", pytest.param("index", marks=pytest.mark.xfail),

The explicit xfail gives more visibility to the issue (I'm being overly cautious here)
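A sketch of what the requested parametrization could look like, assuming the loosened assertion from earlier in the thread (`check_dtype` and `check_index_type`). It is an illustration rather than the test as merged; on pandas versions where the coercion issue no longer occurs, the strict case would simply be reported as an unexpected pass.

```python
from io import StringIO

import pandas as pd
import pytest


@pytest.mark.parametrize(
    "strict_check",
    [pytest.param(True, marks=pytest.mark.xfail), False],
)
def test_empty_frame_roundtrip(strict_check):
    # GH 21287
    df = pd.DataFrame([], columns=["a", "b", "c"])
    expected = df.copy()
    out = df.to_json(orient="table")
    result = pd.read_json(StringIO(out), orient="table")
    # TODO: tighten once the coercion issue (GH 21345) is resolved.
    pd.testing.assert_frame_equal(
        expected,
        result,
        check_dtype=strict_check,
        check_index_type=strict_check,
    )
```

Marking only the strict case as xfail keeps the known gap visible in the test report while the loose case still guards against the original error.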

Sure, I’ll make the change. I have to say that I appreciate your pedantry on these! 😊

WillAyd

Changes lgtm. I'm OK with merge if tests pass - thanks!

pyryjook · 2018-06-08T03:55:34Z

Great, I just resolved the merge conflicts in the whatsnew file.

gfyoung · 2018-06-08T03:56:52Z

doc/source/whatsnew/v0.23.1.txt

@@ -84,4 +85,4 @@ Reshaping

 Other

- Tab completion on :class:`Index` in IPython no longer outputs deprecation warnings (:issue:`21125`)


Why is this showing up in the diff?

Yeah, just noticed the same. Should not be there, sry

No worries! Merging branches can surprise you sometimes 😄

there was a removed newline, fixed

Great, thanks! I did not have a chance to fix it during the weekend, but it's great that it was resolved already.

jreback · 2018-06-08T23:40:10Z

thanks @pyryjook

BUG: Fix empty Data frames to JSON round-trippable back to data frames (pandas-dev#21287)

6dfd976

WillAyd requested changes Jun 4, 2018

WillAyd added the IO JSON read_json, to_json, json_normalize label Jun 4, 2018

Add test and whatsnew

466e5a6

Empty line between test classes

db3a738

WillAyd requested changes Jun 4, 2018

pyryjook mentioned this pull request Jun 5, 2018

Empty data frames not round-trippable to JSON #21287

Closed

Changes based on review comments

5844301

pyryjook commented Jun 5, 2018

Pyry Kovanen added 3 commits June 5, 2018 22:35

Fix whatsnew + PEP

fd8fa93

Prevent empty data from being coerced to float64

2f347c0

Remove debugging messages

28d6e05

pyryjook commented Jun 5, 2018

Remove obsolete imports from tests

743c08f

gfyoung added the Bug label Jun 6, 2018

gfyoung reviewed Jun 6, 2018

Pyry Kovanen added 3 commits June 6, 2018 20:23

Merge remote-tracking branch 'upstream/master' into empty-json-empty-df-fix

833afea

Loosen test type checks, remove length check from JSON parser

2461b90

Add GH issue number to TODO comment

03a2b8a

WillAyd reviewed Jun 6, 2018

WillAyd mentioned this pull request Jun 6, 2018

read_json with table='orient' causes unexpected type coercion #21345

Closed

Parametrize JSON roundtrip test with xfail mark

fc15ba0

WillAyd approved these changes Jun 7, 2018

pyryjook force-pushed the empty-json-empty-df-fix branch from 7faa509 to af52815 Compare June 8, 2018 03:52

Merge remote-tracking branch 'upstream/master' into empty-json-empty-df-fix

0a26bf8

pyryjook force-pushed the empty-json-empty-df-fix branch from af52815 to 0a26bf8 Compare June 8, 2018 03:52

gfyoung reviewed Jun 8, 2018

jreback added this to the 0.23.1 milestone Jun 8, 2018

jreback added 2 commits June 8, 2018 19:37

Merge branch 'master' into PR_TOOL_MERGE_PR_21318

ecc631a

fix whatsnew

8d5f127

jreback merged commit 415012f into pandas-dev:master Jun 8, 2018

jreback added the Needs Backport label Jun 8, 2018

TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this pull request Jun 12, 2018

BUG: Fix empty Data frames to JSON round-trippable back to data frames (pandas-dev#21318) (cherry picked from commit 415012f)

c64d52f

TomAugspurger pushed a commit that referenced this pull request Jun 12, 2018

BUG: Fix empty Data frames to JSON round-trippable back to data frames (#21318) (cherry picked from commit 415012f)

4807bce

TomAugspurger removed the Needs Backport label Jun 12, 2018

david-liu-brattle-1 pushed a commit to david-liu-brattle-1/pandas that referenced this pull request Jun 18, 2018

BUG: Fix empty Data frames to JSON round-trippable back to data frames (pandas-dev#21318)

20b1b00

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018

BUG: Fix empty Data frames to JSON round-trippable back to data frames (pandas-dev#21318)

4591320
