Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for partition_cols in to_parquet #23321

Merged
merged 19 commits into from
Nov 10, 2018
Merged

Conversation

anjsudh
Copy link
Contributor

@anjsudh anjsudh commented Oct 24, 2018

  • closes #23283
  • tests passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

@pep8speaks
Copy link

pep8speaks commented Oct 24, 2018

Hello @anjsudh! Thanks for updating the PR.

Comment last updated on November 05, 2018 at 19:08 Hours UTC

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Please make sure that you always have tests first and foremost with any submission though.

Once added feel free to ping back and can take another look

@WillAyd WillAyd added the IO Parquet parquet, feather label Oct 25, 2018
@codecov
Copy link

codecov bot commented Oct 25, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@11c0d28). Click here to learn what that means.
The diff coverage is 90%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #23321   +/-   ##
=========================================
  Coverage          ?   92.25%           
=========================================
  Files             ?      161           
  Lines             ?    51277           
  Branches          ?        0           
=========================================
  Hits              ?    47305           
  Misses            ?     3972           
  Partials          ?        0
Flag Coverage Δ
#multiple 90.63% <90%> (?)
#single 42.32% <15%> (?)
Impacted Files Coverage Δ
pandas/core/frame.py 97.03% <ø> (ø)
pandas/io/parquet.py 84.76% <100%> (ø)
pandas/util/testing.py 86.71% <77.77%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 11c0d28...8b45547. Read the comment docs.

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a release note in 0.24.0 under enhancements? We'll also need to update the docstring for fname. IIUC, this will create a directory, rather than a single file, when partition_cols is included.

pandas/io/parquet.py Outdated Show resolved Hide resolved
pandas/tests/io/test_parquet.py Outdated Show resolved Hide resolved
pandas/tests/io/test_parquet.py Outdated Show resolved Hide resolved
@TomAugspurger
Copy link
Contributor

The error in https://travis-ci.org/pandas-dev/pandas/jobs/446310514#L2577 suggests that we'll need to put a minimum pyarrow version to support this behavior. Can you check what version that is, and raise a with an ImportError with a nice error message if the pyarrow is too old?

@anjsudh anjsudh force-pushed the master branch 10 times, most recently from d2ec124 to c18c99c Compare October 26, 2018 18:21
Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should mention somewhere that

doc/source/whatsnew/v0.24.0.txt Outdated Show resolved Hide resolved
pandas/io/parquet.py Outdated Show resolved Hide resolved
pandas/io/parquet.py Outdated Show resolved Hide resolved
pandas/io/parquet.py Outdated Show resolved Hide resolved
pandas/tests/io/test_parquet.py Outdated Show resolved Hide resolved
pandas/tests/io/test_parquet.py Outdated Show resolved Hide resolved
pandas/tests/io/test_parquet.py Outdated Show resolved Hide resolved
pandas/tests/io/test_parquet.py Outdated Show resolved Hide resolved
@anjsudh
Copy link
Contributor Author

anjsudh commented Oct 27, 2018

@WillAyd Hi, hope you can review the diff now? Have added the necessary tests

@xhochy
Copy link
Contributor

xhochy commented Oct 27, 2018

From a pyarrow perspective this is LGTM.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also pls update the documentatio in io.rst

pandas/io/parquet.py Show resolved Hide resolved
pandas/tests/io/test_parquet.py Outdated Show resolved Hide resolved
pandas/tests/io/test_parquet.py Outdated Show resolved Hide resolved
pandas/tests/io/test_parquet.py Outdated Show resolved Hide resolved
Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested couple of improvements to the docstring. Also, if you can run ./scripts/validate_docstrings.py pandas.DataFrame.to_parquet to see that everything else is correct in it, that would be great.

pandas/io/parquet.py Outdated Show resolved Hide resolved
pandas/io/parquet.py Outdated Show resolved Hide resolved
pandas/io/parquet.py Outdated Show resolved Hide resolved
@@ -4574,6 +4574,8 @@ Several caveats.
* Categorical dtypes can be serialized to parquet, but will de-serialize as ``object`` dtype.
* Non supported types include ``Period`` and actual Python object types. These will raise a helpful error message
on an attempt at serialization.
* ``partition_cols`` will be used for partitioning the dataset, where the dataset will be written to multiple
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rest of the items in this lists feel more like limitations of pandas / these engines. Requiring that path be a directory when partition_cols is set doesn't seem to fit here.

I think this is important / different enough to deserve a new small section below "Handling Indexes", with

  1. A description of what partition_cols requires (list of column names, directory for file path)
  2. A description of why you might want to use partition_cols
  3. A small example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -235,6 +235,7 @@ Other Enhancements
- New attribute :attr:`__git_version__` will return git commit sha of current build (:issue:`21295`).
- Compatibility with Matplotlib 3.0 (:issue:`22790`).
- Added :meth:`Interval.overlaps`, :meth:`IntervalArray.overlaps`, and :meth:`IntervalIndex.overlaps` for determining overlaps between interval-like objects (:issue:`21998`)
- :func:`~DataFrame.to_parquet` now supports writing a DataFrame as a directory of parquet files partitioned by a subset of the columns. (:issue:`23283`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably best to mention "with the pyarrow engine (this was previously supported with fastparquet)."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Column names by which to partition the dataset
Columns are partitioned in the order they are given
The behaviour applies only to pyarrow >= 0.7.0 and fastparquet
For other versions, this argument will be ignored.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it actually ignored for older pyarrows? I would have hoped it would raise when pyarrow gets the unrecognized argument.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it seems like we raise. Could you update this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

pandas/io/parquet.py Show resolved Hide resolved
pandas/io/parquet.py Show resolved Hide resolved
pandas/io/parquet.py Outdated Show resolved Hide resolved
@anjsudh anjsudh force-pushed the master branch 2 times, most recently from d5ee5ec to 1f0978f Compare November 5, 2018 19:08
Fix"Should raise error on using partition_cols and partition_on together"
@TomAugspurger
Copy link
Contributor

Merge conflict in io/parquet.py, if you could fix that up I think this will be good to go.

doc/source/io.rst Show resolved Hide resolved
doc/source/io.rst Show resolved Hide resolved
@@ -235,6 +235,7 @@ Other Enhancements
- New attribute :attr:`__git_version__` will return git commit sha of current build (:issue:`21295`).
- Compatibility with Matplotlib 3.0 (:issue:`22790`).
- Added :meth:`Interval.overlaps`, :meth:`IntervalArray.overlaps`, and :meth:`IntervalIndex.overlaps` for determining overlaps between interval-like objects (:issue:`21998`)
- With the pyarrow engine, :func:`~DataFrame.to_parquet` now supports writing a DataFrame as a directory of parquet files partitioned by a subset of the columns. (:issue:`23283`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

double backticks around DataFrame. say engine='pyarrow' (in double backticks)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

pandas/core/frame.py Show resolved Hide resolved
pandas/io/parquet.py Show resolved Hide resolved
doc/source/io.rst Outdated Show resolved Hide resolved
@anjsudh
Copy link
Contributor Author

anjsudh commented Nov 8, 2018

@jreback @WillAyd hope you can have a look ?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

off topic comments, lgtm. @WillAyd

@@ -1984,7 +1984,10 @@ def to_parquet(self, fname, engine='auto', compression='snappy',
Parameters
----------
fname : str
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

side issue. we use path elsewhere for IO routines. We should change this as well (out of scope here). would have to deprecate (the name) unfortunately.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we actually use path on the top-level .to_parquet, not sure how this is named this way.

doc/source/io.rst Show resolved Hide resolved
doc/source/whatsnew/v0.24.0.txt Outdated Show resolved Hide resolved
pandas/core/frame.py Show resolved Hide resolved
pandas/core/frame.py Show resolved Hide resolved
pandas/io/parquet.py Outdated Show resolved Hide resolved
@TomAugspurger
Copy link
Contributor

TomAugspurger commented Nov 8, 2018 via email

@TomAugspurger
Copy link
Contributor

Looks good. Nice work @anjsudh!

@TomAugspurger TomAugspurger merged commit 8ed92ef into pandas-dev:master Nov 10, 2018
thoo added a commit to thoo/pandas that referenced this pull request Nov 10, 2018
* upstream/master:
  ENH: Support for partition_cols in to_parquet (pandas-dev#23321)
thoo added a commit to thoo/pandas that referenced this pull request Nov 11, 2018
…fixed

* upstream/master:
  DOC: Fixes to docstring to add validation to CI (pandas-dev#23560)
  DOC: Remove incorrect periods at the end of parameter types (pandas-dev#23600)
  MAINT: tm.assert_raises_regex --> pytest.raises (pandas-dev#23592)
  DOC: Updating Series.resample and DataFrame.resample docstrings (pandas-dev#23197)
  ENH: Support for partition_cols in to_parquet (pandas-dev#23321)
  TST: Use intp as expected dtype in IntervalIndex indexing tests (pandas-dev#23609)
JustinZhengBC pushed a commit to JustinZhengBC/pandas that referenced this pull request Nov 14, 2018
tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request Nov 19, 2018
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Parquet parquet, feather
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants