Write partitioned Parquet file using to_parquet #23283

Closed
ispmarin opened this issue Oct 22, 2018 · 4 comments
Labels
IO Parquet
Milestone
0.24.0

Comments

@ispmarin

Hi,

I'm trying to write a partitioned Parquet file using the to_parquet function:

df.to_parquet('table_name', engine='pyarrow', partition_cols=['partone', 'parttwo'])
TypeError: __cinit__() got an unexpected keyword argument 'partition_cols'

Problem description

It was my understanding that the to_parquet method passes the kwargs through to pyarrow and saves a partitioned table.

Expected Output

Partitioned Parquet file saved.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-5-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 18.0
setuptools: 32.3.1
Cython: None
numpy: 1.15.2
scipy: 1.1.0
pyarrow: 0.11.0
xarray: None
IPython: 7.0.1
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Thanks!

@TomAugspurger
Contributor

pandas uses pyarrow.parquet.write_table. It seems like multi-part Datasets are written using pyarrow.parquet.write_to_dataset.

I'm not sure whether it makes sense for us to (optionally) use write_to_dataset, or whether pyarrow should support partition_cols in write_table.

cc @wesm if you have thoughts here.

@TomAugspurger added the IO Parquet label Oct 22, 2018
@xhochy
Contributor

xhochy commented Oct 22, 2018

In the case of partition_cols, one should use write_to_dataset. write_table is a much simpler, lower-level function.
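
For reference, a minimal sketch of writing such a frame as a partitioned dataset by calling pyarrow.parquet.write_to_dataset directly (the frame contents and root_path here are illustrative, based on the column names in the report above):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative frame using the partition columns from the report.
df = pd.DataFrame({
    'partone': ['a', 'a', 'b'],
    'parttwo': [1, 2, 1],
    'value': [0.1, 0.2, 0.3],
})

# write_to_dataset creates one Hive-style directory per partition value,
# e.g. table_name/partone=a/parttwo=1/<part>.parquet
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, root_path='table_name',
                    partition_cols=['partone', 'parttwo'])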

@TomAugspurger
Contributor

So, pandas could look for kwargs like partition_cols (any others?) and, if that's detected, use write_to_dataset(table, ...). That seems fine to me.
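
A rough sketch of that dispatch, assuming the engine already holds a pyarrow Table (an illustration of the idea, not the actual pandas implementation; the helper name _write is hypothetical):

import pyarrow.parquet as pq

def _write(table, path, partition_cols=None, **kwargs):
    # Hypothetical helper: route to write_to_dataset when partition_cols
    # is supplied, otherwise keep the single-file write_table path.
    if partition_cols is not None:
        pq.write_to_dataset(table, root_path=path,
                            partition_cols=partition_cols, **kwargs)
    else:
        pq.write_table(table, path, **kwargs)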

@anjsudh
Contributor

anjsudh commented Oct 24, 2018

Will pick this up

anjsudh added a commit to anjsudh/pandas that referenced this issue Oct 25, 2018
anjsudh added a commit to anjsudh/pandas that referenced this issue Oct 26, 2018
anjsudh added a commit to anjsudh/pandas that referenced this issue Oct 26, 2018
anjsudh added a commit to anjsudh/pandas that referenced this issue Oct 27, 2018
anjsudh added a commit to anjsudh/pandas that referenced this issue Oct 27, 2018
anjsudh added a commit to anjsudh/pandas that referenced this issue Oct 27, 2018
anjsudh added a commit to anjsudh/pandas that referenced this issue Oct 27, 2018
@jreback jreback added this to the 0.24.0 milestone Oct 28, 2018
JustinZhengBC pushed a commit to JustinZhengBC/pandas that referenced this issue Nov 14, 2018
tm9k1 pushed a commit to tm9k1/pandas that referenced this issue Nov 19, 2018
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this issue Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this issue Feb 28, 2019