
BigQuery scripts to import/export GeoParquet files #113

Closed · wants to merge 4 commits

Conversation

alasarr (Collaborator) commented Jun 14, 2022

This PR adds:

  • Export of a dataset (via SQL query) from BigQuery to GeoParquet.
  • Import of a GeoParquet dataset into BigQuery.

It doesn't aim to be a production-ready solution; in the future this will be supported natively by BigQuery. In the meantime, however, it provides a way to work with GeoParquet and BigQuery.

Convert a SQL query to parquet:

poetry run python bigquery_to_parquet.py \
    --input-query "SELECT * FROM carto-do-public-data.carto.geography_usa_blockgroup_2019" \
    --primary-column geom \
    --output geography_usa_blockgroup_2019 

Upload a parquet file or folder to BigQuery:

poetry run python parquet_to_bigquery.py \
    --input geography_usa_blockgroup_2019 \
    --output "cartodb-gcp-backend-data-team.alasarr.geography_usa_blockgroup_2019"

I've extracted some of the code generated in #87 into a shared module.

@alasarr alasarr requested review from cholmes and Jesus89 June 14, 2022 20:13
@@ -6,12 +6,14 @@ authors = []
license = "MIT"

[tool.poetry.dependencies]
python = "^3.8"
python = ">=3.8,<3.11"
alasarr (author):
Google BigQuery requires this


if mode.upper() == 'FOLDER':
    # Export to multiple files, because a single file might hit BigQuery limits
    # (UDF out of memory). https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet
    pq.write_to_dataset(arrow_table, root_path=output, partition_cols=['__partition__'], compression=compression)
Collaborator:

We should still have a discussion on best practices for partitioned datasets (#79), specifically around whether the _metadata file is required, suggested, etc.

alasarr (author):
Good point, I'll follow the conversation there.

pyarrow = "^7.0.0"
geopandas = "^0.10.2"
pygeos = "^0.12.0"
pandas = "^1.4.2"
click = "^8.1.2"
google-cloud-bigquery = "^3.2.0"
db-dtypes = "^1.0.2"
Collaborator:
Is this used? It doesn't appear to be imported.

alasarr (author):
I added it because google-cloud-bigquery complained and asked me to install it, so I did.

@cholmes (Member) commented Dec 15, 2022

Closing this, as we'll move it over to https://github.com/geoparquet/bigquery-converter and keep the main geoparquet repo more focused on the spec.
