
BigQuery scripts to import/export GeoParquet files #113

Closed · wants to merge 4 commits

Conversation

alasarr (Collaborator) commented Jun 14, 2022

This PR adds:

  • Export of a dataset (via SQL query) from BigQuery to GeoParquet.
  • Import of a GeoParquet dataset into BigQuery.

It doesn't aim to be a production-ready solution; in the future this will be supported natively by BigQuery. In the meantime, however, it provides a way to work with GeoParquet and BigQuery.

Convert a SQL query to parquet:

poetry run python bigquery_to_parquet.py \
    --input-query "SELECT * FROM carto-do-public-data.carto.geography_usa_blockgroup_2019" \
    --primary-column geom \
    --output geography_usa_blockgroup_2019 

Upload a parquet file or folder to BigQuery:

poetry run python parquet_to_bigquery.py \
    --input geography_usa_blockgroup_2019 \
    --output "cartodb-gcp-backend-data-team.alasarr.geography_usa_blockgroup_2019"

I've extracted some of the code generated in #87 into a shared module.

@alasarr alasarr requested review from cholmes and Jesus89 June 14, 2022 20:13
@@ -6,12 +6,14 @@ authors = []
license = "MIT"

[tool.poetry.dependencies]
python = "^3.8"
python = ">=3.8,<3.11"
alasarr (author):
Google BigQuery requires this


if mode.upper() == 'FOLDER':
    # Export to multiple files, because a single file might hit BigQuery limits
    # (UDF out of memory). https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet
    pq.write_to_dataset(arrow_table, root_path=output, partition_cols=['__partition__'], compression=compression)
Collaborator:

We should still have a discussion on best practices for partitioned datasets (#79), specifically around whether the _metadata file is required, suggested, etc.

alasarr (author):
Good point, I'll follow the conversation there.

pyarrow = "^7.0.0"
geopandas = "^0.10.2"
pygeos = "^0.12.0"
pandas = "^1.4.2"
click = "^8.1.2"
google-cloud-bigquery = "^3.2.0"
db-dtypes = "^1.0.2"
Collaborator:
Is this used? It doesn't appear to be imported.

alasarr (author):
I added it because google-cloud-bigquery complained and asked me to install it, so I did.

@cholmes (Member) commented Dec 15, 2022

Closing this, as we'll move it over to https://github.com/geoparquet/bigquery-converter and keep the main geoparquet repo more focused on the spec.
