bigquery-table-to-one-file

Using Google Cloud Dataflow, this trivial Java application reads a table in BigQuery and turns it into a single file in GCS (GZIP compressed). Why? Because BigQuery currently only supports unsharded exports of under 1 GB.

https://cloud.google.com/bigquery/docs/exporting-data
https://cloud.google.com/dataflow/

It uses the default credentials referenced by the GOOGLE_APPLICATION_CREDENTIALS environment variable. Read all about that here: https://developers.google.com/identity/protocols/application-default-credentials
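For example, point that variable at a service account key file before launching (the path below is just a placeholder):

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your-service-account-key.json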

In the code, change the table name, bucket details etc. to suit your needs. You'll also need to create the GCS bucket(s) yourself. I didn't bother making them CLI parameters. The heart of the pipeline looks roughly like the sketch below.
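This is a minimal sketch, not the repository's exact source: the class name, table name, output path, and the naive row-to-string formatting are all placeholders to adapt, and it assumes the Apache Beam Java SDK with the GCP IO module on the classpath.

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.Compression;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class BigQueryTableToOneFile {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline pipeline = Pipeline.create(options);

    pipeline
        // Read every row of the source table as a TableRow (a Map of column -> value).
        // The table reference here is a placeholder; swap in your own.
        .apply("ReadFromBigQuery",
            BigQueryIO.readTableRows().from("bigquery-public-data:samples.wikipedia"))
        // Naive formatting: join the column values into one line. Adapt to your schema.
        .apply("FormatRows", MapElements
            .into(TypeDescriptors.strings())
            .via((TableRow row) -> row.values().toString()))
        // withoutSharding() forces exactly one output file, which is the whole point.
        .apply("WriteOneFile", TextIO.write()
            .to("gs://<your_bucket>/output/table")
            .withSuffix(".csv.gz")
            .withCompression(Compression.GZIP)
            .withoutSharding());

    pipeline.run();
  }
}

The withoutSharding() call on TextIO.write() is what collapses the output to a single file; the read and format steps still run in parallel across workers, and only the final write is serialized.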

To run:

--project=<your_project_id> --runner=DataflowRunner --jobName=bigquery-table-to-one-file --maxNumWorkers=50 --zone=australia-southeast1-a --stagingLocation=gs://<your_bucket>/jars --tempLocation=gs://<your_bucket>/tmp
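If the project is built with Maven (an assumption, as is the main class name below), one way to pass those flags is through the exec plugin:

mvn compile exec:java \
  -Dexec.mainClass=BigQueryTableToOneFile \
  -Dexec.args="--project=<your_project_id> --runner=DataflowRunner --jobName=bigquery-table-to-one-file --maxNumWorkers=50 --zone=australia-southeast1-a --stagingLocation=gs://<your_bucket>/jars --tempLocation=gs://<your_bucket>/tmp"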

I tested it with the public Wikipedia sample table (1 billion rows, ~100 GB), and the job took about 6 hours using 50 n1-standard-1 workers.
