
GitHub Archive Data Pipeline Project

If you're more comfortable with Korean, see here.

What are the most common events on GitHub? Through this project I learned that "push" events are the most common, and that the repository with the most events in March was Lombiq/Orchard!

Dashboard view

image

Initialization

1. Install dependencies

Set up your environment to develop an Airflow DAG.

$ virtualenv venv --python=python3.10
$ source venv/bin/activate
(venv)$ pip install -r requirements.txt
(venv)$ pip install -r requirements-dev.txt
(venv)$ deactivate

2. Terraform

Terraform provisions a total of five resources:

  1. a GCS bucket to store the extracted data
  2. a BigQuery dataset and external table that reads from the GCS bucket above
  3. a dataset and table to store the dbt output
$ cd terraform
$ terraform init
$ terraform plan -var="project=<your-gcp-project-id>"
$ terraform apply -var="project=<your-gcp-project-id>"

3. Airflow

The VM running Airflow was created from the GCP console rather than provisioned with Terraform.

I followed Max's guide on installing Airflow on a VM: https://github.com/keeyong/data-engineering-batch12/blob/main/docs/Airflow%202%20Installation.md

Project Details

GitHub Archive Events Data Pipeline

  • Pipeline Flow

    zoomcamp drawio

  • DAG View

    dag

  • DAG Grid View

    dag_grid

  • DAG Graph View

    image

Airflow DAG File

The DAG fetches data from the GitHub Archive once an hour, at 30 minutes past the hour.
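As a rough sketch (not the exact DAG in this repo; the DAG id, task, and file paths below are made up for illustration), the hourly schedule and the GH Archive download could look like this:

```python
# Sketch only: an Airflow DAG that runs at 30 minutes past every hour and
# downloads the previous hour's GH Archive file. Names and paths are placeholders.
from datetime import datetime, timedelta

import requests
from airflow.decorators import dag, task
from airflow.operators.python import get_current_context


@dag(schedule_interval="30 * * * *",  # every hour, at minute 30
     start_date=datetime(2023, 3, 1),
     catchup=False)
def gharchive_fetch():
    @task
    def fetch() -> str:
        # GH Archive publishes one gzipped NDJSON file per hour:
        # https://data.gharchive.org/YYYY-MM-DD-H.json.gz (hour is not zero-padded)
        hour = get_current_context()["logical_date"] - timedelta(hours=1)
        url = f"https://data.gharchive.org/{hour:%Y-%m-%d}-{hour.hour}.json.gz"
        local_path = f"/tmp/{hour:%Y-%m-%d}-{hour.hour}.json.gz"
        with open(local_path, "wb") as f:
            f.write(requests.get(url, timeout=60).content)
        return local_path

    fetch()


gharchive_fetch()
```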

From the fetched data, I drop the payload field because I don't use it, convert the created_at field from a string to a timestamp, and save the result as a Parquet file. I did this with DuckDB.

DuckDB is an open-source, in-process analytical database that makes data processing easier for people who are more comfortable with SQL than with DataFrame APIs.
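A minimal sketch of that step with DuckDB's Python API (the file names are placeholders, not the exact paths used in the DAG):

```python
# Sketch only: drop the payload field, cast created_at to a timestamp,
# and write the result to Parquet. File names are placeholders.
import duckdb

con = duckdb.connect()  # an in-memory database is enough for a one-shot transform
con.execute("""
    COPY (
        SELECT * EXCLUDE (payload, created_at),             -- drop the unused payload
               CAST(created_at AS TIMESTAMP) AS created_at  -- string -> timestamp
        FROM read_json_auto('2023-03-01-15.json.gz')        -- GH Archive hourly file
    )
    TO '2023-03-01-15.parquet' (FORMAT PARQUET)
""")
```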

I've seen several blog posts saying DuckDB is more memory-efficient than Pandas but less performant than Polars; however, using Polars in Airflow threw an error similar to "Exit With Return code 2". As far as I could tell there was no workaround, so I switched from Polars to DuckDB and was very happy with the result :) (Ideally the compute would run outside the Airflow instance, but since I only have one DAG, I decided that moving the work elsewhere would be unnecessary overhead.)

The Parquet file is then uploaded to GCS using a Hive-partitioned layout so that the BigQuery external table can be partitioned. Here I partitioned by year, month, and day: (year={year}/month={month}/day={day})

This reduces the bytes scanned (and therefore the cost) when the external table is queried with a WHERE clause on the partition columns.
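As a sketch of what that looks like end to end (the bucket, dataset, and table names below are placeholders, not the real ones in this project):

```python
# Sketch only: upload the Parquet file under a Hive-partitioned prefix so that
# BigQuery can prune partitions. Bucket/object names are placeholders.
from google.cloud import storage

year, month, day = "2023", "03", "01"
blob_name = f"gh_archive/year={year}/month={month}/day={day}/2023-03-01-15.parquet"

storage.Client().bucket("your-gcs-bucket").blob(blob_name).upload_from_filename(
    "2023-03-01-15.parquet"
)

# A query on the external table that filters on the partition keys only scans
# the matching objects, e.g. (table name is a placeholder; the partition column
# types depend on how the external table's Hive partitioning is configured):
#
#   SELECT type, COUNT(*) AS event_count
#   FROM `your-project.gh_archive.events_external`
#   WHERE year = '2023' AND month = '03' AND day = '01'
#   GROUP BY type;
```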

The screenshots below show that the partitioning is applied to the external table as expected (bytes processed drop from 279 MB to 52 MB; year, month, and day are partition prefixes, not columns in the original data).

  • Without a WHERE clause

image

  • With a WHERE clause

image

I didn't build a separate CI/CD pipeline for the Airflow DAG file; I just update the code by pushing to GitHub and pulling from within the VM.

$ sudo su airflow

$ cd GHArchive-Data-Pipeline-Project/ && \
  git pull && \
  cd .. && \
  cp -r GHArchive-Data-Pipeline-Project/dags/* dags/

Why Not Use Prefect?

I initially tried Prefect, but I found the documentation unfriendly.

There was no documentation for deploying it to a VM, and the community was too small.

I realized I might not be able to finish my project that way, so I took the plunge: I abandoned Prefect and moved to the more familiar Airflow.

In my opinion, Prefect won't be the new replacement for Airflow any time soon (honestly, it just wasn't good; I didn't find anything it does better).

dbt

See here for the dbt-related documentation.

