This ETL pipeline extracts YouTube comments for the provided channel, transforms the data with AWS Lambda, and loads the staged data into a PostgreSQL database on an AWS RDS instance. The results are visualized in a Tableau dashboard.
- Airflow as orchestration (v2.5.0)
- Terraform for provisioning infrastructure
- AWS S3 as storage for raw and stage data
- AWS Lambda for data processing
- AWS RDS instance establishes PostgreSQL database
- Tableau for dashboarding
- Add configuration.env as Airflow variables (all key-value pairs from terraform outputs)
- Connect Airflow with AWS
- Call the YouTube API (specifically: https://www.googleapis.com/youtube/v3/commentThreads) and upload the response as raw data (JSON) to the S3 bucket. Creates directory: raw/
- Generate a CREATE SQL script for provided channel
- Invoke AWS Lambda to transform raw data into stage data (checks for duplicates and null values, performs sentiment analysis). Creates directory: stage/
- Check if stage data exists in S3
- Connect Airflow with the AWS RDS instance
- Run the CREATE SQL script for PostgreSQL DB: youtube_comment_db (table is named youtube_{channel_name}_data)
- Load stage data from S3 to youtube_comment_db
- Generate an ALTER SQL script
- Run the ALTER SQL script
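The Lambda transform step above can be sketched in plain Python. This is a hypothetical, simplified version: the field names (`comment_id`, `text`), the word lists, and the naive word-matching sentiment scorer are all assumptions, standing in for whatever sentiment library the actual Lambda uses.

```python
# Hypothetical sketch of the Lambda transform: de-duplicate comments,
# drop records with null fields, and attach a naive sentiment label.
POSITIVE = {"great", "love", "awesome", "good"}
NEGATIVE = {"bad", "hate", "awful", "boring"}

def naive_sentiment(text: str) -> str:
    # Count positive vs. negative word hits; ties are neutral.
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

def transform(raw_comments: list[dict]) -> list[dict]:
    seen, stage = set(), []
    for c in raw_comments:
        # Skip records with missing values and duplicate comment IDs
        if c.get("comment_id") is None or c.get("text") is None:
            continue
        if c["comment_id"] in seen:
            continue
        seen.add(c["comment_id"])
        stage.append({**c, "sentiment": naive_sentiment(c["text"])})
    return stage
```

In the real pipeline the cleaned records would then be written back to S3 under `stage/` for the load step to pick up.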
- Docker with 4 GB of memory and Docker-Compose v1.29.1 [Needed for Airflow v2.5.0]
- Install and configure AWS CLI
- Install and configure Google Cloud CLI (gcloud)
- Create API key on Google Cloud Console with access to "YouTube Data API v3"
- Terraform for creating AWS infrastructure
Start Pipeline
make airflow-up
- initialize Airflow database migrations and start Airflow

make zip
- create AWS Lambda Layer via zip file of requirement.txt

make infra-up
- initialize and apply infrastructure using Terraform (type "yes" when prompted)

make config
- generate key-value pairs for configuration.env

- Run Airflow DAG
Stop Pipeline
make infra-down
- destroy infrastructure

make airflow-down
- stop Airflow by deleting containers
# Provide the channel link, the infix for the name of your table in RDS PostgreSQL,
# and the maximum number of comments [max 100]
# (the table name will be formatted to be PostgreSQL friendly, see SQL directory)
provide_channel_name = 'https://www.youtube.com/channel/UCJjSDX-jUChzOEyok9XYRJQ'
provide_table_infix = 'Jubilee'
provide_num_of_comments = 100
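The PostgreSQL-friendly table-name formatting mentioned above might look like the following sketch. The actual sanitization rules live in the SQL directory, so this regex-based version is an assumption:

```python
import re

def format_table_name(infix: str) -> str:
    # Lowercase the infix, replace anything that is not a letter, digit, or
    # underscore, and trim stray underscores so the result is a valid,
    # unquoted PostgreSQL identifier.
    safe = re.sub(r"[^a-z0-9_]+", "_", infix.lower()).strip("_")
    return f"youtube_{safe}_data"
```

With the example configuration, `provide_table_infix = 'Jubilee'` would yield a table named `youtube_jubilee_data`.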