Skip to content

jaydayno/youtube_comments_pipeline

Repository files navigation

Youtube Comments ETL Pipeline

Processing text data from the Youtube API while applying sentiment analysis on comments

This ETL pipeline extracts Youtube comments relevant to the provided channel, transforms the data through AWS Lambda and loads stage data into an AWS RDS instance PostgreSQL database.

Concluded with data visualization in Tableau dashboard.

Table of Contents

Architecture

Technologies

DAG Task Summary

  1. Add configuration.env as Airflow variables (all key-value pairs from terraform outputs)
  2. Connect Airflow with AWS
  3. Call the Youtube API (specifically: https://www.googleapis.com/youtube/v3/commentThreads) and uploading as raw data (json) to S3 bucket. Creates directory: raw/
  4. Generate a CREATE SQL script for provided channel
  5. Invoke AWS Lambda to transform raw data into stage data (checks for duplicates, null values, performs sentiment analysis). Creates directory: stage/
  6. Check if stage data exists in S3
  7. Connect Airflow with the AWS RDS instance
  8. Run the CREATE SQL script for PostgreSQL DB: youtube_comment_db (table is named youtube_{channel_name}_data)
  9. Load stage data from S3 to youtube_comment_db
  10. Generate a ALTER SQL script
  11. Run the ALTER SQL script

Requirements

Usage

Start Pipeline

  1. make airflow-up - initialize Airflow database migrations and start Airflow
  2. make zip - create AWS Lambda Layer via zip file of requirement.txt
  3. make infra-up - initialize and apply infrastructure using Terraform (type "yes" when prompted)
  4. make config - generate key-value pairs for configuration.env
  5. Run Airflow DAG


Stop Pipeline

  1. make infra-down - destroy infrastructure
  2. make airflow-down - stop Airflow by deleting containers

Example Dashboard

Channel Example: Jubilee

# Provide channel_link, the infix for the name of your table in RDS postgres and max amount of comments [max 100]
# (table name will be formatted to be PostgreSQL friendly, see SQL directory)
provide_channel_name = 'https://www.youtube.com/channel/UCJjSDX-jUChzOEyok9XYRJQ'
provide_table_infix = 'Jubilee'
provide_num_of_comments = 100 

Link to Public Tableau Dashboard

Tablular Data

Statistics Summary