Movies-ETL

Overview

I was tasked with creating an automated ETL pipeline that builds a movie database from both structured and unstructured data. The code extracts Wikipedia JSON data, Kaggle metadata, and MovieLens rating data; transforms the data with regular expressions and cleaning functions; and finally loads it into a PostgreSQL database.
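The flow can be illustrated with a short sketch. The file paths, connection string, and the single regex cleanup shown here are simplified assumptions for illustration; the full logic lives in the notebooks listed under Resources.

```python
# A minimal sketch of the extract-transform-load flow, not the exact notebook code.
# File paths, the connection string, and the one regex cleanup are assumptions.
import json
import re

import pandas as pd
from sqlalchemy import create_engine


def extract_transform_load(wiki_file, kaggle_file, db_string):
    # Extract: Wikipedia JSON and Kaggle metadata CSV into DataFrames.
    with open(wiki_file, mode="r") as file:
        wiki_movies_raw = json.load(file)
    wiki_movies_df = pd.DataFrame(wiki_movies_raw)
    kaggle_metadata = pd.read_csv(kaggle_file, low_memory=False)

    # Transform: one example regex, pulling box office strings like "$123.4 million".
    form = r"\$\s*\d+\.?\d*\s*milli?on"
    box_office = wiki_movies_df.get("Box office", pd.Series(dtype="object")).dropna()
    wiki_movies_df["box_office"] = (
        box_office.astype(str).str.extract(f"({form})", flags=re.IGNORECASE, expand=False)
    )
    kaggle_metadata["release_date"] = pd.to_datetime(
        kaggle_metadata["release_date"], errors="coerce"
    )

    # Load: write the cleaned movies data into PostgreSQL (the notebooks merge the
    # Wikipedia and Kaggle frames first; only the Wikipedia frame is written here).
    engine = create_engine(db_string)
    wiki_movies_df.to_sql(name="movies", con=engine, if_exists="replace", index=False)
```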

Resources

  • Extract - ETL_function_test.ipynb

  • Transform - ETL_clean_wiki_movies.ipynb & ETL_clean_kaggle_data.ipynb

  • Load - ETL_create_database.ipynb

  • Languages: Python, SQL

  • Tools: Jupyter Notebook, JSON, Pandas, NumPy, sqlalchemy, psycopg2

  • Data Source(s): wikipedia.movies.json, movies_metadata.csv, ratings.csv (note: ratings.csv is too large to push to the repo; see the chunked-load sketch below)
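Because ratings.csv holds more than 26 million rows, one way to load it is to stream it into the ratings table in chunks rather than reading the whole file into memory. This is a sketch that assumes a local PostgreSQL database named movie_data and placeholder credentials, not the exact code in ETL_create_database.ipynb.

```python
# Chunked load of the large ratings.csv into PostgreSQL.
# The connection string and chunk size are illustrative assumptions.
import pandas as pd
from sqlalchemy import create_engine

db_string = "postgresql://postgres:password@localhost:5432/movie_data"  # assumed credentials
engine = create_engine(db_string)

rows_imported = 0
for chunk in pd.read_csv("ratings.csv", chunksize=1_000_000):
    chunk.to_sql(name="ratings", con=engine, if_exists="append", index=False)
    rows_imported += len(chunk)
    print(f"{rows_imported:,} rows imported")
```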

Results

The resulting database consists of two tables that are highlighted below.

A movies table with 6,052 rows, each representing a unique movie title. The table also includes 31 columns of production-related information, such as movie budget, producers, actors, and release date.

[Screenshot: movies table]

A ratings table with 26,024,289 rows, each representing a viewer rating from 1 to 5. The table also includes 5 columns, such as movie title, rating, and date.

[Screenshot: ratings table]
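As a quick sanity check, the row counts above can be confirmed against the database. This sketch assumes the same movie_data connection string used in the loading sketch.

```python
# Confirm the movies and ratings table sizes reported above.
# The connection string is an illustrative assumption.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://postgres:password@localhost:5432/movie_data")

for table in ("movies", "ratings"):
    count = pd.read_sql(f"SELECT COUNT(*) AS n FROM {table}", engine)["n"].iloc[0]
    print(f"{table}: {count:,} rows")
```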