
Python - Redshift ETL

Workflow Diagram

This workflow is intended to provide a quick solution for performing data transformation in Python and uploading the transformed dataset into a Redshift database. I created it because the need frequently arose to upload data regularly to Redshift databases, either from simple Excel files or from transformations of existing raw data in Redshift. With some modification, the workflow could also support other database types. Some advantages of this workflow:

  • Improve upon the notoriously slow to_sql function in Pandas. While to_sql is an easy way to create and append data to tables in Redshift, it operates incredibly slowly even with advanced functionality activated (such as setting method=multi). This script takes advantage of S3 COPY statements, which can bring upload time down from hours to seconds (a minimal sketch of this pattern follows this list).

  • Leverage pre-written SQL scripts to save time and reduce code redundancy. Oftentimes when creating an ETL pipeline, developers start out in a database IDE writing SQL to create schema structures, access existing raw data, and grant permissions to users. This is a natural starting point that produces valuable code, which ideally can be leveraged as-is in an ETL pipeline without embedding individual SQL statements inside the Python code.

  • Keep credentials managed with best practices. Usernames, passwords, and connection details should never be hardcoded in the code itself; instead they belong in configuration files as key-value pairs.

  • Run jobs on a pre-determined schedule, hassle free. This works best if your Python instance lives on a remote server and can therefore take full advantage of cron scheduling.

This workflow has proven useful across multiple projects as a quick, self-serve method for performing ETL without too much overhead.
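
To make the COPY-based load and the config-driven credentials more concrete, here is a minimal sketch of the pattern in Python. The config section names ([s3], [redshift]), the key names, the file paths, and the table name my_schema.my_table are illustrative assumptions, not the exact names used in this repository, and sql/create_table.sql stands in for whatever pre-written SQL script your pipeline reuses.

import configparser

import boto3
import psycopg2

# Credentials and connection details live in a key-value config file, never in
# the code itself. The section and key names below are assumptions.
config = configparser.ConfigParser()
config.read("reference/config.ini")
s3_cfg = config["s3"]
rs_cfg = config["redshift"]

# 1. Upload the transformed CSV to S3.
s3 = boto3.client(
    "s3",
    aws_access_key_id=s3_cfg["access_key_id"],
    aws_secret_access_key=s3_cfg["secret_access_key"],
)
s3.upload_file("reference/data_file.csv", s3_cfg["bucket"], "staging/data_file.csv")

# 2. Connect to Redshift and load the staged file with COPY instead of to_sql.
conn = psycopg2.connect(
    host=rs_cfg["host"],
    port=rs_cfg["port"],
    dbname=rs_cfg["dbname"],
    user=rs_cfg["user"],
    password=rs_cfg["password"],
)
try:
    with conn.cursor() as cur:
        # Pre-written DDL/permission scripts can be executed as-is from .sql files.
        with open("sql/create_table.sql") as f:
            cur.execute(f.read())

        copy_sql = f"""
            COPY my_schema.my_table
            FROM 's3://{s3_cfg['bucket']}/staging/data_file.csv'
            ACCESS_KEY_ID '{s3_cfg['access_key_id']}'
            SECRET_ACCESS_KEY '{s3_cfg['secret_access_key']}'
            FORMAT AS CSV
            IGNOREHEADER 1;
        """
        cur.execute(copy_sql)
    conn.commit()
finally:
    conn.close()

Because COPY runs inside the database and reads directly from S3, the upload speed no longer depends on row-by-row inserts from the client, which is where to_sql loses most of its time.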

Getting Started

  • Clone the repository:
git clone https://github.com/mt592/python-redshift-etl.git
  • Enter read-write credentials to your Redshift database and S3 bucket in /reference/config.ini.

  • Make changes to /src/extract_data.py and /src/transform_data.py to fit the needs of your pipeline (a hypothetical sketch of a transform step appears after these steps).

  • Run the pipeline:

python run_pipeline.py "./reference/data_file.csv"
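
As an illustration of the kind of change the transform step might involve, here is a hypothetical sketch of a transform function using pandas. The function name, column names, and cleanup rules are placeholders, not the actual contents of /src/transform_data.py.

# Hypothetical transform step -- placeholder logic, not the repo's actual code.
import pandas as pd

def transform_data(df: pd.DataFrame) -> pd.DataFrame:
    """Apply project-specific cleanup before the data is written out for COPY."""
    # Normalize column names so they are valid Redshift identifiers.
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    # Example cleanup: drop fully empty rows and parse a date column if present.
    df = df.dropna(how="all")
    if "order_date" in df.columns:
        df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    return df

# Example usage with the CSV passed to run_pipeline.py:
# df = transform_data(pd.read_csv("./reference/data_file.csv"))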

About

Lightweight solution to perform ETL using Python, S3 and Redshift.
