ETL Pipeline with API Trigger

This repository contains a simple ETL (Extract, Transform, Load) pipeline with an API trigger. The pipeline processes CSV files, derives features, and populates a PostgreSQL database table.

Introduction

This project demonstrates the creation of a basic ETL pipeline using Python, Flask, Docker, and PostgreSQL. The primary objective is to extract data from CSV files, perform feature transformation, and load the processed data into a PostgreSQL database. The ETL process can be initiated via an API endpoint.

Prerequisites

Python 3.7+
Docker
PostgreSQL

Setup

Clone the Repository:

git clone https://github.com/Nikhil9786/etl-app.git
cd etl-app

Install Dependencies:
```
pip install -r requirements.txt
```
Database Configuration:
- Assuming PostgreSQL is installed, I created a database that will store the processed data using psql.
```
CREATE DATABASE postgres_db;
```
- To securely interact with password, it is essential to create user password
```
CREATE USER user WITH PASSWORD 'passuser';
```
- Granting appropriate permissions to user
```
GRANT ALL PRIVILEGES ON DATABASE postgres_db TO user;
```
Code Flow:
- ETL Process(etl.py):
  - Data Extraction from CSV files
    - The pipeline ingests data from three CSV files: users.csv, user_experiments.csv, and compounds.csv.
    - These files contain information about users, their experiments, and the compounds involved.
  - Data Transformation:
    - The data from user_experiments.csv is transformed to derive features.
    - Features include:
    - Total experiments conducted by each user.
    - Average number of experiments per user.
    - The most commonly experimented compound.
  - Data Poupulation
    - Transformed data is loaded into a PostgreSQL database.
    - The average_experiments table stores user IDs and their average experiment counts.
    - The most_common_compound table stores the ID of the most commonly experimented compound.
- Flask API(app.py):
  
  The Flask API serves as the command center for the ETL pipeline. It provides a mechanism to initiate the ETL process and showcases the integration of API technology with the ETL workflow.
  - A custom API endpoint (/trigger_etl) is defined using the Flask framework.
  - An HTTP POST request to this endpoint triggers the ETL process.
- Dockerization:
  - Dockerize the application to ensure consistent deployment across different environments.
  - Utilize a Dockerfile to define the container environment and dependencies.
  - Use the Docker CLI to build and run the Docker container.
How to Run and Test:
- Build and run the Docker container by executing the following command:
```
chmod +x docker.sh
./docker.sh
```
- Initiate the ETL process by sending an HTTP POST request to the API endpoint. Use the following command:
```
chmod +x run_etl.sh
./run_etl.sh
```
- After the ETL process completes, you can query the populated database for the derived features. Run the following command:
```
chmod +x query_database.sh
./query_database.sh
```
- Created a shell script to run everything with one command:
```
chmod +x run_app.sh
./run_app.sh
```
Future Work
- Automated Testing: Develop comprehensive unit tests and integration tests to verify the correctness of each component in the ETL pipeline.
- Logging and Monitoring:
  - Implement robust logging mechanisms to track the ETL pipeline's progress and identify potential issues.
  - Utilize monitoring tools to gain insights into the pipeline's health and performance.

NOTE

Make sure the path for data files, names of database and user credentials, and ports are changed accordingly

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ETL Pipeline with API Trigger

Introduction

Prerequisites

Setup

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 43 Commits
data		data
Dockerfile		Dockerfile
README.md		README.md
app.py		app.py
docker.sh		docker.sh
etl.py		etl.py
query_database.sh		query_database.sh
requirements.txt		requirements.txt
run_app.sh		run_app.sh
run_etl.sh		run_etl.sh

Nikhil9786/etl-app

Folders and files

Latest commit

History

Repository files navigation

ETL Pipeline with API Trigger

Introduction

Prerequisites

Setup

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages