Extensive MLOps

image


Overview

This repository implements all the sessions covered in the EMLO V3.0 course (see the Course Syllabus).

Main Technologies used

  • PyTorch Lightning - a lightweight PyTorch wrapper for high-performance AI research. Think of it as a framework for organizing your PyTorch code.

  • Docker - an open platform for developing, shipping, and running applications. Docker enables you to separate your applications from your infrastructure so you can deliver software quickly

  • Hydra - a framework for elegantly configuring complex applications. The key feature is the ability to dynamically create a hierarchical configuration by composition and override it through config files and the command line.

  • DVC - a command line tool to help you develop reproducible machine learning projects by versioning your data and models.

  • MLFlow - a platform to streamline machine learning development, including tracking experiments, packaging code into reproducible runs, and sharing and deploying models

  • Optuna - an automatic hyperparameter optimization software framework, particularly designed for machine learning

  • TorchScript - a way to create serializable and optimizable models from PyTorch code.

  • TorchTrace - a way to trace a function and return an executable or ScriptFunction that will be optimized using just-in-time compilation.

  • Gradio - an open-source Python library that is used to quickly build machine learning and data science demos and web applications.

  • FastAPI - a modern, fast (high-performance), web framework for building APIs with Python 3.7+ based on standard Python type hints.

  • Locust - an easy to use, scriptable and scalable performance testing tool.

  • AWS ECS - a highly scalable, fast, container management service that makes it easy to run, stop, and manage Docker containers on a cluster of Amazon EC2 instances.

  • AWS Fargate - a technology that you can use with Amazon ECS to run containers without having to manage servers or clusters of Amazon EC2 instances.

  • AWS ECR - a managed AWS Docker registry service that can be used with ECS.

  • AWS Lambda - a compute service that lets you run code without provisioning or managing servers.

  • TorchServe - a flexible and easy to use tool for serving and scaling PyTorch models in production.

  • Onnx - an open source format for AI models. Widely supported and can be found in many frameworks, tools, and hardware

  • WASM - WebAssembly (abbreviated Wasm) is a binary instruction format for compiling and executing code in a client-side web browser

  • Captum - A model interpretability and understanding library for PyTorch

  • Shap - SHAP (SHapley Additive exPlanations) is a game theoretic approach to explain the output of any machine learning model


How to Run Locally

Installation

Pip

# clone project
git clone https://github.com/RSWAIN1486/emlov3-pytorchlightning-hydra.git
cd emlov3-pytorchlightning-hydra

# [OPTIONAL] create conda environment
conda create -n myenv python=3.9
conda activate myenv

# install pytorch according to instructions
# https://pytorch.org/get-started/

# install requirements
pip install -r requirements.txt

Dev Mode

pip install -e .

Train Model with default/cpu configuration

# train on CPU
python src/train.py trainer=cpu
python src/eval.py

# You can override any parameter from command line like this
python src/train.py trainer.max_epochs=20 data.batch_size=64
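
Hydra makes these command-line overrides possible by composing a config tree before training starts. A minimal sketch of the entry-point pattern used by src/train.py (the config path and name are assumptions about this repo's layout):

import hydra
from omegaconf import DictConfig, OmegaConf

# Minimal sketch of a Hydra entry point; config_path/config_name are assumptions.
@hydra.main(version_base="1.3", config_path="configs", config_name="train")
def main(cfg: DictConfig) -> None:
    # Overrides such as trainer.max_epochs=20 or data.batch_size=64 land in cfg here.
    print(OmegaConf.to_yaml(cfg))

if __name__ == "__main__":
    main()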

Run Training and Evaluation with Docker on CIFAR-10 using TIMM models, PyTorch Lightning and Hydra

# Build Docker on local
docker build -t emlov3-pytorchlightning-hydra .
# or pull from Docker hub
docker pull rswain1486/emlov3-pytorchlightning-hydra:latest

# Since the checkpoint will not be persisted between container runs if train and eval are run separately, use the command below to run them together.
docker run rswain1486/emlov3-pytorchlightning-hydra sh -c "python3 src/train.py && python3 src/eval.py"

# Using a volume, you can mount the checkpoint directory to the host and run train and eval separately.
docker run --rm -t -v $(pwd)/ckpt:/workspace/ckpt rswain1486/emlov3-pytorchlightning-hydra python src/train.py
docker run --rm -t -v $(pwd)/ckpt:/workspace/ckpt rswain1486/emlov3-pytorchlightning-hydra python src/eval.py

Post evaluation, you should see test metrics as below :
image

How to Push and Pull Data and Models using DVC

# Track and update your data by creating or updating data.dvc file.
dvc add data

# To push to Google Drive, create a folder under Google Drive and add it as a remote locally using the folder id.
dvc remote add --default gdrive gdrive://1WcXEK-HjdaQ-xZp6NOnSUqprPGhijFeE

# Push latest data to dvc source - google drive using
dvc push -r gdrive

# Pull data tracked by dvc from source - google drive using
dvc pull -r gdrive

# To switch between versions of code and data run
git checkout master
dvc checkout

# To automate dvc checkout every time a git checkout is done, run
dvc install
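
Besides the CLI, DVC-tracked files can also be read directly from Python. A minimal sketch using the dvc.api module (the file path below is a hypothetical example, not an actual file in this repo):

import dvc.api

# Open a DVC-tracked file through the configured "gdrive" remote.
# Substitute any file tracked under data/.
with dvc.api.open("data/sample.txt", repo=".", remote="gdrive") as f:
    print(f.read()[:200])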

Run Training and Inference on the Kaggle Cats and Dogs dataset using a ViT (Vision Transformer) model

# Training
# If installed using dev mode, run training with experiment/cat_dog.yaml using
src_train experiment=cat_dog trainer.max_epochs=1 datamodule.batch_size=64 datamodule.num_workers=0

# If installed using requirements.txt, use
python src/train.py experiment=cat_dog trainer.max_epochs=1 datamodule.batch_size=64 datamodule.num_workers=0

# Inference
# If installed using dev mode, run infer with experiment/cat_dog_infer.yaml using
src_infer experiment=cat_dog_infer test_path=./data/PetImages_split/test/Cat/18.jpg

# If installed using requirements.txt, use
python src/infer.py experiment=cat_dog_infer test_path=./data/PetImages_split/test/Cat/18.jpg
Predictions for the top k classes (here k = 2) should show as below:

image
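
For reference, the top-k scores shown above come from a softmax over the model logits followed by torch.topk; a minimal sketch with illustrative values:

import torch

labels = ["Cat", "Dog"]
logits = torch.tensor([[2.3, -1.1]])              # illustrative model output
probs = torch.softmax(logits, dim=-1)
top_p, top_i = probs.topk(k=2, dim=-1)
print({labels[int(i)]: round(float(p), 4) for p, i in zip(top_p[0], top_i[0])})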

Train using Hydra multirun with Joblib launcher and View Runs in MLflow

# Build Docker on local
docker build -t lightning-hydra-multiexperiments .
# or pull from Docker hub
docker pull rswain1486/lightning-hydra-experimenttracking:latest

# Run below command to start the patch size experiment using hydra joblib launcher.
# NOTE: Make sure to add port mapping from container to host if you would like to view MLflow Logger UI during runtime
# NOTE: Make sure to add volume mapping of local host directory to container workspace directory to save logs, models on local for dvc tracking.
docker run -it --expose 5000 -p 5000:5000 -v $(pwd):/workspace --name mlflow-container lightning-hydra-experimenttracking:latest \
src_train -m hydra/launcher=joblib hydra.launcher.n_jobs=5 experiment=cifar10 model.patch_size=1,2,4,8,16 datamodule.num_workers=0

# Run the command below to start the MLflow server inside the container, then open http://localhost:5000 in your browser
docker exec -it -w /workspace/logs/mlflow mlflow-container mlflow ui --host 0.0.0.0

# After the container has exited, you can start the MLflow server locally and open http://localhost:5000 in your browser
cd ./logs/mlflow
mlflow ui

# Add dvc tracking to data, logs and models. (Models are saved under logs for mlflow)
dvc add data
dvc add logs
dvc config core.autostage true

git add data.dvc
git add logs.dvc

# To push to google drive, refer to the section - How to push and pull data using DVC
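
Inside the training code, the runs above are logged through Lightning's MLflow logger. A minimal sketch of how it is wired up, assuming Lightning 2.x imports (the experiment name is an assumption, and in this repo the logger is instantiated from the Hydra config rather than by hand):

from lightning.pytorch import Trainer
from lightning.pytorch.loggers import MLFlowLogger

# Point MLflow at the local logs/mlflow directory used by the commands above.
mlf_logger = MLFlowLogger(experiment_name="cifar10", save_dir="./logs/mlflow")
trainer = Trainer(max_epochs=1, logger=mlf_logger)
# trainer.fit(model, datamodule)  # model and datamodule come from the Hydra config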
View multirun experiments in MLflow

image

Scatter plot of patch_size vs val/acc in MLflow

image

Single run directory structure under logs/mlflow saving models, metrics, metadata etc.

image

HyperParameter Optimization using Optuna and Hydra Multirun

Open In Colab

# Find the Best Learning Rate and Batch size using Lightning Tuner
src_train -m tuner=True train=False test=False datamodule.num_workers=2 experiment=harrypotter

# Run Hyperparameter Search using Optuna using Hydra config file
src_train -m test=False datamodule.num_workers=2 experiment=harrypotter hparams_search=harrypotter_optuna

# Load MLFlow logger UI to compare HyperParameter Experiments
cd logs/mlflow
mlflow ui

# Run Training for n epochs with the Best Hyperparameters
src_train -m tuner=True test=False trainer.max_epochs=10 datamodule.num_workers=2 experiment=harrypotter datamodule.block_size=8 \
model.block_size=8 model.net.block_size=8 model.net.n_embed=256 model.net.n_heads=8 model.net.drop_p=0.15 model.net.n_decoder_blocks=4
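
The hparams_search config delegates to Hydra's Optuna sweeper; conceptually it runs a study equivalent to the plain-Optuna sketch below (the parameter names mirror the GPT config above, and the objective is a stand-in for the real training run):

import optuna

def objective(trial: optuna.Trial) -> float:
    n_embed = trial.suggest_categorical("model.net.n_embed", [128, 256, 512])
    n_heads = trial.suggest_categorical("model.net.n_heads", [4, 8])
    drop_p = trial.suggest_float("model.net.drop_p", 0.1, 0.3)
    # In the real sweep each trial launches src_train and returns the validation loss.
    return drop_p + 1.0 / n_embed + 0.01 * n_heads  # dummy objective for illustration

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=10)
print(study.best_params)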
Scatter plot of different HyperParameters across Experiments in MLflow

image

Generate text using GPT for Harry Potter using best hyperparams

image

Gradio Demo with TorchScript model

# Install in dev mode
pip install -e .

# Train the ViT model on CIFAR-10 and save it as a TorchScript model. Set save_torchscript to True in configs/train.yaml
src_train experiment=cifar10_jit save_torchscript=True

# Infer on a test image using Torchscript model
src_infer_jit_script_vit test_path=./test/0000.jpg

# Launch Gradio Demo for Cifar10 at port 8080 and open http://localhost:8080/
# NOTE: Set the ckpt_path and labels_path in configs/infer_jit_script_vit.yaml
src_demo_jit_script_vit

# Build and Launch the Gradio Demo using Docker. This should launch the demo at http://localhost:8080/. Ensure the port is exposed in docker-compose / DockerFile.demo
docker compose -f docker-compose.yml up --build demo_cifar_gradio

# Launch Gradio demo by pulling from Dockerhub
docker run -p 8080:8080 rswain1486/gradio-cifar10-demo:latest

# To stop the demo if Ctrl + C does not work, use
docker stop $(docker ps -aq)
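
The demo itself is only a few lines of Gradio around the scripted model. A minimal sketch (the checkpoint path, input size and preprocessing are assumptions; the real values live in configs/infer_jit_script_vit.yaml):

import torch
import torchvision.transforms as T
import gradio as gr

# CIFAR-10 class names; the checkpoint path below is an assumption.
LABELS = ["airplane", "automobile", "bird", "cat", "deer",
          "dog", "frog", "horse", "ship", "truck"]
model = torch.jit.load("ckpt/model.script.pt").eval()
preprocess = T.Compose([T.Resize((32, 32)), T.ToTensor()])

def predict(img):
    x = preprocess(img).unsqueeze(0)
    probs = torch.softmax(model(x), dim=-1)[0]
    return {LABELS[i]: float(probs[i]) for i in range(len(LABELS))}

gr.Interface(fn=predict, inputs=gr.Image(type="pil"),
             outputs=gr.Label(num_top_classes=10)).launch(server_port=8080)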

Gradio UI for Cifar10

image

Gradio Demo with Torch Trace model

Open In Colab

# Install in dev mode
pip install -e .

# Train the GPT model on the HarryPotter dataset and save it as a Torch traced model. Set save_torchtrace to True in configs/train.yaml
src_train -m experiment=harrypotter_jit.yaml test=False trainer.max_epochs=20 trainer.accelerator=gpu save_torchtrace=True paths.ckpt_jittrace_save_path=ckpt/gpt_torch_traced.pt

# Generate text using Torch traced model
src_infer_jit_trace_gpt ckpt_path=ckpt/gpt_torch_traced.pt input_txt='Avada Kedavra'

# Launch Gradio demo
src_demo_jit_trace_gpt ckpt_path=ckpt/gpt_torch_traced.pt
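
For reference, tracing records the operations executed on an example input and saves them as a standalone module. A minimal, self-contained sketch (the tiny model below is only a stand-in for the GPT used in this repo):

import torch
import torch.nn as nn

class TinyLM(nn.Module):  # stand-in for the GPT model
    def __init__(self, vocab_size: int = 100, n_embed: int = 32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, n_embed)
        self.head = nn.Linear(n_embed, vocab_size)

    def forward(self, idx):
        return self.head(self.emb(idx))

model = TinyLM().eval()
example = torch.randint(0, 100, (1, 8))     # example token ids used for tracing
traced = torch.jit.trace(model, example)    # records the ops run on this input
traced.save("gpt_torch_traced.pt")          # in this repo the file lives under ckpt/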

Gradio UI for generating Harry Potter text using GPT

image

Gradio Demo with GPT Traced model

# Build and Launch Gradio Demo using Docker. This should launch demo at http://localhost:80/
docker compose  -f docker-compose.yml up --build demo_gpt_gradio

# Test using
python3 src/gradio/test_demo_jit_script_gpt.py

# If AWS is configured, push the model to S3 using the command below. Set bucket_name and model_file_path accordingly.
python3 src/aws/push_model_S3.py

# To push your Docker image to ECR, run the commands below
aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin <ecr repo uri>
docker build -t <repo-name> .
docker tag <repo-name>:latest <ecr repo uri>/<repo-name>:latest
docker push <ecr repo uri>/<repo-name>:latest

FastAPI Demo with Docker

# Build and launch FastAPI using Docker. This should launch demo at http://localhost:8080/docs or http://<ec2-public-ip>/docs
# for GPT
docker-compose  -f docker-compose.yml up --build demo_gpt_fastapi

# for VIT
docker-compose  -f docker-compose.yml up --build demo_vit_fastapi

# To generate a log file with individual and average response times for 100 API requests:
# for GPT. Set the server url and log file path accordingly.
python3 src/fastapi/gpt/test_api_calls_gpt.py

# for VIT. Set the server url, input image file and log file path accordingly.
python3 src/fastapi/vit/test_api_calls_vit.py
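
Each demo service is a small FastAPI app around the exported model. A minimal sketch of an image-classification endpoint in the spirit of the ViT service (checkpoint path, preprocessing and response schema are assumptions; file uploads also require the python-multipart package):

import io
import torch
import torchvision.transforms as T
from fastapi import FastAPI, File, UploadFile
from PIL import Image

app = FastAPI()
model = torch.jit.load("ckpt/model.script.pt").eval()   # path is an assumption
preprocess = T.Compose([T.Resize((32, 32)), T.ToTensor()])

@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    img = Image.open(io.BytesIO(await file.read())).convert("RGB")
    probs = torch.softmax(model(preprocess(img).unsqueeze(0)), dim=-1)[0]
    return {"class_id": int(probs.argmax()), "confidence": float(probs.max())}

# Run with: uvicorn main:app --host 0.0.0.0 --port 8080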

Average response time for GPT

image

Average response time for VIT

image

CPU usage with 2 workers for GPT

image

Deploy CLIP with Docker and FastAPI on ECS Fargate and Stress Test with Locust

# Build and launch CLIP using FastAPI. This should launch demo at http://localhost:80/ or http://<ec2-public-ip>:80/docs

docker-compose  -f docker-compose.yml up --build demo_clip_fastapi

# If deployed using docker image on AWS ECS Fargate using load balancer, it should launch at http://<DNS-of-load-balancer>/docs

# To start the Locust server and start swarming (a minimal locustfile sketch follows this block). By default, the server should start at http://localhost:8089/ or http://<ec2-public-ip>:8089/
python3 src/clip/locust_stress_test_clip.py

# Frontend
# To install nvm
curl -o- https://github.com/raw/nvm-sh/nvm/v0.39.4/install.sh | bash
nvm install 16
nvm use 16
cd src/clip/clip-frontend
npx create-next-app@latest clip-frontend

# To start the clip front end, edit the page.tsx under clip-frontend/app accordingly and run
npm run dev
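
locust_stress_test_clip.py wraps a standard Locust user class; a minimal sketch of what such a locustfile looks like (the endpoint name and payload format are assumptions about the CLIP FastAPI app):

from locust import HttpUser, task, between

class ClipUser(HttpUser):
    wait_time = between(0.5, 2)   # seconds between requests per simulated user

    @task
    def classify(self):
        # Hypothetical endpoint/payload; adjust to match the deployed CLIP API.
        with open("test/0000.jpg", "rb") as f:
            self.client.post("/predict", files={"file": f})

# Run with: locust -f locustfile.py --host http://<DNS-of-load-balancer>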

Locust Stress Test with CLIP deployed on ECS Fargate

image

CLIP Deployed on frontend

image

Deploy ImageNet Classifier on AWS Lambda

# Build and launch the ImageNet Classifier using FastAPI with the Mangum wrapper (a minimal Mangum sketch follows this block). This should launch the demo at http://localhost:8080/ or http://<ec2-public-ip>:8080/docs

docker-compose  -f docker-compose.yml up --build demo_lambda_fastapi

# Push the Docker image to a private AWS ECR repository, as Lambda only supports private repositories
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <ecr repo uri>
docker build -t <repo-name> .
docker tag <repo-name>:latest <ecr repo uri>/<repo-name>:latest
docker push <ecr repo uri>/<repo-name>:latest

# Create API endpoint in AWS Lambda. Should be of the format : https://{restapi_id}.execute-api.{region}.amazonaws.com/{stage_name}/

# Frontend - Create a new github repo for the front end for Vercel deployment.
# To install nvm
curl -o- https://github.com/raw/nvm-sh/nvm/v0.39.4/install.sh | bash
nvm install 16
nvm use 16
cd src/aws/lambda/lambda-frontend
npx create-next-app@latest lambda-frontend

# To start the Lambda front end, edit the page.tsx, layout.tsx, tailwind.config.ts files accordingly and run
npm run dev
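
On Lambda, the FastAPI app is wrapped with Mangum so API Gateway events are translated into ASGI requests. A minimal sketch (the route is illustrative, not the actual classifier endpoint):

from fastapi import FastAPI
from mangum import Mangum

app = FastAPI()

@app.get("/health")
def health():
    return {"status": "ok"}

# Lambda entry point, configured as <module_name>.handler in the function settings.
handler = Mangum(app)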

Web Frontend : Vercel app

ImageNet Classifier Deployed on Vercel with AWS Lambda

image

Deploy Stable Diffusion using TorchServe

# Navigate to TorchServe
cd src/torchserve_sdxl

# Download the SDXL model and its artifacts to local. It will be downloaded to a folder named sdxl-1.0-model
python3 download_model.py

# Zip the model artifacts
zip -0 -r ./sdxl-1.0-model.zip sdxl-1.0-model/*

# To create the MAR file (sdxl.mar) 
docker run -it --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v `pwd`:/opt/src pytorch/torchserve:0.8.1-gpu torchserve bash
torch-model-archiver --model-name sdxl --version 1.0 --handler sdxl_handler.py --extra-files sdxl-1.0-model.zip -r requirements.txt

# Once .mar file has been created, install nvm
curl -o- https://github.com/raw/nvm-sh/nvm/v0.39.4/install.sh | bash
nvm install 16
nvm use 16

# Now run docker compose, to start TorchServe, FastAPI and Frontend server packaged together.
# FastAPI server will run at http://<ec2-public-ip>:9080 and frontend server at http://<ec2-public-ip>:3000
docker compose up
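
Once TorchServe is up, the model can also be queried directly through its inference API, which the FastAPI server above wraps. A minimal client sketch (assumes TorchServe's default inference port 8080 is reachable and that the sdxl handler accepts a plain-text prompt and returns image bytes; the exact request/response format may differ):

import requests

# Hypothetical prompt; adjust to match sdxl_handler.py.
resp = requests.post(
    "http://localhost:8080/predictions/sdxl",
    data="a castle on a hill at sunset",
    timeout=300,
)
with open("sdxl_output.png", "wb") as f:
    f.write(resp.content)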

Stable Diffusion XL Frontend

image

Stable Diffusion Logs during Inference

image

Deploy YoloV8 on browser

# Navigate to the YoloV8 ONNX browser directory
cd src/yolo_onnx_browser

# Download the YoloV8s model and convert it to ONNX format (a minimal export sketch follows this block)
python3 yolo_to_onnx.py

# Run below to create the frontend
curl -o- https://github.com/raw/nvm-sh/nvm/v0.39.4/install.sh | bash
nvm install 16
nvm use 16
npx create-next-app@latest frontend

# Create a model folder under frontend/public
mkdir frontend/public/model

# Copy the required yolo models to model folder
cp nms-yolov8.onnx yolov8s.onnx frontend/public/model

# To start the Yolov8 front end, edit the page.tsx, layout.tsx, tailwind.config.ts files accordingly and run
npm run dev
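
For reference, yolo_to_onnx.py boils down to the Ultralytics export API; a minimal sketch (the script may additionally produce the separate nms-yolov8.onnx model):

from ultralytics import YOLO

model = YOLO("yolov8s.pt")        # downloads the pretrained YOLOv8s weights
model.export(format="onnx")       # writes yolov8s.onnx next to the weights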

Git Repository : YoloV8 on browser

Web Frontend : Vercel app

YoloV8 on browser

image

Model Explainability in CV and NLP

# Navigate to explainable ai
cd src/explainable_ai

# For NLP, install the required libraries.
pip install --quiet transformers shap sentencepiece datasets einops accelerate

# Run the explainable_NLP.ipynb and store NLP output screenshots under output/nlp

# Install the required libraries for CV
pip install timm shap grad-cam captum

# Add input images under src/explainable_ai/input and run explain_ai.py. It will create an explainability.md
python3 explain_ai.py
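
explain_ai.py builds on Captum, SHAP and Grad-CAM; a minimal Captum sketch for a timm classifier (the model and target class are illustrative and may differ from what the script actually uses):

import timm
import torch
from captum.attr import IntegratedGradients

model = timm.create_model("resnet18", pretrained=True).eval()
ig = IntegratedGradients(model)

x = torch.rand(1, 3, 224, 224, requires_grad=True)   # stand-in input image
attributions = ig.attribute(x, target=281)           # 281 = "tabby cat" in ImageNet
print(attributions.shape)                            # per-pixel attribution map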

Explainable NLP Notebook : Open In Colab

Explainability Readme : explainability.md

Explainable AI - CV

image

Explainable AI - NLP

image