This pattern describes a serverless ETL pipeline that validates and transforms a CSV dataset. The pipeline is orchestrated by AWS Step Functions, with retries and end-user notifications. When a CSV file is uploaded to the source folder of an Amazon S3 (Simple Storage Service) bucket, the ETL pipeline is triggered. The pipeline validates the CSV file and transforms the content into the curated data layer, stage by stage.
- A user uploads a CSV file. An S3 notification event triggers an AWS Lambda function.
- The Lambda function starts the Step Functions state machine.
- A Lambda function validates the raw file.
- An AWS Glue job reads the raw file, loads the data into the stage table, and archives the file.
- An AWS Glue job transforms the stage table data and loads it into the target table.
- Amazon SNS sends a success notification.
- If validation fails, the file is moved to the error folder.
- Amazon SNS sends an error notification for any error inside the workflow.
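The first two steps above can be sketched as a single Lambda handler. This is a minimal illustration, not the repo's actual code: the event parsing follows the standard S3 notification format, while the `STATE_MACHINE_ARN` environment variable and the input payload shape are assumptions you would replace with your own values.

```python
import json
import os


def build_execution_input(event):
    """Extract the bucket and key from the first S3 record of a standard
    S3 ObjectCreated notification event (assumed payload shape)."""
    record = event["Records"][0]["s3"]
    return {
        "bucket": record["bucket"]["name"],
        "key": record["object"]["key"],
    }


def lambda_handler(event, context):
    """Triggered by the S3 notification; starts the state machine."""
    import boto3  # available in the AWS Lambda runtime

    payload = build_execution_input(event)
    sfn = boto3.client("stepfunctions")
    response = sfn.start_execution(
        stateMachineArn=os.environ["STATE_MACHINE_ARN"],  # assumed env var
        input=json.dumps(payload),
    )
    return {"executionArn": response["executionArn"]}
```

The bucket and key are passed into the execution input so that downstream states (validation, Glue jobs, archival) all operate on the same uploaded object.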
- Create dedicated directories in S3 for the file movement.
- Create the associated IAM roles that allow the pipeline's tasks to be performed.
- Replace the parameters with values appropriate for your environment.
- Deploy the state machine and its corresponding functions.
- Place the file in the source path and let the pipeline curate your data.
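For the validation step, a minimal sketch might look like the following, assuming validation means checking the header and the column count of each row. The expected column names here are purely illustrative; the real schema would come from your dataset.

```python
import csv
import io

# Illustrative schema assumption; substitute your dataset's real columns.
EXPECTED_COLUMNS = ["id", "name", "amount"]


def validate_csv(body: str) -> dict:
    """Return a status dict that a Step Functions Choice state could
    branch on to route the file to the stage load or the error folder."""
    reader = csv.reader(io.StringIO(body))
    try:
        header = next(reader)
    except StopIteration:
        return {"status": "INVALID", "reason": "empty file"}
    if header != EXPECTED_COLUMNS:
        return {"status": "INVALID", "reason": f"unexpected header: {header}"}
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(EXPECTED_COLUMNS):
            return {"status": "INVALID",
                    "reason": f"row {line_no} has {len(row)} columns"}
    return {"status": "VALID"}
```

Returning a structured status rather than raising lets the state machine itself decide whether to move the file to the error folder and send the SNS error notification.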
- "an open scalable pipeline that process data": you're in a good mood, and successful SNS alert if it actually works for you. Angels sing,and all of a sudden you feel like a promising Data Engineer.
- "goddamn idiotic truckload of sh*t": when it breaks
- Please open an issue if you find any bugs.
PRANAUV SHANMUGANATHAN