s3-to-gcs-streaming

Data pipeline that streams a file from an S3 bucket to Google Cloud Storage using AWS Lambda whenever the file is uploaded to S3

Description

The goal is a pipeline that transfers a file from S3 to Google Cloud Storage whenever the file is uploaded to S3. There are some great tools for this, but they either do not support transferring a file on an S3 upload event, are third-party tools, or are big data tools that my organization was reluctant to use. Therefore, I developed a streaming application that downloads the contents of the file in the S3 bucket in chunks, sized according to the memory of the Lambda function, uploads those chunks to GCS, and repeats this process until the file has been completely copied from S3 to Google Cloud Storage. The Node.js stream library is used for this purpose.
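As a rough illustration of this approach, the sketch below pipes the S3 object's read stream into a GCS write stream inside the Lambda handler. It is not the repository's actual code: the environment variable names and credential handling are assumptions, and in practice the GCP credentials would come from AWS Secrets Manager as described under Pre-requisites.

```javascript
// handler.js -- illustrative sketch of streaming an uploaded S3 object to GCS.
const AWS = require("aws-sdk");
const { Storage } = require("@google-cloud/storage");
const { pipeline } = require("stream/promises");

const s3 = new AWS.S3();

module.exports.handler = async (event) => {
  // One record per object in the S3 "ObjectCreated" event.
  for (const record of event.Records) {
    const srcBucket = record.s3.bucket.name;
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, " "));

    // Credentials and bucket names taken from environment variables (placeholder names).
    const gcs = new Storage({
      projectId: process.env.PROJECT_ID,
      credentials: {
        client_email: process.env.GCP_CLIENT_EMAIL,
        private_key: process.env.GCP_PRIVATE_KEY,
      },
    });

    // Pipe the object through in chunks so memory use stays bounded
    // by the Lambda allocation rather than the file size.
    const source = s3.getObject({ Bucket: srcBucket, Key: key }).createReadStream();
    const destination = gcs.bucket(process.env.GCS_BUCKET).file(key).createWriteStream();

    await pipeline(source, destination);
    console.log(`Copied s3://${srcBucket}/${key} to gs://${process.env.GCS_BUCKET}/${key}`);
  }
};
```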

Architecture

[Architecture diagram]

Pre-requisites

  • Create a destination bucket in Google Cloud Storage
  • Create a service account with write access to Google Cloud Storage
  • Install Serverless (see References for installation instructions)
  • Save the private_key and client_email of the GCP service account in AWS Secrets Manager (see the sketch after this list for how the Lambda can read them back)
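A minimal sketch of reading those credentials back inside the Lambda, assuming the secret value is a JSON string with private_key and client_email fields (the function name and parsing are illustrative):

```javascript
// Sketch only: load the GCP service-account credentials from AWS Secrets Manager.
const AWS = require("aws-sdk");

async function loadGcpCredentials(secretName) {
  const secretsManager = new AWS.SecretsManager();
  const result = await secretsManager
    .getSecretValue({ SecretId: secretName })
    .promise();
  // Assumes the secret was stored as JSON: {"private_key": "...", "client_email": "..."}
  const { private_key, client_email } = JSON.parse(result.SecretString);
  return { private_key, client_email };
}
```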

⚙ How to run it

  1. Run npm install.
  2. Replace the following parameters in the serverless.yml file (an illustrative sketch appears after this list):
    • gcsBucket: Destination bucket in Google Cloud Storage.
    • role: IAM role to be associated with the Lambda function.
    • S3SourceBucket: Source S3 bucket (check the Serverless documentation if the bucket already exists).
    • projectId: Project ID of the GCP project.
    • secretName: Name of the AWS Secrets Manager secret that stores the service account details.
    • Optional: Replace other parameters such as the service name, function name, and environment variables as required.
  3. Run sls deploy.
  4. Test the code with sls invoke -f functionName --logs.
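For orientation, here is a sketch of how these parameters might sit in serverless.yml; the names and values are placeholders, not the repository's actual configuration:

```yaml
# Illustrative serverless.yml sketch -- placeholder names and values.
service: s3-to-gcs-streaming

provider:
  name: aws
  runtime: nodejs14.x
  role: arn:aws:iam::123456789012:role/lambda-s3-to-gcs-role   # IAM role for the Lambda
  environment:
    gcsBucket: my-destination-gcs-bucket
    projectId: my-gcp-project-id
    secretName: my-gcp-service-account-secret

functions:
  streamToGcs:
    handler: handler.handler
    events:
      - s3:
          bucket: my-source-s3-bucket     # S3SourceBucket
          event: s3:ObjectCreated:*
          existing: true                  # needed if the bucket already exists
```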

Runtime metrics of the Lambda function

The following table shows how long the Lambda function took to run when a file of the given size was uploaded to S3 and the Lambda function was allocated the given amount of memory.

File Size   Memory    Run time duration (ms)
500 MB      128 MB    80500
500 MB      256 MB    41200
500 MB      512 MB    20800
500 MB      1024 MB   12200

Possible Alternatives

References

  • Serverless Documentation
  • Node.js stream library
