Engineer streaming processing data pipeline on Azure with the main purpose to ingest and process tweets and satellite images data from Hurricane Harvey natural disaster, and serve Power BI.

baky0905/hurricane-data-engineering

⚙️ Data Engineering Project ⚙️

🌪️🌪️ Hurricane Harvey Tweets and Satellite Images - Azure Data Pipelines and Data Visualization 🌪️🌪️

Introduction & Goals

Main goals:

Engineer a streaming data pipeline on Azure that ingests and processes tweets and satellite images from the Hurricane Harvey natural disaster and serves a Power BI report.

This is a meta repository that contains documentation and links to several GitHub repositories, each with a distinct purpose:

  1. hurricane-proc-send-data. Pre-processes tweets about the Hurricane Harvey events, combines them with satellite images of buildings with and without damage, and simulates a streaming data source with a Python program that sends requests to an Azure API endpoint (#TODO fire CLI)

  2. Data Pipeline:

    2.1) hurricane-streaming-az-funcs Azure streaming data pipeline that:

    • Ingests tweets from the local source client via Azure API Management, with an Azure Function as the backend
    • Utilizes Azure Event Hub as a message queue service
    • Runs an Azure Function that takes messages from Azure Event Hub and writes them to Azure Cosmos DB
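
The client side of item 1 could look roughly like the sketch below. It is a minimal illustration, not the code from the linked repository: the function names, the endpoint URL, and the message fields are hypothetical. `Ocp-Apim-Subscription-Key` is the default header API Management uses for subscription keys. Only the standard library is used, and building a request is kept separate from sending it so the request can be inspected without a network call.

```python
import json
import time
import urllib.request
from typing import Callable, Iterable, Iterator


def build_request(url: str, message: dict, subscription_key: str) -> urllib.request.Request:
    """Build (but do not send) an HTTP POST that carries one message as JSON."""
    body = json.dumps(message).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Default APIM subscription-key header; the key value is a placeholder.
        "Ocp-Apim-Subscription-Key": subscription_key,
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def stream_messages(
    messages: Iterable[dict],
    url: str,
    subscription_key: str,
    delay_seconds: float = 1.0,
    send: Callable = urllib.request.urlopen,
) -> Iterator[int]:
    """Send messages one at a time, pausing between them to mimic a stream.

    Yields the HTTP status code of each response.
    """
    for message in messages:
        request = build_request(url, message, subscription_key)
        with send(request) as response:
            yield response.status
        time.sleep(delay_seconds)
```

Passing `send` as a parameter is a design choice made for this sketch: it lets the loop be exercised with a stand-in instead of a live API Management endpoint.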

Tools:

The Data Sets

  1. Hurricane Harvey Tweets from Kaggle.

Tweets containing Hurricane Harvey from the morning of 8/25/2017. I hope to keep this updated if computer problems do not persist.

*8/30 Update: This update includes the most recent tweets tagged "Tropical Storm Harvey", which span from 8/20 to 8/30, as well as the properly merged version of the dataset, including tweets from before Harvey was downgraded back to a tropical storm.

  2. Satellite Images of Hurricane Damage from Kaggle.

Overview: The data are satellite images from Texas after Hurricane Harvey divided into two groups (damage and no_damage). The goal is to make a model which can automatically identify if a given region is likely to contain flooding damage.

Source: Data originally taken from https://ieee-dataport.org/open-access/detecting-damaged-buildings-post-hurricane-satellite-imagery-based-customized and can be cited with http://dx.doi.org/10.21227/sdad-1e56. The original paper is here: https://arxiv.org/abs/1807.01688

Used Tools

Connect

  • Azure API Management

    API Management (APIM) is a way to create consistent and modern API gateways for existing back-end services.

Buffer

  • Azure Event Hubs

    Azure Event Hubs is a big data streaming platform and event ingestion service. It can receive and process millions of events per second. Data sent to an event hub can be transformed and stored by using any real-time analytics provider or batching/storage adapters.

Processing

  • Azure Function

    Azure Functions is a serverless solution that allows you to write less code, maintain less infrastructure, and save on costs. Instead of worrying about deploying and maintaining servers, the cloud infrastructure provides all the up-to-date resources needed to keep your applications running.

Storage

  • Azure Blob Storage

    Azure Blob storage is Microsoft's object storage solution for the cloud. Blob storage is optimized for storing massive amounts of unstructured data. Unstructured data is data that doesn't adhere to a particular data model or definition, such as text or binary data.

  • Azure Cosmos DB - SQL Core - Document Store

    Azure Cosmos DB is a fully managed NoSQL database for modern app development. Single-digit millisecond response times, and automatic and instant scalability, guarantee speed at any scale. Business continuity is assured with SLA-backed availability and enterprise-grade security.

Visualization

Pipeline

The upcoming posts will consist of writing about:

  • Python functions and modules that process the data from Kaggle and combine tweets and satellite images into a single file that acts as the source of streaming data, plus a Python program that sends requests to an Azure API endpoint.

  • Azure streaming data pipeline that:

    • Ingests tweets from the local source client via Azure API management having an Azure Function as a backend.

    • Utilizes Azure Event Hub as a message queue service.

    • Runs an Azure Function that takes messages from Azure Event Hub and writes them to Azure Cosmos DB.
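
The last step of the list above, moving a message from Event Hub into Cosmos DB, mostly comes down to decoding the event body and shaping it as a document. Below is a minimal sketch of that transformation only, not the function from the linked repository: the trigger and output binding wiring is left out, and `to_cosmos_document` and the message fields are hypothetical. The one firm requirement it reflects is that a Cosmos DB item carries a string `id`.

```python
import json
import uuid


def to_cosmos_document(event_body: bytes) -> dict:
    """Decode one Event Hub message body and shape it as a Cosmos DB item."""
    payload = json.loads(event_body.decode("utf-8"))
    if not isinstance(payload, dict):
        raise ValueError("expected one JSON object per event")

    document = dict(payload)
    # Cosmos DB items need a string `id`: reuse the message's own id when it
    # has one (hypothetical field), otherwise generate a random UUID.
    raw_id = payload.get("id")
    document["id"] = str(raw_id) if raw_id is not None else str(uuid.uuid4())
    return document
```

Keeping this step as a plain function makes it easy to check locally before it is wired to an Event Hub trigger.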

You can read more in the following BLOG POST.

image

The figure above gives a high-level overview of the inputs and outputs of the data processing. The main aim is to generate a JSON file containing the messages that I will send via HTTP requests to the Azure API Management endpoint.
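
As an illustration of that processing step, the sketch below builds such a JSON file. It rests on assumptions the README does not state: tweets are already loaded as dictionaries, images sit in folders named after their group (`damage` or `no_damage`), each tweet is paired with one image by cycling through the image list, and images are embedded as base64 text. The function and field names are hypothetical.

```python
import base64
import json
from pathlib import Path


def build_messages(tweets: list[dict], image_paths: list[Path]) -> list[dict]:
    """Pair each tweet with one image, cycling when there are fewer images."""
    messages = []
    for index, tweet in enumerate(tweets):
        message = dict(tweet)
        if image_paths:
            image_path = image_paths[index % len(image_paths)]
            message["image_name"] = image_path.name
            # Assumed layout: the parent folder name is the damage label.
            message["label"] = image_path.parent.name
            # JSON cannot hold raw bytes, so the image is base64-encoded.
            message["image_b64"] = base64.b64encode(image_path.read_bytes()).decode("ascii")
        messages.append(message)
    return messages


def write_messages(messages: list[dict], output_path: Path) -> None:
    """Write all messages to a single JSON file that a sender can replay."""
    output_path.write_text(json.dumps(messages, ensure_ascii=False, indent=2), encoding="utf-8")
```

Embedding images as base64 keeps each message self-contained, at the cost of roughly a third more bytes per image than the raw file.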

You can read more in the following BLOG POST.

image

Now that the data is in the desired format and has the relevant content, we can move on to Azure and select the services that will help us reach our goal: building a data streaming pipeline.

In this blog post I touch upon the following Azure services (see also the diagram below):

  • Azure API Management

  • Azure Functions

  • Azure Key Vault

  • Azure Blob Storage

You can read more in the following BLOG POST.

image

In this post, I cover the part of the pipeline that runs from the event ingestor, Azure Event Hubs, to writing messages into a NoSQL Cosmos DB database, and finally query the database through the Power BI Desktop connector to build a few simple charts.

You can read more in the following BLOG POST.

image

Author: 👤 Kristijan Bakaric

Follow Me On

Show your support

Give a ⭐️ if this project helped you!

Markdown Cheat Sheet

Links used along the project
