Main goals:
Engineer a streaming data processing pipeline on Azure whose main purpose is to ingest and process tweets and satellite images from the Hurricane Harvey natural disaster, and to serve a Power BI report.
This is a meta repository that contains documentation and links to several GitHub repositories, each with a distinct purpose:
- hurricane-proc-send-data - pre-processing of tweets about the Hurricane Harvey events, combining them with satellite images of buildings with and without damage, and simulating a streaming data source by building a Python program that sends requests to an Azure API endpoint (#TODO fire CLI)
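A minimal sketch of such a sender loop, using the Requests library the project lists among its tools. The endpoint URL, subscription-key value, and message fields here are placeholders and assumptions, not the repository's actual values:

```python
import json
import time

API_URL = "https://example.azure-api.net/hurricane/ingest"  # hypothetical APIM endpoint
SUBSCRIPTION_KEY = "<your-subscription-key>"  # placeholder, not a real key

def send_messages(messages, delay_s=0.5, post=None):
    """POST each pre-processed message to the API endpoint, pausing between
    requests to simulate a streaming source. Returns the HTTP status codes.
    `post` is injectable for testing; by default it uses requests.post."""
    if post is None:
        import requests  # the HTTP library the project lists under its tools
        post = requests.post
    headers = {
        # APIM commonly expects the caller's key in this header.
        "Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY,
        "Content-Type": "application/json",
    }
    statuses = []
    for msg in messages:
        resp = post(API_URL, headers=headers, data=json.dumps(msg))
        statuses.append(resp.status_code)
        time.sleep(delay_s)
    return statuses
```

The fixed delay between requests is what makes a static file behave like a stream from the pipeline's point of view.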
- Data Pipeline:
  - hurricane-streaming-az-funcs - Azure streaming data pipeline that:
    - Ingests tweets from the local source client via Azure API Management, with an Azure Function as a backend
    - Utilizes Azure Event Hubs as a message queue service
    - Runs an Azure Function that takes messages from Azure Event Hubs and writes them to Azure Cosmos DB
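The last step above can be sketched as an Event Hub-triggered Function. This is an illustrative sketch, not the repository's code: the binding names are hypothetical, and the trigger/output bindings would live in function.json:

```python
import json
import uuid

def event_to_document(event_body: bytes) -> dict:
    """Turn a raw Event Hub message body into a Cosmos DB document.
    Cosmos DB requires an 'id' field, so generate one when it is missing."""
    doc = json.loads(event_body)
    doc.setdefault("id", str(uuid.uuid4()))
    return doc

try:
    import azure.functions as func

    def main(event: func.EventHubEvent, doc: func.Out[func.Document]) -> None:
        # Runs once per Event Hub message; the Cosmos DB output binding
        # (declared in function.json) persists the document we set here.
        doc.set(func.Document.from_dict(event_to_document(event.get_body())))
except ImportError:
    pass  # azure-functions is only available in the Functions runtime / dev env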
- ⚙️ Data Engineering Project ⚙️
- 🌪️🌪️ Hurricane Harvey Tweets and Satellite Images - Azure Data Pipelines and Data Visualization 🌪️🌪️
- Introduction & Goals
- The Data Sets
- Used Tools
- Pipeline
- Author: 👤 Kristijan Bakaric
- Follow Me On
Tools:

Local:
- as operating system for local development
- Visual Studio Code with plugins for Azure Services - local development and deployment to Azure (Azure Functions, Azure Web App)
- Python and its libraries (Pandas, Requests) - data processing and sending HTTPS requests to Azure API Management
- for local development and as a deployment solution of the Python Streamlit Web App to Azure Web App
- Azure SDKs for the relevant Azure services in the Streamlit App use case - azure-cosmos
- Power BI - visualization of data from Azure Cosmos DB
Azure:
- Azure Cosmos DB - SQL Core - Document Store
The Data Sets:
- Hurricane Harvey Tweets from Kaggle.
  Tweets containing Hurricane Harvey from the morning of 8/25/2017. I hope to keep this updated if computer problems do not persist.
  *8/30 update: includes the most recent tweets tagged "Tropical Storm Harvey", spanning 8/20 to 8/30, as well as the properly merged version of the dataset, including tweets from before Harvey was downgraded back to a tropical storm.
- Satellite Images of Hurricane Damage from Kaggle.
  Overview: the data are satellite images from Texas after Hurricane Harvey, divided into two groups (damage and no_damage). The goal is to make a model which can automatically identify whether a given region is likely to contain flooding damage.
  Source: data originally taken from https://ieee-dataport.org/open-access/detecting-damaged-buildings-post-hurricane-satellite-imagery-based-customized, can be cited with http://dx.doi.org/10.21227/sdad-1e56, and the original paper is at https://arxiv.org/abs/1807.01688
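Since the image set encodes its label in the folder structure (damage/ vs. no_damage/), a Pandas sketch of deriving labels from the paths might look like this (the exact directory layout is an assumption based on the Kaggle description):

```python
import pandas as pd

def label_images(paths):
    """Derive a binary damage label from each image's folder name, since the
    Kaggle set is split into damage/ and no_damage/ directories."""
    return pd.DataFrame({
        "image_path": paths,
        # Match the exact path component; a substring test would wrongly
        # flag "no_damage" paths because it contains "damage".
        "damage": ["damage" in p.split("/") for p in paths],
    })
```

A DataFrame like this can then be joined with the tweet data to form the single streaming source file.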
- Azure API Management
API Management (APIM) is a way to create consistent and modern API gateways for existing back-end services.
- Azure Event Hubs
Azure Event Hubs is a big data streaming platform and event ingestion service. It can receive and process millions of events per second. Data sent to an event hub can be transformed and stored by using any real-time analytics provider or batching/storage adapters.
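Publishing to an event hub from Python typically goes through the azure-eventhub SDK in batches. A hedged sketch (connection string, hub name, and batch size are placeholders, and the repository may publish differently):

```python
def chunk(messages, size):
    """Group messages so each send stays within an Event Hubs batch limit."""
    return [messages[i:i + size] for i in range(0, len(messages), size)]

try:
    from azure.eventhub import EventData, EventHubProducerClient

    def publish(messages, conn_str, hub_name, size=100):
        """Publish string messages to the event hub, batch by batch."""
        client = EventHubProducerClient.from_connection_string(
            conn_str, eventhub_name=hub_name
        )
        with client:
            for group in chunk(messages, size):
                batch = client.create_batch()
                for m in group:
                    batch.add(EventData(m))
                client.send_batch(batch)
except ImportError:
    pass  # azure-eventhub SDK is only needed where the pipeline runs
```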
- Azure Function
Azure Functions is a serverless solution that allows you to write less code, maintain less infrastructure, and save on costs. Instead of worrying about deploying and maintaining servers, the cloud infrastructure provides all the up-to-date resources needed to keep your applications running.
- Azure Blob Storage
Azure Blob storage is Microsoft's object storage solution for the cloud. Blob storage is optimized for storing massive amounts of unstructured data. Unstructured data is data that doesn't adhere to a particular data model or definition, such as text or binary data.
- Azure Cosmos DB - SQL Core - Document Store
Azure Cosmos DB is a fully managed NoSQL database for modern app development. Single-digit millisecond response times, and automatic and instant scalability, guarantee speed at any scale. Business continuity is assured with SLA-backed availability and enterprise-grade security.
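Reading the stored documents back (for the Streamlit app or ad-hoc checks) goes through the azure-cosmos SDK listed under the project's tools. The database/container names and document fields below are assumptions for illustration:

```python
def damage_tweet_query(limit=100):
    """Build a SQL Core API query for the latest damage-flagged documents.
    The field names (c.text, c.timestamp, c.damage) are assumed, not
    taken from the repository."""
    return (
        f"SELECT TOP {limit} c.id, c.text, c.timestamp FROM c "
        "WHERE c.damage = true ORDER BY c.timestamp DESC"
    )

try:
    from azure.cosmos import CosmosClient

    def fetch_damage_tweets(url, key, limit=100):
        # Hypothetical account URL/key and database/container names.
        container = (
            CosmosClient(url, credential=key)
            .get_database_client("hurricane")
            .get_container_client("tweets")
        )
        return list(
            container.query_items(
                query=damage_tweet_query(limit),
                enable_cross_partition_query=True,
            )
        )
except ImportError:
    pass  # azure-cosmos SDK is only needed where the app runs
```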
- Power BI Desktop Report
Rich, interactive reports with visual analytics.
The upcoming posts will consist of writing about:
- Python functions and modules that process the data from Kaggle, combine tweets and satellite images into a single file acting as a source of streaming data, and build a Python program that will send requests to an Azure API endpoint.
- Azure streaming data pipeline that:
  - Ingests tweets from the local source client via Azure API Management, with an Azure Function as a backend.
  - Utilizes Azure Event Hubs as a message queue service.
  - Runs an Azure Function that takes messages from Azure Event Hubs and writes them to Azure Cosmos DB.
You can read more in the following BLOG POST.
The figure above gives a high-level overview of the inputs and outputs of the data processing, with the main aim of generating a JSON file that contains the messages I will send via HTTP requests to the Azure API Management endpoint.
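One way to serialize such a file so that a sender client can replay it record by record is one JSON object per line; this JSON-lines layout is my assumption about the format, not necessarily what the project uses:

```python
import json

def write_messages(records, path):
    """Write each combined tweet+image record as one JSON object per line,
    so the sender client can replay the file as a message stream."""
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

def read_messages(path):
    """Read the file back into the list of message dicts."""
    with open(path) as f:
        return [json.loads(line) for line in f]
```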
You can read more in the following BLOG POST.
Now that we have the data in the desired format and with the relevant content, we can step into the world of Azure, where we will select a suite of services to help us reach our goal: building a data streaming pipeline.
In this blog post I touch upon the following Azure services (see also the diagram below):
- Azure API Management
- Azure Functions
- Azure Key Vault
- Azure Blob Storage
You can read more in the following BLOG POST.
In this post, I will cover the section of the pipeline that goes from the event ingestor - Azure Event Hubs - to writing messages into the NoSQL Cosmos DB, and finally querying the database via the Power BI Desktop connector with a few simple charts.
You can read more in the following BLOG POST.
- Website: personal-website
- Twitter: @kbakaric1
- GitHub: @baky0905
- LinkedIn: @kristijanb
Give a ⭐️ if this project helped you!