Disaster Response Pipeline Project

Instructions

Run the following commands in the project's root directory to set up your database and model.
- To run the ETL pipeline that cleans the data and stores it in the database: python data/process_data.py data/disaster_messages.csv data/disaster_categories.csv data/DisasterResponse.db
- To run the ML pipeline that trains the classifier and saves the model: python models/train_classifier.py data/DisasterResponse.db models/nb_classifier.pkl
Type the following command in the app's directory to run the web app on your local computer: python run.py
Go to http://0.0.0.0:3001/ or http://localhost:3001/

Installation

Most of the libraries used in this project are already included in the Anaconda distribution of Python. The libraries the web app relies on are listed in requirements.txt.

This script was written using Python version 3.*.

Project Motivation

Whenever a disaster happens, whether it's a storm or an earthquake, people reach out for help through channels that range from direct messages to social media posts.

In such circumstances, the number of messages can be huge. That makes it difficult to classify each one and route it to the authority responsible for that kind of request, which could be anything from an electricity outage to a road blocked after a storm.

This project is built on a dataset provided by FigureEight, containing prelabeled tweets and text messages from real-life disasters. It runs an ETL pipeline on the data and a machine learning pipeline that trains a supervised model to automatically classify messages into 36 different classes, including weather_related, storm, earthquake, food, water, electricity, and so on.
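The ETL step can be sketched roughly as follows. This is a minimal sketch, not the project's actual process_data.py: it assumes the categories file stores all labels for a message in a single semicolon-separated string such as `water-1;storm-0`, that the two files share an `id` column, and the table name `messages` is hypothetical.

```python
import sqlite3

import pandas as pd


def clean_categories(messages, categories):
    """Merge messages with their labels and expand the label string into 0/1 columns."""
    df = messages.merge(categories, on="id")
    # Assumed format: "water-1;storm-0;..." (one "name-value" pair per label).
    split = df["categories"].str.split(";", expand=True)
    split.columns = [pair.split("-")[0] for pair in split.iloc[0]]
    for col in split.columns:
        split[col] = split[col].str.split("-").str[-1].astype(int)
    df = pd.concat([df.drop(columns="categories"), split], axis=1)
    return df.drop_duplicates()


def save_to_db(df, db_path, table="messages"):
    """Store the cleaned frame in a SQLite database (table name is hypothetical)."""
    conn = sqlite3.connect(db_path)
    try:
        df.to_sql(table, conn, index=False, if_exists="replace")
    finally:
        conn.close()


# Toy rows invented for illustration.
messages = pd.DataFrame({"id": [1, 2], "message": ["We need water", "Roads are blocked"]})
categories = pd.DataFrame({"id": [1, 2], "categories": ["water-1;storm-0", "water-0;storm-1"]})
cleaned = clean_categories(messages, categories)
```

Calling `save_to_db(cleaned, "data/DisasterResponse.db")` would then write the table that the training step reads.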

The challenge is to tune the model so that it captures all the main topics in a text, given that one message can belong to several different classes. In addition, there are some classes with few examples, which makes it difficult for the model to precisely 'understand' the subject.

In real life, this could be an important tool for making sure these messages are actually delivered to the right departments, enabling a faster response for those in need.
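To illustrate what "one message, several classes" means in practice, here is a minimal multi-label sketch using scikit-learn. It is not the project's actual pipeline: the choice of `TfidfVectorizer`, `MultiOutputClassifier` and `MultinomialNB` is an assumption (the repository only states that a Naive Bayes classifier is used), and the toy messages and labels are invented.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data invented for illustration; label columns are [water, storm, electricity].
texts = [
    "we need drinking water",
    "the storm destroyed our roof",
    "no electricity since the storm",
    "water and power are both out",
]
labels = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 1],
    [1, 0, 1],
])

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    # One independent Naive Bayes classifier is fitted per label column.
    ("clf", MultiOutputClassifier(MultinomialNB())),
])
model.fit(texts, labels)

# The prediction is one row per message with one 0/1 entry per label.
predicted = model.predict(["storm knocked out the electricity"])
```

Because each label gets its own classifier, a single message can be flagged for several labels at once.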

File Descriptions

disaster_messages.csv: .csv file containing the raw data with the disaster messages.
disaster_categories.csv: .csv file with the labels for each message.
process_data.py: ETL pipeline that reads the .csv files, cleans the data and stores it in a database (DisasterResponse.db).
customized_transformers.py: defines the custom transformers applied during the machine learning pipeline.
train_classifier.py: machine learning pipeline that reads the stored data and runs a GridSearchCV to train the message classifier.
run.py: uses the data to create the Plotly visualizations shown in the web app, and serves the web app on the local machine.
DisasterResponse.db: database containing the data after being processed by the ETL pipeline.
nb_classifier.pkl: Naïve Bayes classifier trained by the machine learning pipeline.
master.html: Bootstrap index webpage of the web app, containing visualizations and the form for typing the message to be classified.
go.html: Bootstrap webpage that presents the labels predicted for the message by the nb_classifier.
Procfile: used for deployment; it tells Heroku what to do when starting the web app.
requirements.txt: lists all the libraries the web app relies on (also for deployment purposes).

Results

This project results in a web app that can be deployed or run locally and used to classify a disaster message according to its related topics.

Further improvements are possible, especially in increasing the recall for labels that have few or no observations in the dataset, or for labels that the model cannot detect from the context.

One possible future approach is to reduce the class imbalance by creating synthetic text (for example, using synonyms) or by creating new features that emphasize important words in smaller classes, as was done for the electricity label to increase the model's ability to capture it.
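A feature that emphasizes important words can be sketched as a small custom scikit-learn transformer. This is only an illustration of the idea, not the code in customized_transformers.py: the class name `KeywordFlagger` and the keyword list are hypothetical.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class KeywordFlagger(BaseEstimator, TransformerMixin):
    """Emit one binary feature: 1 if a message contains any of the given keywords."""

    def __init__(self, keywords=("electricity", "power", "blackout")):
        # Hypothetical keyword list chosen for illustration.
        self.keywords = keywords

    def fit(self, X, y=None):
        # Nothing to learn; the keyword list is fixed up front.
        return self

    def transform(self, X):
        # Simple substring match, so "powerful" would also be flagged.
        flags = [
            int(any(word in text.lower() for word in self.keywords))
            for text in X
        ]
        return np.array(flags).reshape(-1, 1)


flagger = KeywordFlagger()
features = flagger.fit_transform(["Power is out in my street", "We need food"])
```

A transformer like this could be combined with the text features through scikit-learn's `FeatureUnion`, so the classifier sees the keyword flag alongside the word counts.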

The model was built to prioritize the recall metric because, in real life, false positives are preferable to false negatives: a false negative means a disaster message that is not delivered to the right place.
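One way to make a grid search favor recall is to pass a recall-based scorer to `GridSearchCV`. The sketch below assumes a pipeline of `CountVectorizer` plus a per-label `MultinomialNB`; the parameter grid, the macro averaging and the toy data are assumptions for illustration, not the settings used in train_classifier.py.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data invented for illustration; label columns are [water, storm].
texts = ["we need water", "storm hit the town",
         "water ran out after the storm", "send food"] * 3
labels = np.array([[1, 0], [0, 1], [1, 1], [0, 0]] * 3)

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", MultiOutputClassifier(MultinomialNB())),
])

# Macro-averaged recall weights every label equally, so rare labels
# count as much as common ones when candidates are compared.
recall_scorer = make_scorer(recall_score, average="macro", zero_division=0)

search = GridSearchCV(
    pipeline,
    param_grid={"clf__estimator__alpha": [0.1, 1.0]},  # hypothetical grid
    scoring=recall_scorer,
    cv=3,
)
search.fit(texts, labels)
best_alpha = search.best_params_["clf__estimator__alpha"]
```

With this scorer, the search keeps the candidate with the highest cross-validated recall rather than the highest accuracy.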

The model achieved an average recall close to 75% on the test set, but some classes, such as 'water', are still tricky for it to understand.
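For reference, per-label recall and its average can be computed as in the sketch below. The numbers are toy values invented for illustration and are unrelated to the project's results; how the project itself averages recall across labels is not stated, so the plain mean over labels with at least one positive example is an assumption.

```python
def per_label_recall(y_true, y_pred):
    """Return recall for each label column; None when a label has no positive examples."""
    n_labels = len(y_true[0])
    recalls = []
    for j in range(n_labels):
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        positives = sum(1 for t in y_true if t[j] == 1)
        # Recall = true positives / all actual positives for this label.
        recalls.append(true_pos / positives if positives else None)
    return recalls


# Toy values invented for illustration; columns are [water, storm, electricity].
y_true = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [1, 0, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [0, 1, 1], [0, 0, 0]]

recalls = per_label_recall(y_true, y_pred)
# Average only over labels where recall is defined.
defined = [r for r in recalls if r is not None]
average_recall = sum(defined) / len(defined)
```

Looking at the per-label values rather than only the average makes weak labels visible, which is where the average alone can be misleading.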

The project was deployed as a web application, and it can be accessed here.

Licensing, Authors, Acknowledgements

Credits must be given to the FigureEight company for providing the prelabeled data, and to Udacity for proposing this amazing project that results in a direct impact on people's lives.

evertonbin/disaster-response
