This project examines the fatality rates of traffic accidents in the US and the factors that might affect these rates. It utilizes several big data tools: an AWS EMR cluster, HDFS, Hive, Spark, and HBase.
Contributors: Linh Dinh
- Bureau of transportation data:
  - Actual fatal accident data for 2016-2018
  - Sampled non-fatal accident data for 2016-2018 (NOT COMPLETE DATA COVERAGE)
  - I used these two data sources to train a Random Forest model that predicts "fatal cases". The model (see `4. ML_spark.scala`) suggests that a few factors are "important":
    - Weather
    - Light condition: day vs. night
    - Whether the accident occurred at a junction
    - Day of the week
    - Hour of day
    - Etc.
- Kaggle data (https://www.kaggle.com/sobhanmoosavi/us-accidents) on total self-reported US accidents for 2016-2020. I used this dataset as the denominator of the fatality rate (number of fatal accidents / number of total accidents), because the sampled non-fatal accident data described above do not provide complete coverage (they are a random sample drawn from a selected number of locations).
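The fatality rate described above can be sketched in plain Scala as follows. This is only an illustration of the numerator/denominator logic, not the project's Spark code: the `Accident` type, its field names, and the `FatalityRate` object are hypothetical, and the real tables have many more columns.

```scala
// Hypothetical record type: the field names are illustrative, not the project's schema.
final case class Accident(state: String, year: Int)

object FatalityRate {
  /** Counts accidents per (state, year) key. */
  def countByStateYear(rows: Seq[Accident]): Map[(String, Int), Long] =
    rows.groupBy(a => (a.state, a.year)).map { case (key, group) => key -> group.size.toLong }

  /**
   * Fatality rate = fatal accidents / total accidents, per (state, year).
   * Only keys present in the denominator (total accidents) are returned;
   * keys with no fatal records get a rate of 0.0.
   */
  def fatalityRate(fatal: Seq[Accident], total: Seq[Accident]): Map[(String, Int), Double] = {
    val fatalCounts = countByStateYear(fatal)
    countByStateYear(total).collect {
      case (key, n) if n > 0 => key -> fatalCounts.getOrElse(key, 0L).toDouble / n
    }
  }
}
```

Calling `FatalityRate.fatalityRate(fatal, total)` with the fatal-accident records as the numerator and the Kaggle records as the denominator returns one rate per state and year. Because the two sources cover different year ranges, a state-year with no fatal records comes out as 0.0 in this sketch and may need separate handling.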
The final output shows, by state and year:
- the fatality rate for several conditions that might influence whether an accident is fatal: day vs. night time, at a junction, weather, etc.
- average number of minutes it takes injured persons to arrive at the hospital
- average number of hospitals within a 10-mile radius of the accident
- share of state spending on highway investments and health investments
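This README does not say how the "hospitals within a 10 mile radius" figure is derived. If it had to be computed from coordinates, one possible approach is a great-circle (haversine) distance check, sketched below. The `Point` type, the `HospitalsNearby` object, and the Earth radius constant of 3958.8 miles are assumptions made for illustration, not part of the project.

```scala
object HospitalsNearby {
  // Hypothetical coordinate type (degrees); not the project's schema.
  final case class Point(lat: Double, lon: Double)

  // Mean Earth radius in miles (assumed constant for this sketch).
  private val EarthRadiusMiles = 3958.8

  /** Great-circle distance in miles between two points (haversine formula). */
  def haversineMiles(a: Point, b: Point): Double = {
    val dLat = math.toRadians(b.lat - a.lat)
    val dLon = math.toRadians(b.lon - a.lon)
    val h = math.pow(math.sin(dLat / 2), 2) +
      math.cos(math.toRadians(a.lat)) * math.cos(math.toRadians(b.lat)) *
        math.pow(math.sin(dLon / 2), 2)
    2 * EarthRadiusMiles * math.asin(math.sqrt(h))
  }

  /** Number of hospitals within `radiusMiles` of one accident. */
  def countWithin(accident: Point, hospitals: Seq[Point], radiusMiles: Double = 10.0): Int =
    hospitals.count(h => haversineMiles(accident, h) <= radiusMiles)

  /** Average hospital count over a group of accidents; None if the group is empty. */
  def averageCount(accidents: Seq[Point], hospitals: Seq[Point], radiusMiles: Double = 10.0): Option[Double] =
    if (accidents.isEmpty) None
    else Some(accidents.map(countWithin(_, hospitals, radiusMiles)).sum.toDouble / accidents.size)
}
```

Grouping accidents by state and year first and then calling `averageCount` on each group would give a per-state, per-year average. The brute-force pairwise check is fine for a sketch but would likely need a spatial index or a join strategy at full data scale.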
The application is packaged and deployed on an AWS single server using CodeDeploy.
- `0. ingest_data.sh`: Code to ingest the needed data
- `1. create_truth_tables.hql`: HQL queries to create ground truth tables in Hive
- `2. batch_layer.scala`: Spark code to create batch layer tables in Hive
- `3. create_hbase_tables.hql`: Code to create HBase tables for the serving layer
- `4. ML_spark.scala`: ML code to train a random forest model
- folder `app`: Java and HTML code to deploy the app on an AWS instance