
Wildfires

Predicting Wildfire Severity in California Leveraging Google Earth Engine and Machine Learning

This project predicts wildfire severity based on the location, time of year, and environmental features of historic wildfires, leveraging machine learning algorithms. Wildfire records (location, date, and burned area) were sourced from the United States Forest Service (USFS). These records were enriched with additional features from several spatiotemporal datasets via Google Earth Engine and then used to train a series of machine learning algorithms that predict, given a location and date of discovery, whether a wildfire will burn more than 300 acres.

View the presentation, full report, or check out the summary below.

Why wildfire prediction?

Wildfires are a major natural hazard in California and the severity of wildfires has increased substantially in recent years. A model that predicts which fires have the potential to become most severe would enable responders to make more informed decisions about how to allocate limited resources, protecting lives, property, and the environment.

Challenges

  1. No explanatory features: the primary dataset of wildfire records only includes the location, date, and final burned area of each wildfire. All explanatory features had to be extracted from various spatiotemporal weather datasets or spatial data on topography, vegetation type, and habitat.
  2. Localized factors: Wildfire hazard is highly localized and time-specific. Weather and environmental data needed both high spatial granularity and high temporal granularity. This required leveraging Google Earth Engine’s cloud computing resources to handle large volumes of daily gridded time series.
  3. Imbalanced data: large wildfires (the positive class) make up a very small portion of the dataset. I attempted to address this through class weighting and SMOTE oversampling.

Project Summary

The following sections summarize key steps in my process. Check out the Jupyter notebooks to see how it was done!

Data Wrangling + Feature Extraction

Data Wrangling + Feature Extraction Notebook

Data Sources

The primary data source used for this project is the official [wildfire database](https://www.fs.usda.gov/rds/archive/Catalog/RDS-2013-0009.4/) maintained by the US Forest Service. Each wildfire record includes discovery date, location (latitude and longitude), and final fire size. Filtering the wildfire dataset to California wildfires 2005-2015 returns over 80,000 events.
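The database is distributed as a SQLite archive; a minimal sketch of this filtering step, assuming the table and column names from the published FPA FOD schema, could look like:

```python
import sqlite3
import pandas as pd

# Pull California fires from the USFS FPA FOD SQLite archive.
# File and column names follow the published FPA FOD schema;
# adjust if your copy of the database differs.
conn = sqlite3.connect("FPA_FOD_20170508.sqlite")
fires = pd.read_sql_query(
    """
    SELECT LATITUDE, LONGITUDE, DISCOVERY_DATE, FIRE_SIZE, FIRE_YEAR, COUNTY
    FROM Fires
    WHERE STATE = 'CA' AND FIRE_YEAR BETWEEN 2005 AND 2015
    """,
    conn,
)
print(len(fires))  # expect > 80,000 records
```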

The wildfire dataset does not include explanatory features that could be used to predict wildfire behavior, so these features needed to be joined from other sources based on time and location:

| Variable | Time frame | Description | Source | Format |
| --- | --- | --- | --- | --- |
| Elevation | n/a | Vertical elevation above sea level (NAVD 88), meters | USGS National Elevation Dataset, via Google Earth Engine | 2D grid, 10.2 m resolution |
| Slope | n/a | Vertical slope, degrees | Calculated from elevation dataset | 2D grid, 10.2 m resolution |
| Aspect | n/a | Direction of slope face, degrees from North (clockwise) | Calculated from elevation dataset | 2D grid, 10.2 m resolution |
| Temperature | Preceding 7 days | Maximum daily temperature, °C | PRISM Daily Spatial Climate Dataset (“tmax” band) | Gridded time series, 4 km resolution |
| Dew point | Preceding 7 days | Daily mean dew point temperature (a measure of air moisture), °C | PRISM Daily Spatial Climate Dataset (“tdmean” band) | Gridded time series, 4 km resolution |
| Precipitation | Preceding year | Monthly precipitation, millimeters | PRISM Monthly Spatial Climate Dataset (“ppt” band) | Gridded time series, 4 km resolution |
| Wind speed | Day of discovery | Wind speed, meters per second | GRIDMET: University of Idaho Gridded Surface Meteorological Dataset (“vs” band), via Google Earth Engine | Gridded time series, 4 km resolution |
| Energy Release Component (ERC) | Day of discovery | Index of the available energy (BTU) per unit area (square foot) within the flaming front at the head of a fire; each daily calculation considers the past 7 days | GRIDMET (“erc” band), via Google Earth Engine | Gridded time series, 4 km resolution |
| Burning Index (BI) | Day of discovery | A measure of fire intensity; unitless, but in general 10 times the flame length of a fire | GRIDMET (“bi” band), via Google Earth Engine | Gridded time series, 4 km resolution |
| 100-hour dead fuel moisture | Day of discovery | Modeled moisture content of dead fuels in the 1 to 3 inch diameter class; ranges from 1 to 50 percent | GRIDMET (“fm100” band), via Google Earth Engine | Gridded time series, 4 km resolution |
| 1000-hour dead fuel moisture | Day of discovery | Modeled moisture content of dead fuels in the 3 to 8 inch diameter class; ranges from 1 to 40 percent | GRIDMET (“fm1000” band), via Google Earth Engine | Gridded time series, 4 km resolution |
| Vegetation Type | n/a | Vegetation types, compiled from a variety of state/federal sources into a single comprehensive dataset | CALFIRE Forest and Rangeland Assessment | 2D grid, 30 m resolution |
| Level III Ecoregions | n/a | Areas where ecosystems (and the type, quality, and quantity of environmental resources) are generally similar | Ecoregions of the Continental United States, US EPA | Shapefile (polygon vector) |
| Burn probability | n/a | Simulated mean annual burn probability from the FSim probabilistic wildfire model | Wildfire Hazard Potential for the United States, US Forest Service | 2D grid, 270 m resolution |
| Fire intensity level (1-6) | n/a | FSim output consisting of six raster layers, each representing the portion of simulated fires that burned in the cell at the specified flame length | Wildfire Hazard Potential for the United States, US Forest Service | 2D grid, 270 m resolution |

Data Cleaning

Initial data cleaning steps included:

  • Convert Julian dates to a standard date format
  • Convert data to a GeoPandas GeoDataFrame object to enable spatial processing
  • Fill missing values for the COUNTY column via a spatial join to county boundary polygons
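
A sketch of these steps with pandas and GeoPandas (the county shapefile path and its name field are hypothetical):

```python
import geopandas as gpd
import pandas as pd

# Julian day numbers -> standard datetimes
fires["DISCOVERY_DATE"] = pd.to_datetime(
    fires["DISCOVERY_DATE"], unit="D", origin="julian"
)

# Plain DataFrame -> GeoDataFrame with point geometry in WGS 84
fires_gdf = gpd.GeoDataFrame(
    fires,
    geometry=gpd.points_from_xy(fires["LONGITUDE"], fires["LATITUDE"]),
    crs="EPSG:4326",
)

# Fill missing COUNTY values via spatial join to county polygons
counties = gpd.read_file("ca_counties.shp").to_crs(fires_gdf.crs)  # hypothetical file
joined = gpd.sjoin(
    fires_gdf, counties[["NAME", "geometry"]], how="left", predicate="within"
)
fires_gdf["COUNTY"] = fires_gdf["COUNTY"].fillna(joined["NAME"])
```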

Google Earth Engine

Topography variables (elevation, slope, aspect) as well as weather and environmental variables from the PRISM and GRIDMET datasets were accessed via Google Earth Engine (GEE). GEE is a spatial cloud computing platform that hosts a wide variety of geospatial datasets (with a focus on remote sensing/satellite imagery and weather data) and enables users to perform computationally intensive analyses on Google’s cloud. Leveraging the earthengine-api Python package, I wrote a series of custom functions to extract relevant data for the date and location of each wildfire.
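
The actual extraction functions are in the notebook; as a minimal sketch of the pattern, sampling one GRIDMET band at a fire’s location on its discovery date might look like this (the function name and example coordinates are illustrative):

```python
import ee

ee.Initialize()

def sample_gridmet(lat, lon, date, band):
    """Sample a single GRIDMET band at a point on a given date."""
    point = ee.Geometry.Point(lon, lat)
    image = (
        ee.ImageCollection("IDAHO_EPSCOR/GRIDMET")
        .filterDate(date, ee.Date(date).advance(1, "day"))
        .select(band)
        .first()
    )
    # Reduce the 4 km grid cell under the point to a single value
    return image.reduceRegion(
        reducer=ee.Reducer.first(), geometry=point, scale=4000
    ).get(band).getInfo()

# e.g. Energy Release Component on the discovery date of one fire
erc = sample_gridmet(38.58, -121.49, "2008-06-21", "erc")
```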

Other Spatial Processing

Additional spatial datasets such as Vegetation Type and Ecoregions were downloaded and joined to the main table via spatial joins (leveraging GeoPandas for polygon vectors and rasterio for gridded data). All necessary geographic coordinate system/projection conversions were performed before joining to ensure spatial accuracy.
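
A sketch of both join types, assuming hypothetical file paths (the ecoregion field name follows the EPA shapefile schema):

```python
import geopandas as gpd
import rasterio

# Polygon join: attach the Level III ecoregion name to each fire point
ecoregions = gpd.read_file("us_eco_l3.shp").to_crs(fires_gdf.crs)
fires_gdf = gpd.sjoin(
    fires_gdf, ecoregions[["US_L3NAME", "geometry"]], how="left", predicate="within"
).drop(columns="index_right")

# Raster sampling: read the vegetation type cell value under each point,
# reprojecting the points into the raster's CRS first
with rasterio.open("veg_type.tif") as src:
    pts = fires_gdf.to_crs(src.crs)
    coords = [(geom.x, geom.y) for geom in pts.geometry]
    fires_gdf["veg_type"] = [val[0] for val in src.sample(coords)]
```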

Exploratory Data Analysis

EDA Notebook

Before attempting to train a machine learning algorithm, it was important to validate some initial claims and to explore relationships between the target and explanatory features, as well as general trends in the data.

The majority of wildfires are small, but large wildfires cause most of the damage. Of the 83,606 wildfires recorded in California from 2005 to 2015, 51% burned 0.25 acres or less (about a third the size of an American football field). Only 1.2% burned 300 acres or more, yet that 1.2% accounted for 96% of the total area burned.

Wildfire size

During peak wildfire season, hundreds of wildfires can start on the same day. The figure below displays the number of wildfires discovered in California each day from 2005 to 2015. Seasonal oscillations are apparent, with new wildfires per day peaking in the summer. On 33 different days, 100 or more new fires were discovered.

Wildfire Frequency

There is a severe class imbalance between small and large wildfires. While an ideal model would predict the size class of a given wildfire, the low number of records in the larger size classes means a binary classification model will likely achieve better results while still being useful for wildfire prioritization. Any wildfire with the potential to burn more than 300 acres is clearly of concern.

Wildfire Size Class

For all continuous variables, I did visual EDA and two-tailed t-tests comparing small vs. large wildfires. In general, the means of most variables differ significantly (low p-values) between the two classes, suggesting that these variables will have predictive power during modelling. Here is one example:

Dew Point
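
For illustration, each test looked roughly like this sketch (the "dew_point" column name is hypothetical):

```python
from scipy import stats

# Two-tailed Welch's t-test: does mean dew point differ between
# small (< 300 acres) and large (>= 300 acres) wildfires?
small = fires_gdf.loc[fires_gdf["FIRE_SIZE"] < 300, "dew_point"].dropna()
large = fires_gdf.loc[fires_gdf["FIRE_SIZE"] >= 300, "dew_point"].dropna()
t_stat, p_value = stats.ttest_ind(large, small, equal_var=False)
```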

Preprocessing

Preprocessing Notebook

Prior to modelling, transformations were applied to the continuous explanatory variables in order to reduce skew and bring their distributions as close to normal as possible.
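
The notebook documents the per-variable transforms; as a sketch of one common approach, a Yeo-Johnson power transform can be fit to the continuous columns (the column names here are hypothetical):

```python
from sklearn.preprocessing import PowerTransformer

# Yeo-Johnson handles zero and negative values, unlike Box-Cox
continuous_cols = ["tmax_7day", "dew_point", "precip_12mo", "erc"]  # hypothetical
pt = PowerTransformer(method="yeo-johnson")
fires_gdf[continuous_cols] = pt.fit_transform(fires_gdf[continuous_cols])
```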

Aspect (i.e. the cardinal direction a slope faces, in degrees from north) and discovery day of the year are cyclical features: their values “wrap around,” so the highest values are adjacent to the lowest. To make this apparent to the models, each feature was transformed into a pair of out-of-phase sine and cosine components.
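
A minimal sketch of this encoding (column names are hypothetical):

```python
import numpy as np

# Map each cyclical feature onto a sine/cosine pair so that
# 359 degrees lands next to 1 degree and day 365 lands next to day 1
fires_gdf["aspect_sin"] = np.sin(np.deg2rad(fires_gdf["aspect"]))
fires_gdf["aspect_cos"] = np.cos(np.deg2rad(fires_gdf["aspect"]))
fires_gdf["doy_sin"] = np.sin(2 * np.pi * fires_gdf["discovery_doy"] / 365)
fires_gdf["doy_cos"] = np.cos(2 * np.pi * fires_gdf["discovery_doy"] / 365)
```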

Modelling

Modeling Notebook

I tested several different models, each with and without oversampling. The primary evaluation metric was the F2 score. While the F1 score is the harmonic mean of precision and recall, the F2 score is the weighted variant (Fβ with β = 2) that weights recall more heavily than precision; here, missing a potentially large fire is costlier than a false alarm.

Models were evaluated using 10-fold cross validation and tuned with RandomizedSearchCV (100 permutations). The best performing model (LightGBM) was optimized further via Optuna (200 trials). This model achieves an F2 score of 0.297 and an ROC AUC of 0.691 on test data.
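
As a sketch of how the F2 scorer and randomized search can be wired together with scikit-learn (the search space below is illustrative, not the exact grid used):

```python
from lightgbm import LGBMClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV

# F2 weights recall twice as heavily as precision
f2_scorer = make_scorer(fbeta_score, beta=2)

param_distributions = {  # illustrative search space
    "num_leaves": [15, 31, 63, 127],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [100, 300, 500],
    "min_child_samples": [10, 20, 50],
}

search = RandomizedSearchCV(
    LGBMClassifier(class_weight="balanced"),  # class weighting for the rare positives
    param_distributions,
    n_iter=100,
    scoring=f2_scorer,
    cv=10,
    n_jobs=-1,
)
search.fit(X_train, y_train)  # X_train / y_train prepared as above
```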

The performance of each model is presented in the table below. The columns to the left summarize the mean results of the 10-fold cross validation, and the columns to the right display results when the model was trained on the full cross-validation set and then tested on a single validation set (not the test set, which is reserved for the final selected model).

| Model | F2 | Recall | ROC AUC |
| --- | --- | --- | --- |
| LightGBM (Optuna-tuned) | 0.296 | 0.457 | 0.688 |
| LightGBM | 0.280 | 0.447 | 0.680 |
| XGBoost | 0.260 | 0.550 | 0.699 |
| Random Forest | 0.255 | 0.413 | 0.661 |
| LightGBM w/ SMOTE | 0.248 | 0.476 | 0.674 |
| Random Forest w/ SMOTE | 0.240 | 0.507 | 0.678 |
| XGBoost w/ SMOTE | 0.239 | 0.493 | 0.674 |
| Logistic Regression | 0.232 | 0.748 | 0.731 |
| Dummy Model | 0.025 | 0.025 | 0.500 |

ROC AUC chart

Conclusions

Some ideas for further refinement:

  • The timeframes for the weather features calculated from Google Earth Engine could be revisited. In particular, the timeframe for precipitation (previous year) could be shortened. The optimal timeframe for each variable could be selected by extracting a range of timeframes from Google Earth Engine and then selecting the one that has the strongest correlation with the target variable.
  • Additional features that address human activity/influence could be added, for example distance from paved roads or distance from CALFIRE airports. The categorical vegetation type and ecoregion datasets did not have as strong predictive power as anticipated; they could be replaced or supplemented with more granular quantitative datasets such as the Normalized Difference Vegetation Index (NDVI) for the days preceding each wildfire, canopy density, fuel load, etc.
  • The model may also be suffering from not having enough examples of the positive minority class to train on. This could be addressed by using updated data that extends to 2018 (rather than 2015). This updated dataset was unfortunately released after the feature extraction phase of this project was complete. Another possibility is to extend the start date back from 2005 to 2000, or as far as 1992.
  • The class imbalance could also be addressed by reducing the scope of the model. The model could be limited to months in summer and early fall (when large wildfires actually occur) or to wildfire prone areas, rather than the entire state. This might ensure that the training data is more directly relevant to the desired use case for the model.

Using the model

While this model was evaluated based on a hold-out test set split from a dataset of historic wildfires, the purpose of this model is to make predictions for future wildfires as they occur. The model could be put into production with a front-end interface where the user could indicate the location of a wildfire on an interactive browser-based map, enter the date, and then receive a prediction. For situations where many wildfires are occurring at once, a spatial file (shapefile, geojson, etc.) or table of wildfire locations could be uploaded in order for the model to make batch predictions.

Credits

Thanks to Shmuel Naaman for mentorship and advice on feature engineering/algorithms, and to Diana Edwards for help thinking through relevant explanatory features, appropriate time frames for weather variables, and sources for vegetation data.
