This project uses population health data from the University of Wisconsin Population Health Institute. The goal is to analyze the data and train machine learning models to identify factors that may increase the risk of premature death. Both unsupervised (k-means clustering) and supervised (linear regression, decision tree) techniques are used.
The data file is available at https://www.countyhealthrankings.org/sites/default/files/media/document/analytic_data2022.csv . Documentation of the measures in the data file is in https://www.countyhealthrankings.org/sites/default/files/media/document/2022%20Analytic%20Documentation.pdf
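
As a starting point, the file can be read straight into pandas. This is a minimal sketch, not the project's actual loading code: the helper names `load_analytic_data` and `missing_summary` are hypothetical, and the `skiprows` parameter is there only in case the file carries an extra descriptive header row (an assumption worth checking against the documentation PDF).

```python
import pandas as pd

# URL taken from the paragraph above.
DATA_URL = (
    "https://www.countyhealthrankings.org/sites/default/files/"
    "media/document/analytic_data2022.csv"
)


def load_analytic_data(source=DATA_URL, skiprows=None):
    """Read the analytic CSV into a DataFrame.

    `source` may be a URL, a local path, or a file-like object.
    `skiprows` is passed through to pandas; use it if the file turns out
    to have an extra header row (assumption, verify against the docs).
    """
    return pd.read_csv(source, skiprows=skiprows, low_memory=False)


def missing_summary(df):
    """Return the share of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


# Example usage (requires network access, so it is left commented out):
# df = load_analytic_data()
# print(missing_summary(df).head(20))
```

The missing-value summary is a quick way to decide which columns to drop or impute before modeling.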
- Exploratory data analysis (EDA)
- Identifying and dealing with missing and null values
- Identifying highly correlated variables
- Performing common sense feature removal
- Identifying and dealing with outliers
- Normalizing data
- K-means clustering with the silhouette method for choosing the number of clusters
- Linear regression with feature importance extraction
- Decision tree with feature importance extraction
- Model evaluation and comparison
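
The modeling steps in the list above can be sketched roughly as follows with scikit-learn. This is an illustration under stated assumptions, not the project's actual code: the data is synthetic, the feature names are placeholders, the target is assumed to be continuous (so the tree is a `DecisionTreeRegressor`), and absolute coefficients on standardized features stand in for linear-model "feature importance".

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, silhouette_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the cleaned county table (NOT the real data).
feature_names = ["feature_a", "feature_b", "feature_c"]  # placeholder names
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=300)


def best_k_by_silhouette(data, k_values=range(2, 7), seed=0):
    """Fit k-means for each k and return the k with the highest silhouette score."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(data)
        scores[k] = silhouette_score(data, labels)
    return max(scores, key=scores.get), scores


# Clustering: normalize all features, then pick k by silhouette score.
best_k, silhouette_by_k = best_k_by_silhouette(StandardScaler().fit_transform(X))

# Supervised models: split first, then fit the scaler on the training set only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

linear = LinearRegression().fit(X_train_s, y_train)
tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_train_s, y_train)

# Feature importance: |coefficient| for the linear model (features are
# standardized, so magnitudes are comparable) and impurity-based
# importances for the tree.
linear_importance = dict(zip(feature_names, np.abs(linear.coef_)))
tree_importance = dict(zip(feature_names, tree.feature_importances_))

# Model comparison on the held-out set.
r2 = {
    "linear": r2_score(y_test, linear.predict(X_test_s)),
    "tree": r2_score(y_test, tree.predict(X_test_s)),
}

print("best k:", best_k)
print("linear importance:", linear_importance)
print("tree importance:", tree_importance)
print("R^2:", r2)
```

Fitting the scaler on the training split only keeps information from the test set out of the supervised models, which makes the final comparison fairer.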