Predicting the Human Development Index

Building a supervised random forest machine learning model to predict the Human Development Index (HDI) by utilizing the World Bank's World Development Indicators and UNDP Human Development Data.

Table of Contents

Project Overview
Data Engineering
Exploratory Data Analysis
Machine Learning
- Random Forest Regression
- Random Forest Classification
Discussion
Using Actual Indicators
Conclusion
Acknowledgements

Project Overview

The World Bank has a large database called the World Development Indicators (WDI). According to the World Bank, the WDI are a "compilation of cross-country, relevant, high-quality, and internationally comparable statistics about global development and the fight against poverty." This data is free and open to the public for use. The WDI database contains a vast array of socioeconomic indicators related to population, GDP, education, health, human rights, labor, trade, land use, and so on. The WDI is one of the most significant international databases and contains around 1300 indicators for almost every country in the world, with the earliest indicators starting in 1960 (Van Der Mensbrugghe, 2016).

The United Nations Development Programme (UNDP) collects and stores international data for monitoring and reporting on multiple human development indices, such as poverty, gender equality, sustainability, and so on. This project will focus on predicting the Human Development Index (HDI). According to the UNDP, "the HDI was created to emphasize that people and their capabilities should be the ultimate criteria for assessing the development of a country, not economic growth alone."

The entire project is coded in R and consists of 4 key steps (each in separate R Markdown files):

Data Engineering: Scraping, merging, cleaning, and transforming data.
Exploratory Data Analysis: Analyzing variables for correlation and regression to build final data frame(s).
Prediction with Machine Learning: Using the final variables to build 2 random forest models (regression and classification).
Bonus: Using true indicators to predict HDI.

Data Engineering

View the R Markdown file for this step

Using the WDI API to scrape indicator data

There are two methods for accessing WDI data. The first is to build a report using the World Bank’s web-based graphical user interface (GUI) and download the query results. The second uses an Application Programming Interface (API). The API has been integrated into an R package that simplifies extraction and allows the data to be downloaded and used directly in R. Each indicator has a vector code that is used by the querying and downloading functions within R. There are several ways to find the vector codes for specific indicators or for indicators containing a keyword. In R, the WDIsearch() function returns every indicator that matches a keyword search. There is also a [metadata glossary](https://databank.worldbank.org/metadataglossary/World-Development-Indicators/series) with detailed information and vector codes for all indicators.

The WDI library is installed and loaded like any standard package:

install.packages("WDI")
library(WDI)

The WDI function to access and download data:

# download multiple indicators into one data frame
# (replace the example codes with the vector codes you need)
dataframe = WDI(indicator = c("SP.POP.TOTL", "NY.GDP.PCAP.KD"), country = "all", start = 1990, end = 2018)

# download a single indicator into a data frame
dataframe = WDI(indicator = "SP.POP.TOTL", country = "all", start = 1990, end = 2018)

To download data for this project, I first created individual data frames for each indicator I wanted to analyze. By creating a separate data frame for each indicator, I was able to more easily analyze and update each one as needed throughout the process. I included all countries and selected the years 1990 to 2018 because data in earlier years has more NULL values. The following WDI indicators were downloaded:

Population
GDP per capita (constant 2010 US$)
GDP Per capita income
Population density (people per sq. km of land area)
Greenhouse Gas Emissions (kt)
Total CO2 emissions (kt)
CO2 emissions (metric tons per capita)
PM2.5 air pollution, mean annual exposure (micrograms per cubic meter)
Birth rate, crude (per 1,000 people)
Fertility rate, total (births per woman)
Imports of goods and services (% of GDP)
Exports of goods and services (% of GDP)
Life expectancy at birth, total (years)
Mortality rate, infant (per 1,000 live births)
Mortality rate, under-5 (per 1,000 live births)
Unemployment, total (% of total labor force) (modeled ILO estimate)
Adjusted net enrolment rate, lower secondary
Adjusted net enrolment rate, primary
Adjusted net enrolment rate, upper secondary
Adult literacy rate, population 15+ years, both sexes (%)
Initial government funding of education as a percentage of GDP (%)
Expected Years Of School
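The one-data-frame-per-indicator approach described above can be sketched roughly as follows. This is only an illustration: `download_indicator` is a hypothetical helper, not a function from the WDI package, and only a few of the indicators are shown. The vector codes are the World Bank's published codes for those series; the column names follow the ones used later in this project.

```r
# Named vector: names are the column names used in this project,
# values are World Bank vector codes (only a few indicators shown).
indicator_codes <- c(
  population = "SP.POP.TOTL",
  gdp.pc     = "NY.GDP.PCAP.KD",
  birth.rate = "SP.DYN.CBRT.IN",
  life.exp   = "SP.DYN.LE00.IN"
)

# Hypothetical helper: download one indicator for all countries and
# rename its value column from the vector code to a readable name.
download_indicator <- function(code, name, start = 1990, end = 2018) {
  df <- WDI::WDI(indicator = code, country = "all", start = start, end = end)
  names(df)[names(df) == code] <- name
  df
}

# One data frame per indicator, collected in a named list.
# Needs the WDI package and a network connection, so it is left commented out:
# indicator_frames <- Map(download_indicator, indicator_codes, names(indicator_codes))
```

Renaming by column name rather than by position keeps the helper independent of the column order the API happens to return.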

Cleaning and joining WDI data

The API function only downloads each indicator with the region, country code, and indicator vector code. To prepare and clean the data, I renamed the indicator column to a recognizable name. The country details are joined in at a later step.

# example
names(population)[3]="population"

I then calculated the percentage of NULL values for each indicator to determine any that would be eliminated due to sparse data.

# example: percentage of missing values in the population indicator
print("population")
mean(is.na(population$population)) * 100

This resulted in the following:

Indicator	% NULL
population	0.61%
gdp.pc	9.8%
gdp.pc.income	12.2%
pop.density	1.8%
greenhouse.gas	31.0%
co2	13.4%
co2.pc	13.5%
pollution.expose	62.4%
birth.rate	5.2%
fertility.rate	7.0%
imports.gs	16.4%
exports.gs	16.4%
life.exp	7.1%
infant.mort.rate	9.5%
under5.mort.rate	9.5%
unemployment	14.8%
edu.lower	69.2%
edu.primary	52.2%
edu.upper	83.8%
literacy	74.6%
edu.funding	64.5%
edu.years	77.0%

I decided to exclude any indicators with more than 15% NULL values. Unfortunately, this meant I was left without any education indicators. Nonetheless, I joined the individual data frames with less than 15% NULL values to create a single data frame called WDI.key. I then joined this data frame to the country details in WDI in case I wanted or needed to analyze at various levels in the future. This is the resulting data frame structure.
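As a rough sketch of the filtering and joining step, the missing-value percentage can be computed for every indicator at once and used to decide which data frames to join. The three toy data frames and their values below are made up for illustration, and the join keys (`iso2c`, `year`) are an assumption about the column names.

```r
# Toy stand-ins for the per-indicator data frames (values are made up).
population <- data.frame(iso2c = c("AA", "BB", "CC", "DD"), year = 2000,
                         population = c(10, 20, 30, 40))
literacy   <- data.frame(iso2c = c("AA", "BB", "CC", "DD"), year = 2000,
                         literacy = c(NA, NA, 90, NA))
life.exp   <- data.frame(iso2c = c("AA", "BB", "CC", "DD"), year = 2000,
                         life.exp = c(70, 65, 80, 75))
indicators <- list(population = population, literacy = literacy,
                   life.exp = life.exp)

# Percentage of missing values in each indicator column.
pct_missing <- sapply(names(indicators), function(nm) {
  mean(is.na(indicators[[nm]][[nm]])) * 100
})

# Keep indicators with no more than 15% missing, then join them
# one after another on the shared key columns.
keep <- names(pct_missing)[pct_missing <= 15]
WDI.key <- Reduce(function(a, b) merge(a, b, by = c("iso2c", "year"), all = TRUE),
                  indicators[keep])

print(round(pct_missing, 1))
```

Here the toy `literacy` indicator is dropped because most of its values are missing, and the remaining two are merged into a single data frame.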

Adding UNDP Data

UNDP Human Development Data can easily be downloaded as csv files at http://hdr.undp.org/en/data. I downloaded the files and cleaned up country names to match the WDI names using Excel before importing into R. It is possible to do this in R, but I felt Excel was more efficient. The UNDP data being joined to the WDI data includes:

GNI Per Capita
Human Development Index (HDI)
Education Index
Income Index

After a bit of clean up, joining the UNDP data to the WDI.key data frame, and validation, this is the resulting final key.ind data frame of the Data Engineering phase that will be used for exploratory analysis.
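For readers who prefer to stay in R for the country-name cleanup, one possible approach is a lookup vector that maps UNDP spellings to WDI spellings before the join. The tables, spellings, and values below are illustrative only and are not taken from the project's data.

```r
# Toy UNDP and WDI tables; names and values are illustrative only.
undp <- data.frame(
  country = c("Viet Nam", "Bolivia (Plurinational State of)", "Chile"),
  year = 2015, hdi = c(0.68, 0.69, 0.84), stringsAsFactors = FALSE
)
wdi <- data.frame(
  country = c("Vietnam", "Bolivia", "Chile"),
  year = 2015, life.exp = c(75, 70, 79), stringsAsFactors = FALSE
)

# Lookup vector: UNDP spelling -> WDI spelling (extend as mismatches appear).
name_map <- c("Viet Nam" = "Vietnam",
              "Bolivia (Plurinational State of)" = "Bolivia")

# Recode only the names found in the lookup; leave the rest unchanged.
needs_fix <- undp$country %in% names(name_map)
undp$country[needs_fix] <- unname(name_map[undp$country[needs_fix]])

# Names still unmatched after recoding (ideally empty before joining).
unmatched <- setdiff(undp$country, wdi$country)

key.ind <- merge(wdi, undp, by = c("country", "year"))
```

Checking `unmatched` before the merge makes silent row loss from an inner join easier to spot.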

Exploratory Data Analysis

View the R Markdown file for this step

Correlation Matrix

To begin the analysis, I removed any rows with NULL values and all non-numerical columns from the key.ind data frame in order to create a correlation matrix. This matrix allowed me to identify the variables that are highly correlated with the Human Development Index (HDI). For the correlation matrix, I used the corrplot and RColorBrewer packages.

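library(corrplot)
library(RColorBrewer)

# key.corr holds only the numeric columns of key.ind, with incomplete rows removed
Matrix <- cor(key.corr)
corrplot(Matrix, type="upper", order="hclust", method="pie",
         col=brewer.pal(n=8, name="RdYlBu"))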
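The preparation of `key.corr` is not shown above, so here is a minimal sketch of one way to do it in base R. The toy `key.ind` reuses column names from this project, but its values are made up.

```r
# Toy version of key.ind; values are made up for illustration.
key.ind <- data.frame(
  country    = c("A", "B", "C", "D", "E"),
  hdi        = c(0.45, 0.60, 0.72, NA, 0.91),
  life.exp   = c(55, 63, 71, 74, 82),
  birth.rate = c(40, 30, 21, 18, 10),
  stringsAsFactors = FALSE
)

# Keep numeric columns only, then drop rows with any missing value.
key.corr <- na.omit(key.ind[sapply(key.ind, is.numeric)])

# Correlation matrix, plus each variable's correlation with HDI.
Matrix <- cor(key.corr)
hdi.cor <- Matrix[, "hdi"]
print(round(hdi.cor, 2))
```

Reading off the `hdi` column of the matrix gives the same information as the pie chart row for HDI, in numeric form.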

The strength of the correlation is indicated by the pies. Blue indicates a positive correlation and red indicates a negative correlation. It is easy to see which variables correlate strongly with HDI, and I have outlined each of them. Using only these variables, I then took a deeper look at the regression. I created a data frame, predict.hdi, to further narrow down the data that will be used for building a prediction model. Looking at a matrix of scatterplots, each selected variable has a clear relationship with HDI.

Individually, each variable shows a strong linear relationship with HDI and a low p-value. The exception is GDP per capita, which follows more of an exponential trend and has a noticeably lower R-squared. For the final model, I explored outliers and ultimately chose to include GDP per capita because, while not the only factor, it is a key economic development indicator.

Birth Rate and HDI

Residual standard error: 0.07639 on 4676 degrees of freedom

Multiple R-squared: 0.7881, Adjusted R-squared: 0.7881

F-statistic: 1.739e+04 on 1 and 4676 DF, p-value: < 2.2e-16

Education Index and HDI

Residual standard error: 0.05243 on 4676 degrees of freedom

Multiple R-squared: 0.9002, Adjusted R-squared: 0.9002

F-statistic: 4.217e+04 on 1 and 4676 DF, p-value: < 2.2e-16

GDP Per Capita and HDI

Residual standard error: 0.1197 on 4676 degrees of freedom

Multiple R-squared: 0.48, Adjusted R-squared: 0.4798

F-statistic: 4316 on 1 and 4676 DF, p-value: < 2.2e-16

Infant Mortality Rate and HDI

Residual standard error: 0.07095 on 4676 degrees of freedom

Multiple R-squared: 0.8172, Adjusted R-squared: 0.8172

F-statistic: 2.091e+04 on 1 and 4676 DF, p-value: < 2.2e-16

Life Expectancy and HDI

Residual standard error: 0.06848 on 4676 degrees of freedom

Multiple R-squared: 0.8297, Adjusted R-squared: 0.8297

F-statistic: 2.279e+04 on 1 and 4676 DF, p-value: < 2.2e-16
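Summaries like the ones above are the kind of output `summary()` gives for a single-predictor `lm()` fit. The sketch below uses simulated data rather than the project's data frame, so the numbers it produces will not match those reported here.

```r
# Simulated data standing in for predict.hdi (not real values).
set.seed(1)
birth.rate <- runif(200, min = 8, max = 50)
hdi <- 1.0 - 0.012 * birth.rate + rnorm(200, sd = 0.05)
toy <- data.frame(hdi = hdi, birth.rate = birth.rate)

# Simple linear regression of HDI on one predictor.
fit <- lm(hdi ~ birth.rate, data = toy)
fit.summary <- summary(fit)

# The quantities reported for each variable:
fit.summary$sigma          # residual standard error
fit.summary$r.squared      # multiple R-squared
fit.summary$adj.r.squared  # adjusted R-squared
fit.summary$fstatistic     # F-statistic with its degrees of freedom
```

Repeating the same fit with each predictor in turn gives one summary per variable.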

Machine Learning Prediction Models

View the R Markdown file for this step

Random Forest Regression

The predict.hdi data frame has been cleaned and validated for regression. Using this final data frame that resulted from steps 1 and 2, I decided to test a random forest prediction model. To begin, I split the data into 2 partitions using the caret package. I chose to partition 90% for training and 10% for testing because I wanted to have as much data to train as possible, though standard partitioning is often around 80/20.

library(caret)  # createDataPartition()
library(dplyr)  # provides the %>% pipe

set.seed(123)
hdi.samples <- predict.hdi$hdi %>%
	createDataPartition(p = 0.9, list = FALSE)
train.hdi  <- predict.hdi[hdi.samples, ]
test.hdi <- predict.hdi[-hdi.samples, ]
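For comparison, a plain random 90/10 split can also be written in base R. Note that this is a swapped-in technique: it samples rows uniformly, whereas createDataPartition() balances the split across the distribution of the outcome. The data frame here is simulated.

```r
# Simulated data frame standing in for predict.hdi.
set.seed(123)
predict.hdi <- data.frame(hdi = runif(100), life.exp = runif(100, 50, 85))

# Draw 90% of the row indices at random for training.
n <- nrow(predict.hdi)
train.rows <- sample(n, size = floor(0.9 * n))

train.hdi <- predict.hdi[train.rows, ]
test.hdi  <- predict.hdi[-train.rows, ]
```

With a continuous outcome and several thousand rows the two approaches usually give similar splits, but the stratified version is the safer default for small data sets.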

Using the randomForest package, I fit a basic random forest regression model with 500 trees and an mtry of 3. I then plotted the error versus the number of trees.

library(randomForest)

hdi.rf.1 <- randomForest(hdi ~ ., data = train.hdi, ntree=500, mtry = 3, 
	importance = TRUE, na.action = na.omit) 
print(hdi.rf.1) 
plot(hdi.rf.1)

After tuning and testing for out-of-bag (OOB) error improvement and looking at the importance of each variable, I determined the original model was still the best fit, with a root-mean-square error of 0.0087 and an explained variance of 99.76%, both of which indicate a very close fit. Moving forward with this model, I made predictions on the test data, converted the predictions to a data frame, and merged them with the original test data to see a side-by-side comparison. This sample shows just how close the prediction model gets to the actual Human Development Index based on the variables used in the random forest training.

The mean difference between the predicted and the actual HDI is -0.0051, which is very impressive given the variance in each variable's data. I created a plot to visualize the prediction error across the entire test data. The model seems to predict higher indices slightly better, but only by a nominal amount.
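The side-by-side comparison and the two summary numbers can be sketched as below. The actual and predicted values are hypothetical; in the project they would come from the test data and from calling predict() on the fitted model. Taking the difference as predicted minus actual is an assumption about the sign convention.

```r
# Hypothetical actual and predicted HDI values for a few test rows.
actual    <- c(0.512, 0.634, 0.701, 0.788, 0.905)
predicted <- c(0.520, 0.630, 0.709, 0.781, 0.899)

# Side-by-side comparison with the signed difference per row.
comparison <- data.frame(actual = actual, predicted = predicted,
                         difference = predicted - actual)

# Mean signed difference: do predictions run high or low on average?
mean.diff <- mean(comparison$difference)

# Root-mean-square error: the typical size of a prediction error.
rmse <- sqrt(mean(comparison$difference^2))
```

The mean difference can sit near zero even when individual errors are larger, because positive and negative errors cancel; the RMSE does not have that problem, which is why both are worth reporting.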

Random Forest Classification

The random forest regression had surprisingly strong results, but I decided to also test classification since this is another common use for random forest prediction. I created 3 categories for HDI (Low, Mid, High), converted this column to a factor with 3 levels, and then created an 80/20 partition using caTools, which is another package for creating partitions. I then fit the model with 500 trees and an mtry of 2.

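library(caTools)       # sample.split()
library(randomForest)

# Label each row Low (< 0.650), High (> 0.850), or Mid (everything in between)
predict.hdi.2$hdi.cat[predict.hdi.2$hdi < .650 ] = "Low"
predict.hdi.2$hdi.cat[predict.hdi.2$hdi > .850 ] = "High"
predict.hdi.2$hdi.cat[is.na(predict.hdi.2$hdi.cat)] <- "Mid"

predict.hdi.2$hdi.cat = factor(predict.hdi.2$hdi.cat, levels=c("Low", "Mid", "High"))

# 80/20 split that preserves the share of each category
set.seed(123)
split = sample.split(predict.hdi.2$hdi.cat, SplitRatio = 0.80)
hdi.training.set = subset(predict.hdi.2, split == TRUE)
hdi.test.set = subset(predict.hdi.2, split == FALSE)

# The first 5 columns are the predictors. randomForest() has no random_state
# argument; reproducibility comes from set.seed() above.
hdi.rfc = randomForest(x = hdi.training.set[1:5],
y = hdi.training.set$hdi.cat,
ntree = 500, mtry = 2)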

The model returned an OOB error rate estimate of 1.84%. A confusion matrix shows just how well the classification model performed on the test data, with an error rate of 1.497%.
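A confusion matrix and its error rate can be computed in base R as sketched here. The class labels below are hypothetical; in the project they would be the test set's hdi.cat column and the model's predicted classes.

```r
# Hypothetical actual and predicted classes for a few test rows.
levels.hdi <- c("Low", "Mid", "High")
actual <- factor(c("Low", "Low", "Mid", "Mid", "Mid", "High", "High", "Low"),
                 levels = levels.hdi)
predicted <- factor(c("Low", "Low", "Mid", "High", "Mid", "High", "High", "Mid"),
                    levels = levels.hdi)

# Rows are the actual classes, columns the predicted classes.
conf.matrix <- table(actual = actual, predicted = predicted)

# Error rate: share of observations that fall off the diagonal.
error.rate <- 1 - sum(diag(conf.matrix)) / sum(conf.matrix)
error.pct <- error.rate * 100

print(conf.matrix)
```

Fixing the factor levels on both vectors keeps the table square even if a class happens to be missing from the predictions.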

Discussion

This endeavor offered a basic look into engineering data for exploratory analysis and predicting variables with random forest. Both models predicted with high accuracy despite some of the limitations and challenges inherent in the available data. Interestingly, after achieving these results, I was able to find the actual metrics used by the UNDP to determine HDI. My initial exploration of the data started out vast and was narrowed down after hours and hours of painstaking analysis and testing. While this process was not included in the final snapshot, the first 2 steps are the result of deciding on the direction of the project based on this initial exploration (to predict HDI with highly correlated variables). Working backwards, my final model included many of the actual (or similar) indicators used by the UNDP to determine the HDI.

In my original model, I ultimately used the UNDP Education Index because the education indicators in WDI were too sparse to justify use in this application. Using this index helped immensely in predicting accurately, and it is what sparked my interest in seeing how the HDI is actually determined. This curiosity led me to re-try the model on the actual indicators used by the UNDP to determine HDI. The final section below does just that.

Using Actual Indicators

View the R Markdown file for this step

This final section takes the actual indicators used by the UNDP to predict HDI based on the aggregated datasets available.

The 4 indicators that make up the Human Development Index:

Life Expectancy at Birth
GNI per capita (constant 2010 US$)
Expected Years of Schooling
Mean Years of Schooling
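For context, the sketch below shows how these four indicators are combined into the index, following the goalposts in the UNDP's published technical notes (post-2010 methodology) as best understood here. The function name and argument names are hypothetical, and the formula assumes GNI per capita is expressed in the units the UNDP goalposts use (PPP dollars), which may differ from the series listed above.

```r
# Hypothetical helper: HDI as the geometric mean of three dimension indices.
hdi_from_indicators <- function(life.exp, exp.school, mean.school, gni.pc) {
  # Health: life expectancy scaled between 20 and 85 years.
  health <- (life.exp - 20) / (85 - 20)

  # Education: average of the two schooling indices (caps of 18 and 15 years).
  education <- (pmin(exp.school, 18) / 18 + pmin(mean.school, 15) / 15) / 2

  # Income: log of GNI per capita scaled between 100 and 75,000.
  income <- (log(pmin(gni.pc, 75000)) - log(100)) / (log(75000) - log(100))

  (health * education * income)^(1 / 3)
}

# Usage with made-up values for one country-year.
hdi_from_indicators(life.exp = 72, exp.school = 13, mean.school = 8, gni.pc = 12000)
```

Because the index is a deterministic function of these inputs, a model trained on them has far less to learn than one trained on correlated proxies.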

Create the Data Frame

After importing the indicators from .csv files and merging and cleaning the data, these are the first few rows of the final data frame for the new model:

Build the Model

Fit the new data to the same model.

# Split data into 90% for training and 10% for testing
set.seed(123)
hdi.samples <- predict.hdi$hdi %>%
  createDataPartition(p = 0.9, list = FALSE)
train.hdi  <- predict.hdi[hdi.samples, ]
test.hdi <- predict.hdi[-hdi.samples, ]
# Reset row index on test data (row.names)
row.names(test.hdi) <- NULL
# random forest for regression with 500 trees and mtry of 3
hdi.rf <- randomForest(hdi ~ ., data = train.hdi, ntree=500, mtry = 3, 
importance = TRUE, na.action = na.omit) 
print(hdi.rf) 
# Plot the error vs the number of trees graph 
plot(hdi.rf)

Results

Conclusion

The actual indicators predict even better, which is no surprise. What does feel like an accomplishment is how closely the original model also predicts the HDI. As stated in the project overview, the Human Development Index is meant to emphasize that people should be the ultimate criteria for assessing development, rather than economic growth alone. My assumption is that the original regression-based predictions come so close to those from the actual indicators because of the inherent relationships between the variables themselves (GNI and GDP, birth rate, infant mortality, and life expectancy). The interconnected nature of global development offers insight into which factors might help us continue to reduce poverty across multiple dimensions: economic, human, environmental, and so on.

Acknowledgements

Data

R Packages Utilized

References

Van der Mensbrugghe, Dominique. (2016). Using R to Extract Data from the World Bank's World Development Indicators. Journal of Global Economic Analysis, 1, 251-283. doi:10.21642/JGEA.010105AF

UNDP. [Human Development Index (HDI)](http://hdr.undp.org/en/content/human-development-index-hdi).
