Sebastian1981/CustomerAnalytics_CreditDefaultPrediction

Credit Default Prediction App

Project Overview

The purpose of this AI project was to build an application that predicts the risk that individuals will default on their loan repayments. It is based on freely available data from the web. First, the feasibility of answering this question with machine learning algorithms was examined; this work is in the Jupyter notebooks. A web application was then implemented that lets the user visualize the underlying data through simple menu navigation and then train and evaluate machine learning models. In addition, an approach from game theory was implemented to make model decisions transparent by visualizing the so-called Shapley values.

Set Up the Environment to Run the Jupyter Notebooks and the App

  • $ conda create -n myenv python=3.8.13
  • $ conda activate myenv
  • $ pip install -r requirements.txt

App Overview

The app is structured the way machine learning projects are commonly organized. The user can choose between sections such as exploratory data analysis (EDA), machine learning (ML), model evaluation (Eval) and model explainability (Explain).

app overview

The data is uploaded and analyzed in the EDA section. The machine learning model is selected and trained in the ML section. New data is scored here and can be downloaded.

app overview

The model's performance metrics are evaluated in the Eval section. The model's decisions are analyzed in the Explain section.

app overview

Exploratory Data Analysis

The distribution of the target variable shows a class imbalance, with around 20% of customers defaulting on their loan repayments. This class imbalance was taken into account during modeling in order to achieve the best possible model performance.

target distribution
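
The README does not say how the class imbalance was handled. One common option, shown here only as a minimal sketch, is to weight each class inversely to its frequency, which is the heuristic behind scikit-learn's `class_weight="balanced"`. The target column `y` and the helper name `balanced_class_weights` are hypothetical and only mirror the roughly 20% default rate described above.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical target column: 1 = default, 0 = repaid (20% defaulters).
y = [1] * 200 + [0] * 800

weights = balanced_class_weights(y)
print(weights)
```

With these assumed counts, the minority class (defaulters) gets a weight of 2.5 and the majority class a weight of 0.625, so misclassified defaulters count four times as much during training.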

The web application lets you easily select particular features and display their distributions, as shown below. For example, the boxplots for the income variable show that the risk of credit default is higher for lower incomes, as expected.

numeric feature distribution

The figure below shows the credit default distribution stratified by gender.

categorical feature distribution

Model Evaluation

Once the model has been trained, the web application lets the user explore the overall performance of the model in classifying credit repayment defaulters versus non-defaulters. In this particular case, the performance on the training set is similar to the performance on the test set, indicating that there is barely any overfitting. More specifically, the overall metrics, e.g. an accuracy score in the range of 80% and the area under the ROC curve (ROC-AUC), indicate robust classification. However, a closer look at the recall and precision scores shows that only 60% of the true credit defaulters are correctly classified as such by the model. Likewise, a precision of around 70% means that a considerable share of the customers flagged as defaulters are in fact non-defaulters. In this particular case, the training data was sufficient. Hence, a promising way to increase model performance, with a special focus on recall and precision, would be to put more effort into both collecting more features and engineering more features.

eval metrics

confusion metrics
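
To make the relationship between these metrics concrete, the following sketch derives accuracy, precision and recall from confusion-matrix counts. The counts are hypothetical and chosen only to land near the figures quoted above (20% defaulters, about 60% recall, about 70% precision); they are not taken from the app.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical test set: 1000 customers, 200 of them true defaulters.
metrics = classification_metrics(tp=120, fp=51, fn=80, tn=749)

# Trivial baseline that never predicts a default.
baseline = classification_metrics(tp=0, fp=0, fn=200, tn=800)

print(metrics)
print(baseline)
```

Under these assumptions, the baseline that never flags anyone already reaches 80% accuracy with zero recall, which is why recall and precision are the more informative metrics on an imbalanced target.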

The receiver operating characteristic (ROC) curve is interesting in that the false-positive rate remains at 0% until the true-positive rate (recall) reaches around 50%. Higher recall can only be achieved at the cost of an increasing false-positive rate.

roc

The precision-recall curve confirms that a recall of 50% comes for free, in the sense that the precision starts decreasing only when the recall rises above 50%. An efficient operating point would be to choose a recall of around 80%, allowing the precision to drop to around 80%. However, optimizing the trade-off between recall and precision requires profound business knowledge of the costs of false-positive and false-negative predictions.

precision recall
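
One way to turn such business knowledge into an operating point is to assign a cost to each error type and pick the decision threshold with the lowest total cost. The sketch below illustrates that idea only; the scores, labels, cost values and function names are all hypothetical and not part of the app.

```python
def confusion_at(scores, labels, threshold):
    """Count TP, FP, FN, TN when predicting default for score >= threshold."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def cheapest_threshold(scores, labels, cost_fp, cost_fn):
    """Return (threshold, total_cost) minimizing cost_fp * FP + cost_fn * FN."""
    best = None
    for threshold in sorted(set(scores)):
        _, fp, fn, _ = confusion_at(scores, labels, threshold)
        cost = cost_fp * fp + cost_fn * fn
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best


# Hypothetical predicted default probabilities and true labels (1 = default).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

# Missed defaulters are expensive: favors a low threshold (high recall).
recall_first = cheapest_threshold(scores, labels, cost_fp=1, cost_fn=5)
# Wrongly rejected customers are expensive: favors a high threshold (high precision).
precision_first = cheapest_threshold(scores, labels, cost_fp=5, cost_fn=1)

print(recall_first, precision_first)
```

On this toy data, swapping the two cost values moves the chosen threshold from 0.30 to 0.80, which shows how strongly the preferred point on the precision-recall curve depends on the assumed error costs.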

Model Explainability

The web application allows the user to easily visualize the overall feature importance, as shown below. The most important feature overall is income, which explains on average 13% of the credit default risk.

shap bar plot

A closer look at the dependence of the credit default risk on income reveals that lower incomes correlate with higher credit default risk, which is consistent with the finding above (see the EDA section).

shap bar plot

Using a game-theoretic approach also allows us to explore the machine learning model's decision-making process for a single customer, as shown below. In this particular case, the model predicted a 100% credit default risk. The most important factors were the type of credit, the income and the loan amount. All three factors pushed the predicted credit default risk significantly higher.

shap bar plot
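
To illustrate what a Shapley value is, the sketch below computes exact Shapley values for a toy risk function with three features by averaging each feature's marginal contribution over all coalitions of the other features. The toy model, its numbers and the feature names are hypothetical; the app's actual model and its explanation method are not reproduced here, and practical tools usually approximate these values rather than enumerating every coalition.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, value):
    """Exact Shapley values; value(coalition) maps a frozenset to a model output."""
    n = len(features)
    phi = {}
    for feature in features:
        others = [f for f in features if f != feature]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size: |S|! (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                total += weight * (value(coalition | {feature}) - value(coalition))
        phi[feature] = total
    return phi


def toy_risk(coalition):
    """Hypothetical default risk given which features are 'known' to the model."""
    risk = 0.20  # baseline risk, matching the ~20% default rate
    if "credit_type" in coalition:
        risk += 0.30
    if "income" in coalition:
        risk += 0.25
    if "loan_amount" in coalition:
        risk += 0.15
    if "income" in coalition and "loan_amount" in coalition:
        risk += 0.10  # interaction: low income combined with a large loan
    return risk


features = ["credit_type", "income", "loan_amount"]
phi = shapley_values(features, toy_risk)
print(phi)
```

The contributions add up to the difference between the full prediction (1.00 in this toy case) and the baseline (0.20), and the interaction term is split evenly between income and loan amount.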