
Note

If you encounter errors rendering the Jupyter Notebooks on GitHub, please use the links below:
1. EDA
2. Predictive Modeling
3. Model Comparisons using PyCaret

Telco-Churn-Analysis

1. Introduction

In this repository, I will conduct churn analysis using Python and Plotly for interactive data visualization. The analysis will include examining the correlation of all features with the target variable 'Churn,' assessing the composition of categorical features relative to churn, and evaluating the distribution of numerical features relative to churn. Furthermore, I will perform statistical analysis and predictive modeling using logistic regression and XGBoost algorithms.

2. Data Understanding

The dataset can be downloaded from the following link: telco-customer-churn.

3. Business Goals

Churn analysis is a technique businesses use to understand why customers stop using their products or services, which is often referred to as "churn." Its primary goal is to identify the patterns and reasons behind customer attrition so that proactive measures can be taken to reduce it. The objectives below outline the key aspects of this analysis.

4. Objectives

  1. Identify which features are most strongly correlated with churn: understand what drives customers to leave.
  2. Predict how likely a customer is to churn in the future: informs the business which customers should get more attention.
  3. Analyze the impact of customer demographics on churn: identify demographic trends and their influence on customer attrition.

5. Methodology

  1. Data preparation and cleaning.
  2. Feature Encoding
    • Conduct binary encoding for nominal data that consists of only two unique values.
    • Conduct target encoding for ordinal data that consists of more than two unique values.
  3. Conduct chi-squared (chi²) tests of each feature against the target to identify significant associations (see the sketch after this list).
  4. Build predictive models using Logistic Regression and XGBoost algorithms.
  5. Assess model performance through various evaluation metrics: classification report, confusion matrix, TPR-FPR at every threshold, ROC curves, and the ROC area under the curve (AUC).
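A minimal sketch of step 3, assuming the dataset has been loaded into a pandas DataFrame with a 'Churn' column (the CSV file name and column names are assumptions based on the public Telco dataset):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Assumed file name for the public Telco customer churn dataset
df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")

# Test each categorical feature against the target 'Churn'
results = []
for col in df.select_dtypes(include="object").columns.drop(["customerID", "Churn"]):
    contingency = pd.crosstab(df[col], df["Churn"])       # feature values vs. churn outcome
    chi2, p_value, dof, _ = chi2_contingency(contingency)
    results.append({"feature": col, "chi2": chi2, "p_value": p_value})

# Rank features by chi-squared statistic (strongest association first)
print(pd.DataFrame(results).sort_values("chi2", ascending=False))
```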

6. Results and Analysis

Churn Compositions

(figure: churn compositions)

Features Correlation Against Churn

The bar plot below shows how strongly each feature correlates with customer churn behaviour.

(figure: corr_churn_features)

Grouping the features makes it easier to compare churn among the unique values within each feature.

(figure: corr_churn_features_grouped)

Comparison Across All Categorical Features in Relation to Churn

The bar plot below lets us compare each value across all categorical features in a single view.

(figure: compairson_across_categorical_features)
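As an illustration (not the exact notebook code), a grouped bar chart of this kind can be produced with Plotly Express; the column names are taken from the standard Telco dataset:

```python
import pandas as pd
import plotly.express as px

df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")  # assumed file name

# Grouped bars: churn vs. non-churn counts for each value of a categorical feature
fig = px.histogram(df, x="Contract", color="Churn", barmode="group",
                   title="Churn by Contract Type")
fig.show()
```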

Churn Comparison Within Unique Values of Each Feature

  • Each feature underwent chi-squared testing to compare churn across its unique values.
  • The subplots are ordered by decreasing chi-squared value.
  • This makes it easy to spot, within each feature, the values whose churn behaviour differs significantly from the others.

Demographics Features Values Comparison by Churn

(figure: categorical_features_demographics_by_churn)

  • As you can see above, only the 'Gender' feature does not have a significant p-value.
  • Customers without dependents are more likely to churn.
  • Senior citizens are more likely to churn.
  • Customers without partners are more likely to churn.

Payments Features Values Comparison by Churn

(figure: categorical_features_payments_by_churn)

  • Customers on month-to-month contracts are more likely to churn.
  • Customers who pay by electronic check are more likely to churn.
  • Customers using paperless billing are more likely to churn.

Services Features Values Comparison by Churn

(figure: categorical_features_services_by_churn)

  • Customers without the additional online security service are more likely to churn.
  • Customers without the additional tech support service are more likely to churn.
  • Customers with fiber-optic internet service are more likely to churn.
  • Customers without the additional online backup service are more likely to churn.
  • Customers without the additional device protection service are more likely to churn.
  • Customers who don't use their internet service to stream movies are more likely to churn.
  • Customers who don't use their internet service to stream TV are more likely to churn.
  • Customers with multiple telephone lines are more likely to churn.
  • Overall, customers without an internet service tend to be loyal.

Churn Distributions in each Numerical Feature

The Mann-Whitney U test helps determine whether the distribution of each numerical feature differs significantly between churned and non-churned customers.

(figure: numerical_distributions_against_churn)

Summary

  • Tenure: Customers with longer tenure are less likely to churn. This is evidenced by the higher tenure values for non-churned customers and a significant Mann-Whitney U test result.
  • MonthlyCharges: Higher monthly charges are associated with a higher likelihood of churn. This is seen from the higher monthly charges for churned customers and a significant Mann-Whitney U test result.
  • TotalCharges: Higher total charges are linked with non-churned customers, suggesting that customers who stay longer and hence pay more over time are less likely to churn.

Overall, the Mann-Whitney U tests confirm significant differences in the distributions of these features between churned and non-churned customers, providing valuable insights for understanding and predicting customer churn.
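A minimal sketch of such a test for a single feature, reusing the DataFrame `df` loaded in the earlier snippet (column names assumed):

```python
from scipy.stats import mannwhitneyu

# Split a numerical feature by churn outcome
tenure_churned = df.loc[df["Churn"] == "Yes", "tenure"]
tenure_retained = df.loc[df["Churn"] == "No", "tenure"]

# Two-sided Mann-Whitney U test: do the two distributions differ significantly?
stat, p_value = mannwhitneyu(tenure_churned, tenure_retained, alternative="two-sided")
print(f"U = {stat:.1f}, p-value = {p_value:.4g}")
```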

Data and Model Characteristics

  • Data Cleaning Steps
    • Outlier Imputation:
      • Outliers are imputed by grouping the data based on churn and no-churn values.
  • Data Processing Steps
    • Label Encoding:
      • Manually encode binary categorical features using label encoding.
    • One-Hot Encoding:
      • Apply one-hot encoding to categorical features with more than two unique values, dropping the first category to avoid multicollinearity.
    • Numerical Feature Transformation:
      • Transform numerical features using the Power Transformer with the 'yeo-johnson' method to stabilize variance and make the data more Gaussian-like.
  • Model
    • Handling Imbalanced Data:
      • Use the SMOTE (Synthetic Minority Over-sampling Technique) to balance the target classes.
    • Model Specification:
      • Use the XGBoost Classifier with the following settings (see the sketch after this list):
        • eval_metric='aucpr' (Area Under the Precision-Recall Curve)
        • max_depth=5
        • max_leaves=5
  • XGBoost Classifier Scores:
    • train score: 0.8414
    • test score: 0.7785
    • cross-val mean: 0.7878
    • ROC-AUC: 0.8612
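A hedged sketch of this processing and modeling setup, under the same assumptions as the earlier snippets (the exact notebook code may differ):

```python
import pandas as pd
from sklearn.preprocessing import PowerTransformer
from sklearn.model_selection import train_test_split, cross_val_score
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")  # assumed file name

# TotalCharges is stored as text in the raw CSV; coerce it to numeric
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce").fillna(0)

# Encode categorical features (one-hot, first category dropped) and split off the target
X = pd.get_dummies(df.drop(columns=["customerID", "Churn"]), drop_first=True, dtype=int)
y = (df["Churn"] == "Yes").astype(int)

# Yeo-Johnson transform to stabilize variance in the numerical features
numeric_cols = ["tenure", "MonthlyCharges", "TotalCharges"]
X[numeric_cols] = PowerTransformer(method="yeo-johnson").fit_transform(X[numeric_cols])

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Balance only the training data with SMOTE, then fit XGBoost with the listed settings
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
model = XGBClassifier(eval_metric="aucpr", max_depth=5, max_leaves=5)
model.fit(X_res, y_res)

print("train score:", model.score(X_res, y_res))
print("test score:", model.score(X_test, y_test))
print("cross-val mean:", cross_val_score(model, X_res, y_res, cv=5).mean())
```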

Model Evaluation

Classification Report

(figure: classification report)

The classification report indicates that our model:

  • Overall Accuracy: Achieves an accuracy of 78%.
  • Performance on Non-Churn Customers:
    • F1-score: 84%.
    • Precision: 90% (higher than recall), indicating that the model tends to assume customers are loyal.
    • Recall: 78%, which, together with the higher precision, reflects the class imbalance in the dataset, where non-churn cases are more prevalent.
  • Performance on Churn Customers:
    • F1-score: 65%, indicating weaker performance.
    • Precision: 56% (lower than recall), meaning the model flags churn liberally and accepts a fair number of false positives.
    • Recall: 77%, so the model catches most actual churn cases, at the cost of lower precision.

Overall, the model performs well on non-churn customers and reasonably well on churn customers, where recall is favoured over precision.
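Continuing the sketch above (reusing `model`, `X_test`, and `y_test`), the report and the confusion matrix in the next section can be produced with scikit-learn:

```python
from sklearn.metrics import classification_report, confusion_matrix

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=["No churn", "Churn"]))
print(confusion_matrix(y_test, y_pred))  # rows: actual class, columns: predicted class
```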

Confusion Matrix

(figure: confusion matrix)

Our confusion matrix shows the following:

  • True Negative (1213), the model predicted negative and the actual was also negative.
  • False Positive (339), the model predicted positive but the actual was negative.
  • True Positive (432), the model predicted positive and the actual was also positive.
  • False Negative (129), the model predicted negative but the actual was positive.

TPR-FPR at every Threshold

  • True Positive Rate (also known as recall or sensitivity) measures the proportion of true positive cases correctly identified by the model among all actual positive cases. It is calculated as the ratio of true positives to the sum of true positives and false negatives.
  • False Positive Rate measures the proportion of false positive cases incorrectly identified as positive by the model among all actual negative cases. It is calculated as the ratio of false positives to the sum of false positives and true negatives.
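Plugging in the confusion-matrix counts above gives a quick check of both definitions:

```python
tp, fn, fp, tn = 432, 129, 339, 1213  # counts from the confusion matrix above

tpr = tp / (tp + fn)  # recall / sensitivity
fpr = fp / (fp + tn)
print(f"TPR = {tpr:.3f}, FPR = {fpr:.3f}")  # TPR ≈ 0.770, FPR ≈ 0.218
```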

(figure: tpr_Fpr_threshold)

TPR and FPR are essential for evaluating the trade-off between sensitivity and specificity in classification models.

  • Increasing the threshold will result in a lower FPR but also a lower TPR.
  • Decreasing the threshold will result in a higher TPR but also a higher FPR.
  • If we want to give more attention to customers who are likely to churn, we can decrease the threshold (see the sketch below).
    • This approach is cost-effective, as giving special attention to customers likely to churn can prevent potential revenue loss.
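A sketch of this threshold sweep with scikit-learn, again reusing `model`, `X_test`, and `y_test` from the earlier snippet:

```python
from sklearn.metrics import roc_curve

# Predicted churn probabilities for the positive class
y_scores = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_scores)

# Lowering the decision threshold below 0.5 trades a higher FPR for a higher TPR,
# i.e. more at-risk customers are flagged for retention efforts
threshold = 0.35  # illustrative value, not tuned
y_pred_lowered = (y_scores >= threshold).astype(int)
```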

Receiver Operating Characteristic (ROC) Curves

ROC curves are graphical representations of the true positive rate (TPR) versus the false positive rate (FPR) at various threshold settings. While TPR and FPR provide specific performance metrics at particular thresholds, the ROC curve offers a comprehensive visualization of the model's performance across all thresholds, facilitating a better understanding of the trade-offs and overall efficacy.

(figure: roc_curves)

ROC Area Under Curve

The ROC curve allows for the calculation of the Area Under the Curve (AUC), a single scalar value that summarizes the overall ability of the model to discriminate between positive and negative cases. A higher AUC indicates better overall performance of the model.
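With the values from the previous snippet, the AUC can be computed directly:

```python
from sklearn.metrics import roc_auc_score

print("ROC-AUC:", roc_auc_score(y_test, y_scores))
```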

(figure: roc_area_curve)

An AUC score of 0.8612 suggests that our model has strong predictive power and is highly effective at distinguishing between the classes. It reflects the model's robustness and its potential utility in practical applications.

7. The Best Model using PyCaret

Model Comparisons

(figure: model comparisons)
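A minimal PyCaret sketch of this comparison (the setup arguments are assumptions; the linked notebook has the exact configuration):

```python
from pycaret.classification import setup, compare_models, create_model

# df is the cleaned Telco dataset with the 'Churn' target column
clf = setup(data=df, target="Churn", session_id=42)

best = compare_models()    # ranks candidate models on cross-validated metrics
ada = create_model("ada")  # Adaptive Boosting, the top performer reported above
```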

Adaptive Boosting (ADA) Model

(figure: AdaBoost model results)

Model with Recall Optimization (Logistic Regression)

  • Maximize recall score for the positive class (churned customers).
  • We assume that acquiring new customers costs more than retaining existing ones.

(figure: recall-optimized Logistic Regression results)
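A hedged sketch of tuning Logistic Regression for recall in PyCaret, continuing the `setup` call above:

```python
from pycaret.classification import create_model, tune_model

lr = create_model("lr")                        # Logistic Regression baseline
lr_recall = tune_model(lr, optimize="Recall")  # hyperparameter search that maximizes recall
```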

8. Conclusion

The analysis reveals several critical factors contributing to customer churn. Key patterns indicate that customers who are more likely to churn typically share the following characteristics:

  • Contract Type:
    • Customers with a month-to-month contract are at a significantly higher risk of churning compared to those with longer-term commitments.
    • This suggests that the flexibility of a monthly contract may not foster long-term loyalty.
  • Payment Methods:
    • A notable trend is observed among customers who use electronic check payment methods or opt for paperless billing.
    • These payment preferences are correlated with a higher churn rate.
  • Demographic Characteristics:
    • Senior Citizens:
      • Older customers, specifically those identified as Senior Citizens, exhibit a higher likelihood of churning.
      • This may be due to factors such as changing service needs or financial considerations.
    • Marital and Family Status:
      • Customers without a partner and without dependents are more prone to churn.
      • This demographic might be more mobile and less tied down, making them more open to switching providers.

9. Recommendation

In our efforts to accurately predict customer churn, it is crucial to select a model that balances high performance with practical considerations specific to our business needs. Below is a detailed recommendation for model selection tailored to two different scenarios:

General Case: Maximizing Accuracy

For the general scenario where our primary objective is to achieve the highest possible accuracy in predicting both churned and non-churned customers, we recommend utilizing the Adaptive Boosting (AdaBoost) model. This ensemble technique is known for its robust performance and ability to improve predictive accuracy by combining the outputs of multiple weak classifiers to form a strong one. AdaBoost effectively reduces bias and variance, making it an excellent choice for a balanced and accurate prediction model.

Specific Case: Cost-Sensitive Prediction

In scenarios where the cost of acquiring new customers significantly outweighs the cost of retaining existing ones, our focus shifts toward optimizing for customer retention. In such cases, we recommend using Logistic Regression as the primary model. Logistic Regression offers a solid balance between precision and recall, ensuring that we effectively identify customers who are at risk of churning without compromising the other key metrics.

By prioritizing recall, we ensure that our model is sensitive to customers who are likely to churn, allowing us to take proactive measures to retain them. This approach helps in maximizing the return on investment by focusing on customer retention efforts.

Summary of Recommendations

  • General Case: Use Adaptive Boosting (AdaBoost) for its superior accuracy and robust performance across diverse data sets.
  • Specific Case (Cost-Sensitive): Use Logistic Regression to achieve high recall, particularly when customer acquisition costs are a significant concern.