SooyeonWon/ML_starbucks_capstone_projects

Starbucks Capstone Challenge

Machine Learning Nanodegree Capstone Project

by Sooyeon Won

Keywords

  • Supervised Learning
  • Binary Classification Models
     1. sklearn ensemble models (Random Forest, Gradient Boosting, AdaBoost)
     2. XGBoost vs. LightGBM vs. CatBoost
     3. Logistic Regression Model (Benchmark)
  • Imbalanced Data
      Synthetic Minority Oversampling Technique (SMOTE)
  • Evaluation Metrics
      Accuracy, Precision, Recall, F1-score
  • Data Visualisations

Summary of Findings

In this analysis, I examined how Starbucks customers use offers, based on the transaction data. The customer profile dataset contains some missing values, which I imputed with the median of each feature. Median imputation has two advantages. First, the median is a typical value of the feature (and, for an odd number of observations, one that was actually observed), so the filled-in values are realistic. Second, unlike the mean, it is not pulled towards the tail of a skewed distribution. In the transcript dataframe, I converted the time column into day and month columns.
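The imputation and time-conversion steps could look roughly like the sketch below. It is only an illustration on toy data: the column names (`age`, `income`, `time`), the assumption that `time` counts hours since the start of the test period, and the 30-day "month" are all assumptions, not details taken from the project notebooks.

```python
import numpy as np
import pandas as pd

# Hypothetical toy profile table; the column names are assumptions.
profile = pd.DataFrame({
    "age": [25, 40, np.nan, 61, 75],
    "income": [30000.0, np.nan, 52000.0, 70000.0, 120000.0],
})

# Fill each numeric column's missing entries with that column's own median.
medians = profile.median(numeric_only=True)
profile_imputed = profile.fillna(medians)

# Assumption: `time` is measured in hours since the start of the test period.
transcript = pd.DataFrame({"time": [0, 30, 168, 720]})
transcript["day"] = transcript["time"] // 24 + 1          # 1-based day index
transcript["month"] = (transcript["day"] - 1) // 30 + 1   # assumed 30-day months
```

Note that with an even number of observed values, as in this toy table, the median is the midpoint of the two middle values rather than an observed value.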

Using the cleaned data, I explored the current business situation. Traffic varies from month to month and increased until the third month. Sales amounts follow an almost identical pattern, which suggests that more traffic brings better sales performance. Interestingly, although traffic and sales differ from month to month, the average spending per transaction is remarkably similar across months.
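A monthly comparison of this kind can be sketched with a single group-by. The data below is made up, the column names `month` and `amount` are assumptions, and "traffic" is taken here to mean the number of transactions, which may differ from the definition used in the notebooks.

```python
import pandas as pd

# Hypothetical transactions; `month` and `amount` are assumed column names.
transactions = pd.DataFrame({
    "month": [1, 1, 2, 2, 2, 3, 3, 3, 3],
    "amount": [10.0, 14.0, 12.0, 11.0, 13.0, 12.0, 12.0, 10.0, 14.0],
})

monthly = transactions.groupby("month")["amount"].agg(
    traffic="count",     # number of transactions in the month
    sales="sum",         # total amount spent in the month
    avg_spend="mean",    # average amount per transaction
)
print(monthly)
```

In this toy example traffic and sales both grow month over month while `avg_spend` stays constant, which mirrors the pattern described above.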

All customers in the profile dataset made purchases at Starbucks, although not all of them received offers. The average customer age is 54.5 years, and the average yearly income is around 65227. There is no significant difference between genders in the types of offers received. Finally, neither the total number of offers nor the number of each type of offer is correlated with a customer's age, income, or number of days as a Starbucks member.

The number of issued offers is unbalanced across offer types. 'BOGO' and 'Discount' offers are almost evenly distributed, at around 30000 each, whereas 'Informational' offers were issued only about half as often (ca. 15000). Not all issued offers are viewed: only 75.68% (= 57725/76277) were checked by customers. Only about half of the issued 'BOGO' and 'Discount' offers were completed.

In addition, I explored customer purchasing patterns using RFM analysis, a method for evaluating customer value that is often used in database marketing, especially in the retail and professional services industries. RFM stands for three dimensions: Recency, Frequency, and Monetary Value.
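As a rough illustration of the three RFM dimensions, the sketch below computes them per customer from a toy transaction table. The column names (`customer_id`, `day`, `amount`) and the choice of the last observed day as the reference point for recency are assumptions made for this example.

```python
import pandas as pd

# Hypothetical transactions; all column names are assumptions.
transactions = pd.DataFrame({
    "customer_id": ["a", "a", "b", "c", "c", "c"],
    "day": [1, 20, 5, 10, 25, 29],
    "amount": [5.0, 7.5, 20.0, 3.0, 4.0, 6.0],
})

# Reference point for recency: the last day observed in the data.
snapshot_day = transactions["day"].max()

rfm = transactions.groupby("customer_id").agg(
    last_day=("day", "max"),         # day of the most recent purchase
    frequency=("amount", "count"),   # number of purchases
    monetary=("amount", "sum"),      # total amount spent
)
rfm["recency"] = snapshot_day - rfm["last_day"]  # days since last purchase
rfm = rfm[["recency", "frequency", "monetary"]]
print(rfm)
```

A lower recency together with a higher frequency and monetary value indicates a more valuable customer.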

As mentioned in the Capstone proposal, I defined desirably used offers by both Case 1 and Case 2. Based on this definition, I classified every offer usage as either 'desirable' or 'non-desirable' for each offer type. As the first bar chart in Part 2 shows, all three datasets are highly imbalanced. I therefore alleviated the imbalance by applying the Synthetic Minority Oversampling Technique (SMOTE), and then trained various classification models on each dataset.
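The core idea of SMOTE is to create synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The NumPy sketch below is a simplified illustration of that idea only. It is not the implementation used in this project, the function name `smote_sketch` and the toy data are hypothetical, and in practice one would more likely use `SMOTE` from the imbalanced-learn package.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating towards minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances between minority samples.
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]  # k nearest per sample
    # Pick a random base sample and one of its k neighbours for each new row.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    # Place the new point at a random position on the connecting segment.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy minority class with 4 samples, to be grown to 10 samples.
X_minority = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5]])
synthetic = smote_sketch(X_minority, n_new=6)
X_minority_balanced = np.vstack([X_minority, synthetic])
```

Because every synthetic point lies on a segment between two real minority samples, the new rows stay inside the region the minority class already occupies. The sketch requires `k` to be smaller than the number of minority samples.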

For all offer types, "LGBMClassifier" showed the best model performance. It achieved F1-scores of 0.7742, 0.7388, and 0.8830 on the bogo, discount, and informational datasets, respectively, and did so in a shorter time. Its F1-score is considerably higher than that of the benchmark model, and its run time is much shorter. The LGBMClassifier model is therefore more efficient than the benchmark model.
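For reference, the F1-score is the harmonic mean of precision and recall. The short sketch below computes all three from confusion-matrix counts. The counts are made up for illustration and are unrelated to the scores reported above, and `f1_from_counts` is a hypothetical helper name.

```python
def f1_from_counts(tp, fp, fn):
    """Return (precision, recall, f1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Made-up counts for illustration only.
precision, recall, f1 = f1_from_counts(tp=80, fp=20, fn=40)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Because it combines precision and recall, the F1-score is usually a more informative metric than accuracy on imbalanced data.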

References