
Project with real retail data. This repository shows my proposed solution for estimating sales from information captured by IoT sensors. For confidentiality reasons, the full data analysis and transformation process is not shown.


oordenesg/retail_real_data


retail_real_data

The objective of this project is to predict the sales of a retail chain from the signals captured by beacons (IoT sensors) inside a store. Several methods can be used for this, such as time series models, SVM, or LSTM. This summary does not cover the statistical analysis of the data, but it describes some characteristics of the data set and the main results obtained with the SVM model.

  • The data set has 14 variables. Some of them have similar characteristics, and others are not useful for estimating sales.
  • Six variables contain null values. One attribute has 87% missing values.
  • The data set contains different sensor IDs. The objective is to estimate the sales for each sensor and then add these results to obtain the total sales.
  • The data set also contains duplicates: 4.6% of the records are repeated.
  • Each record has a date attribute, which must be split into year, month and day.
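
The preparation steps listed above could be sketched with pandas as follows. The column names (`sensor_id`, `date`, `signal`, `sales`) and the toy values are hypothetical, since the real data set is confidential; this is an illustration of the steps, not the pipeline used in the project.

```python
import pandas as pd

# Hypothetical toy data; the real columns and values are confidential.
df = pd.DataFrame({
    "sensor_id": ["a", "a", "a", "b", "b"],
    "date": ["2019-01-05", "2019-01-05", "2019-01-06", "2019-01-05", "2019-01-06"],
    "signal": [10.0, 10.0, None, 7.0, 9.0],
    "sales": [100.0, 100.0, 80.0, 50.0, 60.0],
})

# Share of missing values per column, to spot attributes that are mostly empty.
null_share = df.isna().mean()

# Share of fully repeated rows, then drop them.
duplicate_share = df.duplicated().mean()
df = df.drop_duplicates().reset_index(drop=True)

# Split the date attribute into year, month and day.
dates = pd.to_datetime(df["date"])
df["year"] = dates.dt.year
df["month"] = dates.dt.month
df["day"] = dates.dt.day

# Sales per sensor, then summed to obtain the total.
sales_per_sensor = df.groupby("sensor_id")["sales"].sum()
total_sales = sales_per_sensor.sum()
```

In this sketch the per-sensor totals are computed from observed sales; in the project, the same aggregation would be applied to the per-sensor predictions.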

The first idea was to use an SVM model to predict a sales threshold. This approach has a drawback: as the threshold increases, fewer records are available for prediction. Figure 1 illustrates the problem.
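
A small illustration of this effect, using made-up sales values and assuming that only records at or above the threshold are of interest: the number of such records can only shrink as the threshold grows, so high thresholds leave very few examples.

```python
# Made-up sales values for illustration only.
sales = [12, 35, 40, 58, 61, 75, 90, 120, 150, 300]


def records_at_or_above(values, threshold):
    """Count how many records reach the given sales threshold."""
    return sum(1 for v in values if v >= threshold)


thresholds = [0, 50, 100, 200]
counts = {t: records_at_or_above(sales, t) for t in thresholds}
for t, n in counts.items():
    print(f"threshold {t}: {n} records")
```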

For this reason, and given the small amount of data for some sensors, an alternative is to turn the regression problem into a classification problem. This was done by creating sales ranges, which produce small groups of data to which an oversampling technique can be applied. Figure 2 shows the confusion matrix of the SVM model with oversampling and without hyperparameter optimization.
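
A minimal sketch of this idea with scikit-learn, on synthetic data. The range boundaries, the single feature, and the choice of random oversampling are assumptions made for illustration; the source does not state which oversampling technique or ranges were used.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.utils import resample

rng = np.random.default_rng(0)

# Synthetic stand-in for the confidential data: one feature, skewed sales.
X = rng.normal(size=(300, 1))
sales = np.exp(1.5 * X[:, 0] + rng.normal(scale=0.3, size=300)) * 100

# Turn the regression target into classes by cutting sales into ranges.
bin_edges = [200, 600]             # hypothetical range boundaries
y = np.digitize(sales, bin_edges)  # 0: low, 1: medium, 2: high

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Random oversampling on the training split only: resample every class
# with replacement up to the size of the largest class.
target = np.bincount(y_train).max()
X_parts, y_parts = [], []
for label in np.unique(y_train):
    mask = y_train == label
    X_up, y_up = resample(
        X_train[mask], y_train[mask],
        replace=True, n_samples=target, random_state=0,
    )
    X_parts.append(X_up)
    y_parts.append(y_up)
X_bal = np.vstack(X_parts)
y_bal = np.concatenate(y_parts)

# SVM with default hyperparameters (not optimized).
model = SVC().fit(X_bal, y_bal)
cm = confusion_matrix(y_test, model.predict(X_test), labels=[0, 1, 2])
print(cm)
```

Oversampling is applied after the train/test split so that copies of the same record cannot appear on both sides of the evaluation.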

The next step was to optimize the model's hyperparameters. At this stage, three types of kernels were tried, along with different values of the parameter C.
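
A search of this kind could be sketched with scikit-learn's `GridSearchCV`. The specific kernels, the values of C, the scaling step, and the synthetic data are assumptions for illustration; the source does not list the exact search space.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic three-class data standing in for the confidential data set.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4,
    n_classes=3, random_state=0,
)

# Hypothetical search space: three kernels and a few values of C.
param_grid = {
    "svc__kernel": ["linear", "rbf", "poly"],
    "svc__C": [0.1, 1, 10, 100],
}

# Scale the features before the SVM, then cross-validate every combination.
pipeline = make_pipeline(StandardScaler(), SVC())
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```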

The optimization improved the model's accuracy to 90.6%, 5% higher than that of the non-optimized SVM model. Figure 4 shows the new confusion matrix.

This problem can be addressed with other analytical techniques. In the future, new methods will be added so that all the models can be compared.
