This project shows different ways to deal with missing values in categorical features. I used the Classified Ads for Cars dataset from Kaggle to predict the price of ads with a simple linear regression model.
To illustrate the strategies and their respective pros and cons, we will focus on one categorical feature of this dataset: the maker, i.e. the car's brand name (Toyota, Kia, Ford, BMW, ...).
We will cover the following techniques:
- Replace missing values with the most frequent value.
- Delete rows with null values.
- Predict the missing values using a classification algorithm (supervised or unsupervised).
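
The three techniques listed above can be sketched on a toy example. This is only a minimal illustration, not the notebook's actual code: the data is made up, the column names `engine_power` and `engine_displacement` are assumptions, and the choice of a supervised `DecisionTreeClassifier` from scikit-learn is just one possible classifier.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy data, made up for illustration (not taken from the Kaggle dataset).
# Column names other than "maker" are assumptions.
df = pd.DataFrame({
    "maker": ["toyota", "toyota", "bmw", None, "bmw", None, "toyota"],
    "engine_power": [70, 75, 180, 72, 190, 185, 68],
    "engine_displacement": [1300, 1400, 3000, 1300, 3000, 2900, 1000],
})

# 1) Replace missing values with the most frequent value (the mode).
#    Series.mode() ignores missing values by default.
most_frequent = df["maker"].mode()[0]
df_mode = df.assign(maker=df["maker"].fillna(most_frequent))

# 2) Delete the rows where the maker is null.
df_drop = df.dropna(subset=["maker"])

# 3) Predict the missing maker from the other features:
#    train on rows where the maker is known, predict where it is missing.
features = ["engine_power", "engine_displacement"]
known = df[df["maker"].notna()]
unknown = df[df["maker"].isna()]

clf = DecisionTreeClassifier(random_state=0)
clf.fit(known[features], known["maker"])

df_pred = df.copy()
df_pred.loc[unknown.index, "maker"] = clf.predict(unknown[features])
```

Each strategy returns a new DataFrame and leaves `df` untouched, so the three results can be fed to the same regression model and compared. The sketch leaves the price out of the classifier's inputs because it is the regression target.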
Links:
- Post on Medium
- Published thanks to jupyter_to_medium
- Notebook on Kaggle