KNNAlgorithmWineQuality.Rmd

---
title: "Predicting Wine Quality With KNN Algorithm"
subtitle: " Classification and Prediction with using KNN Algorithm"
date: "`r format(Sys.time(), '%d %B %Y')`"
author: "Havva Nur Elveren"
output:
  word_document:
      toc: yes
      toc_depth: 4
      #toc_float: true
  html_document:
      theme: journal
      toc: yes
      toc_depth: 4
      #toc_float: true
  pdf_document:
      toc: yes
      theme: journal
      toc_depth: 4
      #toc_float: true
---
---
# Objective: Predicting Wine Quality
Can we predict wine quality based on its features such as acidity, alcohol, sugar or sulfate level? In this project, we'll predict Wine Quality with looking at the value of different features of a wine. We'll use a data set that has been collected from red wine variants of the Portuguese "Vinho Verde" wine. If quality is greater than 6.5 it is considered as good wine, otherwise it is considered as bad wine.

# Data Description:
* 1.6K Row with 12 Column. You can download the data from the link https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009
```{r}
library(kableExtra)

dt <- data.frame(Name = c("fixed.acidity", "volatile.acidity", "citric.acid", "residual.sugar", "chlorides", "free.sulfur.dioxide", "total.sulfur.dioxide",
"density", "pH", "sulphates", "alcohol", "quality"),
Description = c("most acids involved with wine or fixed or nonvolatile (do not evaporate readily)", "the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste", "found in small quantities, citric acid can add 'freshness' and flavor to wines", "the amount of sugar remaining after fermentation stops, it's rare to find wines with less than 1 gram/liter and wines with greater than", "the amount of salt in the wine", "the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents", "amount of free and bound forms of S02; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2", "the density of water is close to that of water depending on the percent alcohol and sugar content", "describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the", "a wine additive which can contribute to sulfur dioxide gas (S02) levels, wich acts as an antimicrobial and", "the percent alcohol content of the wine","Score between 0 and 10, if quality > 6.5 it's Good, otherwise it is Bad "))

dt %>%
  kbl() %>%
  kable_styling()

```

## Step 1: Load the Libraries

```{r}
library(gmodels)
library(caret)
library(class)
library(knitr)
library(e1071)
```

## Step 2: Load the Data Set
```{r}
winedata <- read.csv("winequality-red.csv")
winedata <- data.frame(winedata, stringsAsFactors = FALSE)
str(winedata)
```
## Step 3: Prepare the Data

```{r}
# If quality score is greater than 6.5 set quality as Good, otherwise set as Bad
winedata$quality[winedata$quality>6.5] <- 'Good'
winedata$quality[winedata$quality<= 6.5] <- 'Bad'
winedata$quality <- as.factor(winedata$quality)
table(winedata$quality)

# Use the %75 of the data for training and the rest for testing the model
trainingIndex <- createDataPartition(y = winedata$quality, p = .75, list = FALSE)
winedata_train <- winedata[trainingIndex,]
winedata_test <- winedata[-trainingIndex,]

# with excluding the quality column, scale the data with z-score
trainX <- winedata_train[,names(winedata_train) != "quality"]
preProcValues <- preProcess(x = trainX,method = c("center", "scale"))
preProcValues

```

## Step 4: Creating Training Model with KNN

```{r}
# Use K-fold cross validation with k=10 for training to split the training data randomly to k subset and train the model repeatedly based on given repeat value. 
set.seed(1)
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
model <- train(quality~ ., data = winedata_train, method = "knn", trControl = control, preProcess = c("center","scale"), tuneLength = 20)
model
plot(model)
```

## Step 5: Make Prediction
```{r}
prediction <- predict(model, newdata= winedata_test)
confusionMatrix(prediction, winedata_test$quality)
mean(prediction == winedata_test$quality)
CrossTable(prediction, winedata_test$quality)
```
# Conclusion
In this project, Knn implementation predicts the wine quality with %88 accuracy. But the data used in this project is not balanced, number of bad quality wines are higher than good wines in this data set, therefore prediction value for bad wines is higher comparing to good wines.