---
title: "Image Clustering in R"
subtitle: "with Keras, PCA and k-means"
author: "Markus Armbrecht"
date: "`r Sys.Date()`"
format:
  html:
    page-layout: article
    toc: true
    link-external-newwindow: true
    toc-title: "Image Clustering in R"
    toc-location: left
editor: source
callout-appearance: simple
execute:
  warning: false
---
```{r libs}
#| include: false
library(reticulate)
library(keras)
library(tensorflow)
library(tidyverse)
library(tidymodels)
library(tidyclust)
library(recipes)
library(jpeg)
library(butcher)
library(ggplot2)
#library(factoextra)
#library(cluster)
```
```{r helper-functions}
#| include: false
# Helper function to use Sys.time() in filenames on Windows and Mac.
# Windows cannot handle ":" in filenames.
get_time <- function(){
  myTime <- Sys.time()
  myTime <- gsub(":", "_", myTime)
  myTime <- gsub(" ", "_", myTime)
  myTime <- gsub("-", "_", myTime)
  return(myTime)
}
```
# Introduction
The need for self-labeled image data became obvious while I was researching a project for my master's thesis. I was looking for a way to automatically sort images by their content and found a project by Gabo Flomo [(towardsdatascience.com: 2020)](https://towardsdatascience.com/how-to-cluster-images-based-on-visual-similarity-cd6e7209fe34), in which flower images are clustered by similarity.
In this R learning project the Python code from Gabo Flomo is used as a blueprint. It is translated to R and incorporated into a Data Science Life Cycle.
:::: column-margin
::: {.callout-note}
## Link
[GitHub Repo](https://github.com/TheArmbreaker/clustering-with-keras){target="_blank"}
:::
::::
The following wording is used to support easier reading and unique notation. When speaking of the flower-images dataset originally used by Gabo Flomo, the phrase *flower data* is used. For the later introduced dataset of weapons in images, the phrase *weapons data* is used. Additionally, the resulting clusters and models are named *flower model* and *weapon model*. This is especially important because two different clustering approaches will be compared, and for both the flower model and the weapon model there will be a prediction model. These prediction models are named *flower prediction model* and *weapon prediction model*.
Overview of names for **flower** and **weapon**
- *data* refers to the dataset.
- *model* refers to the model of the clustering approach in Base R.
- *prediction model* refers to the model of the clustering approach in tidyclust with a recipe.
## Datasets
The images can be found here:
- Weapons in Images on [kaggle.com](https://www.kaggle.com/datasets/jubaerad/weapons-in-images-segmented-videos){target="_blank"}
- Flower Color Images on [kaggle.com](https://www.kaggle.com/datasets/olgabelitskaya/flower-color-images){target="_blank"}
To execute the code in a locally run Shiny web app, the data should be copied into the folders "flowers" and "weapons" of the R project folder. From the dataset downloads, the subfolders *flower_images* and *Weapons-in-Images* are required.
## Environment
From experience in past projects it is known that my daily working computer cannot perform image processing in Keras without crashing. Therefore, solutions like Amazon SageMaker Studio Lab, Databricks Community Edition and Amazon SageMaker (without Studio Lab) were explored.
While SageMaker Studio Lab only supported Python, the Amazon SageMaker solution was promising, despite generating a small amount of cost within the free usage limits of AWS. Unfortunately, loading a pre-trained Keras model within a JupyterLab environment on Amazon SageMaker returned twisted input shapes, and the code therefore crashed when trying to feed in images. Interestingly, R instances on local machines provide a model with correct input shapes. Thus, a Stack Overflow question was posted to clarify whether this is a bug. Please follow the links in the margin for more details.
:::: column-margin
::: {.callout-note}
## Stackoverflow
[1. my Question and my Answer](https://stackoverflow.com/questions/75988301/vgg16-different-shape-between-r-and-python-how-to-deal-with-that){target="_blank"}\
[2. Example Images InputLayer](https://imgur.com/a/5dOaJWf){target="_blank"}
:::
::::
The Databricks environment was registered via the Community Edition link. However, it generated massive AWS costs within one night, so that approach had to be canceled.
Finally, this project is implemented on a Windows gaming PC for modeling activities and a MacBook for writing code and text. This setup is supported by a GitHub repository to exchange files and results. Files that exceed GitHub's size limit are exchanged via Google Drive.
The following code chunk sets the Python environment to be used by the Keras and TensorFlow functions in R.
```{r prepare-env}
use_condaenv("condatascience") # environment on my Windows Machine
myPy <- py_config()$python
myPy
```
The following code confirms successful loading of Tensorflow.
```{r confirm-tf}
tf$constant("Hello Tensorflow!")
```
# Plan
## Use Case
The thesis subject is going to be object detection on military personnel. For such topics almost no dataset is publicly available. Thus, images from datasets have to be examined and labeled manually.\
The vast number and variety of unlabeled images should not be evaluated picture by picture. Instead, unsupervised learning with clustering shall be used to support the understanding of the image contents. This might help to sort out irrelevant images before labeling and also to create image categories.
## Procedure
The flower data is used to write working code and navigate differences between the coding languages. In an early prototype, Base R code was used for clustering; it was later transferred to tidyclust and recipes to generate a model that can predict on new images. However, during this transformation, differences in the results were noticed, and both code versions and their results were kept for evaluation.
The flower and weapon data hold information in the form of features which have to be extracted before clustering. This task is done by the pre-trained imagenet VGG16 model available in the keras library. In accordance with the flower-data blueprint from Gabo Flomo, the final output layer is bypassed so that the model returns image features instead of class probabilities. Those features are forwarded to PCA and k-means clustering to support the user's activities of understanding the data contents. All resulting clusters and models are deployed in a Shiny web app. It would have been possible to evaluate all models before deployment, but the user shall decide which clusters are better and therefore needs access to all results.
![Procedure Image](procedure_example.png){style="display: flex; justify-content: center;text-align:center;"}
## Anticipated Outcomes
The anticipated outcome of exploratory activities on the flower data is separate clusters for yellow and white flowers. Regarding the weapon data, the clusters might separate images of pistols and rifles, although that would probably require a large amount of tinkering. More realistic results are clusters of close-up weapons, people with weapons, weapons in front of a landscape, and camera angles.
The flower data suggests 10 different clusters through the named species in the dataset description. For the weapon data the number of clusters is unknown and will be selected based on an elbow curve.
The following example images show that all flower images are taken at close range, while the weapon data contains a large variety of objects, like soldiers, toys, persons aiming, persons not aiming, vehicles and landscapes.
::: {#fig-flowers layout-nrow="2" style="text-align:center;"}
![Example 1](flowers/0001.png)
![Example 2](flowers/0009.png)
![Example 3](flowers/0002.png)
![Example 4](flowers/0003.png)
Example Images Flowers
:::
::: {#fig-weapons style="text-align:center;" layout="[[1,1], [1],[1]]"}
![Example 1](weapons/ee619b2e6f861f1e.jpg){width="224"}
![Example 2](weapons/e9552b04f79630ee.jpg){width="224"}
![Example 3](weapons/beef33684f8a177e.jpg){width="480"}
![Example 4](weapons/c1ac27d27c37d962.jpg){width="480"}
Example Images Weapons
:::
# Data
## Data Ingestion
To prevent conflicts between the loaded libraries, the above-mentioned Base R approach is provided in a separate document, which is linked in the margin column on the right. This document focuses on the approach with tidyverse, tidyclust and recipes.
:::: column-margin
::: {.callout-note}
## Link
[Base R Code](clustering_code.html){target="_blank"}
:::
::::
The following approach results in a pipeline that can be used for deployment and prediction. The Base R code does not use a pipeline or a prediction function and only provides the cluster results.
The code below enables switching between the flower and weapon data without changing other code chunks later in the process.
```{r}
# Switch to toggle between the prototype flower images and the weapon images to be clustered
flower_data <- FALSE
if (flower_data){
  myPath <- "flowers"
  print("Flower Data loaded.")
} else {
  myPath <- "weapons"
  print("Weapon Data loaded.")
}
# Actual loading of the file names in the path into a list
myFiles <- list.files(myPath)
```
## Data Split
The data will not be split for this unsupervised model training with subjective evaluation by the user.
Splitting the data at any point in this project would prevent overfitting in the model. However, in this rare use case and for the given data, such overfitting is wanted to obtain maximally distinct clusters. It is, however, not appropriate for the prediction on new images.
By *any point* it is meant that splits are performed neither on images, nor on feature arrays, nor on principal components.
## Data Analysis
Before the features are extracted, a feeling for the files themselves shall be developed. Therefore, the dimensions of each image are plotted. Images with less than 840px in each dimension are labeled as small, other images as large.
The following plot shows width and height in an xy-plot with colored labels. The dashed black lines show the median value of the respective dimension. The blue line shows a ratio of 1 between x and y.\
The plot shows that many pictures have 1024px on at least one dimension and a smaller size on the other. The median value of x is larger than 1024, which is obscured by overlapping points in the plot: compared to all pictures, a high number of pictures has a width of 1280px.
```{r data-exploration}
#| cache: true
get_num_pixels <- function(filepath) {
  img <- readJPEG(filepath)
  width <- dim(img)[2]
  height <- dim(img)[1]
  return(list(width = width, height = height))
}
width_size <- list()
height_size <- list()
size_cat <- list()
for (file in myFiles) {
  path <- paste(myPath, "/", file, sep = "")
  dimensions <- get_num_pixels(path)
  width <- dimensions$width
  height <- dimensions$height
  if (width <= 840 && height <= 840) {
    size_cat <- c(size_cat, 'small')
  } else {
    size_cat <- c(size_cat, 'large')
  }
  width_size <- c(width_size, width)
  height_size <- c(height_size, height)
}
myDF <- tibble(x = as.numeric(width_size), y = as.numeric(height_size), z = as.character(size_cat))
ggplot(myDF, aes(x, y, colour = z)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, colour = "blue") +
  geom_vline(xintercept = median(myDF$x), linetype = "dashed", colour = 'black', linewidth = 1) +
  geom_hline(yintercept = median(myDF$y), linetype = "dashed", colour = 'black', linewidth = 1) +
  labs(x = "Width", y = "Height") +
  xlim(0, 1300) +
  ylim(0, 1300)
```
The following statements support the discussed overlap for images having a width of exactly 1280px: there are more than two thousand images with width 1280. Note that only the x-axis is counted, because a fixed width of 1280 still comes with different heights.
```{r file-size-counts}
myDF |> count(x) |> arrange(desc(n)) |> top_n(3,n)
```
Finally, a quick overview of the summary statistics. These confirm the plotted median values and also show the mean being a bit smaller than the median.
```{r file-size-summary}
summary(myDF)
```
## Feature Extraction
To extract features for clustering, the following code chunk uses keras functions to load an image, transform it into an array suitable for the pre-trained model, and run a prediction for that image. The code returns the features of the image.
```{r feature-extract}
get_features <- function(image_file, myModel){
  # takes: image file and model
  # returns: feature vector for the provided image
  # description:
  # the image is loaded in color and scaled to 224x224 px.
  # The image is then converted to an array, and this array is
  # prepared for prediction with the provided model.
  img <- image_load(image_file, grayscale = FALSE, target_size = c(224,224))
  img_array <- image_to_array(img)
  reshaped_image_array <- array_reshape(img_array, c(1, dim(img_array)))
  prepro_img <- imagenet_preprocess_input(reshaped_image_array)
  features <- myModel |> predict(prepro_img)
  return(features)
}
```
:::: column-margin
::: {.callout-tip}
## Tip keras
Note that in R the function is image_load(), while in Python it is load_img().
:::
::::
In the code chunk above the pre-trained model is provided as an argument. This enables changing the model for later improvement and for testing other models; target_size has to be adjusted manually or programmed as an additional argument. The following code chunk loads the pre-trained model VGG16 from the keras library. Note that the output layer is bypassed, so the second-to-last layer provides features as a tensor instead of a classification result.
```{r load-keras-vgg16}
# Load model
model <- application_vgg16(weights="imagenet",include_top=TRUE)
# Change Output to second last layer to access the feature map instead of classification result.
output <- model$layers[[length(model$layers)-1]]$output
model <- keras_model(inputs=model$input, outputs=output)
```
In the following code chunk every image file is processed with the function described above to populate a list of features, which is then transformed into a matrix with 4096 features per row.
```{r extract-features}
#| cache: true
# empty list to hold extracted features per image
myFeatures <- list()
# fill the list with features by looping through the file list and calling get_features
for (img_file in myFiles){
  # create path and call function
  path <- paste(myPath, "/", img_file, sep = "")
  feat <- get_features(path, model)
  # append list with features
  len <- length(myFeatures)
  myFeatures[[len + 1]] <- feat
}
myFeatArray <- matrix(unlist(myFeatures), ncol = 4096, byrow = TRUE)
dim(myFeatArray)
```
Finally, the extracted features are transformed into a dataframe.
```{r prepare-features}
# transform the myFiles list into a single-column dataframe
file_names <- as.data.frame(myFiles)
# create a dataframe with one feature vector per row
file_features <- as.data.frame(myFeatArray)
# append both dataframes into one (no join due to a missing key column)
df_data <- bind_cols(file_names, file_features)
```
## Feature Engineering
The 4096 features per image are reduced with principal component analysis. The blueprint value of 100 components is used and might be adjusted for model improvements.
The code below creates a recipe with a single step_pca(). Normalisation was not applied, because some feature vectors sum to zero and normalisation would therefore divide by zero. Omitting normalisation is supported by the logic of image representation in arrays, which already uses a common scale for the RGB values.
```{r pca-code}
#| cache: true
# Create recipe without a label column via the tilde symbol
img_recipe <- recipe(~ ., data = file_features) |>
  step_pca(all_numeric(), num_comp = 100)
```
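The division-by-zero issue mentioned above can be reproduced in base R: standardising a constant (zero-variance) column, as a normalisation step would, divides by a standard deviation of zero. This toy sketch is for illustration only and is not part of the pipeline:

```{r toy-normalisation}
# a constant feature column, like VGG16 units that never activate
constant_col <- c(0, 0, 0, 0)
# centering and scaling computes (x - mean) / sd, and sd is 0 here
scaled <- scale(constant_col)
all(is.nan(scaled))  # TRUE: 0 / 0 is undefined
# PCA itself only needs centering, so it handles such columns fine
pca <- prcomp(cbind(constant_col, c(1, 3, 2, 4)), scale. = FALSE)
```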
# Model
## Optimal K
Before clustering on the PCA results, the number of clusters has to be set. There is an old rule of thumb, the square-root rule, for estimating the number of clusters. It is not very precise for the large datasets data scientists observe today, but it can be used to set a range for the elbow curve. Thus, this value could be used as the maximum for the following tuning function.
```{r old-optimal-k}
sqrt(nrow(df_data))
```
The tidymodels and tidyclust libraries provide a tuning function that works with resampling methods like bootstrapping or CV folds.\
CV folds are not used, in order to achieve the wanted overfit on the training data. Bootstrapping is used as the resampling method. It generates a sample the size of the original data by drawing observations with replacement; the times argument decides how often this happens. However, using the standard settings for bootstrapping requires a very long runtime. Therefore, values of *times* that are not necessarily optimal are used for this project.
Furthermore, the workflow() function from the workflows library is introduced, which can be used instead of manual preparation and baking. The recipe and the model are combined in this workflow and provided to tune_cluster().
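The bootstrap mechanics can be illustrated with base R alone: one resample is as large as the original data but is drawn with replacement, so some rows repeat while others stay out-of-bag. Toy row indices, for illustration only:

```{r toy-bootstrap}
set.seed(1)
n <- 10
# one bootstrap resample: n draws from the row indices 1..n, with replacement
boot_idx <- sample(n, size = n, replace = TRUE)
length(boot_idx) == n          # same size as the original data
length(unique(boot_idx)) < n   # but with duplicates, leaving some rows out-of-bag
```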
```{r kmean-tuning}
#| cache: true
# generate k_means model specification with tuning for optimal k
optK <- k_means() |>
  set_args(num_clusters = tune())
# generate workflow with preprocessing recipe and model
optK_wf <- workflow(img_recipe, optK)
# set bootstraps or cv-folds
myBoots <- bootstraps(df_data, times = 1)
# myCVfolds <- vfold_cv(df_data, v = 5)
start <- Sys.time()
# execute cluster tuning and store results
tune_res <- tune_cluster(
  optK_wf,
  resamples = myBoots,
  metrics = cluster_metric_set(sse_within_total, sse_total, sse_ratio),
  grid = expand.grid(num_clusters = 1:20)
)
end <- Sys.time()
# retrieve tuning results
myMetrics <- collect_metrics(tune_res)
print(end - start) # no additional text required for output
```
:::: column-margin
::: {.callout-note}
## Info
tune_cluster() with cvfold and bootstrapping is shown for completeness of the R learning project.
:::
::::
```{r kmean-elbow}
myMetrics |>
  filter(.metric == "sse_within_total") |>
  ggplot(aes(x = num_clusters, y = mean)) +
  geom_point() +
  geom_line() +
  theme_minimal() +
  ylab("mean WSS") +
  xlab("Number of clusters") +
  scale_x_continuous(limits = c(0, 20))
```
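For comparison, the elbow logic can also be sketched without the tidyclust stack, using base R's kmeans() on toy data (two artificial blobs; purely illustrative, not part of the project pipeline):

```{r toy-elbow}
set.seed(42)
# toy data: two well-separated 2-D blobs
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2))
# total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)
# the drop from k = 1 to k = 2 dominates, i.e. the elbow sits at k = 2
plot(1:6, wss, type = "b", xlab = "Number of clusters", ylab = "WSS")
```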
## K-Means - Fit
After the optimal k has been established, the model function can be provided with the number of clusters to generate. It is placed into a new workflow, which is then used for the fit.
```{r kmean-fit}
#| cache: true
# generate k_means model specification with the optimal k value
km <- k_means() |>
  set_args(num_clusters = 10)
# generate workflow with preprocessing recipe and model
km_wf <- workflow(img_recipe, km)
# fit the model to the data
km_fit <- fit(km_wf, data = df_data)
```
# Deployment
## Extraction
Usually model accuracy is reviewed before deployment, but this model is meant to support the exploration of a large number of images. Therefore, the model results are reviewed and discussed in the deployment chapter of this work.
Before the review, the cluster assignments are mapped to the image files. Additionally, the result table is stored in a CSV file. This enables the user of the deployed app to look at a cluster's contents.
```{r extract-cluster}
# string for the file path
myfilepath <- paste("recipe_", myPath, "_", get_time(), "_cluster", sep = "")
# extract the cluster assignments and bind them to the filename dataframe
df_clustered <- bind_cols(file_names, extract_cluster_assignment(km_fit))
# store results in a csv file
write_csv(df_clustered, paste(myfilepath, ".csv", sep = ""))
# return results
df_clustered |>
  head(3)
```
The model shall also be used for clustering new images with the predict() function. Therefore, the model is stored in an RDS file.
```{r store-model}
saveRDS(km_fit, paste(myfilepath,".rds",sep=""))
```
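The persistence mechanism itself is a plain serialisation round trip; a toy sketch with a hypothetical object and a temporary file (not the real km_fit) shows that readRDS() restores the object unchanged:

```{r toy-rds}
# any R object can be serialised to RDS and restored lossless
toy_model <- list(centers = matrix(1:4, ncol = 2), k = 2)
path <- tempfile(fileext = ".rds")
saveRDS(toy_model, path)
restored <- readRDS(path)
identical(restored, toy_model)  # TRUE: the round trip is lossless
```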
The stored model has a very large file size, especially for the weapons model. This file size conflicts with GitHub's push limits. Other options like h5 files and deployment with APIs were investigated: the h5 files did not work properly, and the API solution with vetiver did not ease the file-size problem.\
However, the butcher library can be used to reduce the fitted model to its essentials. The following code chunks show that vital parts of the model reside in *pre...recipe.steps.res.rotation* with 135 MiB. Unfortunately, given their size, those components are not removed by butcher. Therefore, the saved RDS file is exchanged via Google Drive.
:::: column-margin
::: {.callout-note}
## RDS Files
[GoogleDrive](https://drive.google.com/drive/folders/1LQ2Ixx4rwxcAW8TQPocOXQnUKcDCvBSE?usp=sharing){target="_blank"}
:::
::::
```{r butcher-model-one}
weigh(km_fit)
```
```{r butcher-model-two}
stripped_model <- butcher(km_fit)
weigh(stripped_model)
```
## Shiny Web App
This part covers the deployment as a Shiny app. For details on the code behind the app, please click the User Guide link in the right margin and also see the comments in the source code of *app.R*. Additionally, the app is deployed on shinyapps.io, where the User Guide can also be accessed.
:::: column-margin
::: {.callout-note}
## Shinyapps.io
[myApp](https://thearmbreaker.shinyapps.io/clustering-with-keras/){target="_blank"}\
[myApp: UserGuide](TecDoc_Shiny.html){target="_blank"}
:::
::::
The following code can be used for the deployment. To start, a list of relevant files is created to avoid bundling and publishing unnecessary files.
```{r create-bundle}
myBundle <- c()
for (i in list.files()) {
  file_format <- strsplit(i, "\\.")[[1]][2]
  if (file_format %in% c("rds", "csv", "html", "R") |
      (is.na(file_format) & (i != "old_results" & i != "rsconnect"))) {
    myBundle <- c(myBundle, i)
  }
}
myBundle
```
```{r shinyapps-deployment}
#| eval: false
# install.packages('rsconnect')
library(rsconnect)
# rsconnect::setAccountInfo(name='thearmbreaker', token='TOKEN', secret='SECRET')
deployApp(
appFiles = myBundle,
appName = "clustering-with-keras",
upload = TRUE,
logLevel = "verbose"
)
#configureApp("app_ImageClustering",size="xlarge")
```
## Description of App
The deployed app was recorded to preserve the results in case of crashes. A YouTube video is embedded in case the upload to Moodle has a size limit.
::: {style="display: flex; justify-content: center;"}
{{< video https://youtu.be/TlOUXbdlIDg width='480' height='360' >}}
:::
In the section **Show Clusters** of the app, a bar graph shows the number of images per cluster for one of the four models. To evaluate the results (similarity of images in clusters) in a heuristic manner, four random images are displayed when a specific cluster is selected.\
The app shows an example of naming the clusters with a text input field. The required code is not implemented, but it might look like the snippet below, which writes the names to an SQL table.
```{r sql-example}
#| eval: false
library(DBI)
cluster_name <- input$Textfield
selected_cluster <- input$var_clus
myDataframe <- tibble(cluster_names=c(cluster_name), clusters=c(selected_cluster))
dbWriteTable(connection,cluster_table,myDataframe, append=TRUE)
```
In the section **Cluster New Image** of the app, any image can be uploaded, and the predict() function sorts it into one of the extracted clusters. Finally, four random images of the predicted cluster are shown for comparison by the human eye.
## Interpreting Results with App
With the deployed app, the results of the unsupervised clustering can conveniently be evaluated. The following table shows an interpretation of the results shown in the app. Please consider that the shown images might differ in another run, because they represent a sample of four from all images in the cluster.
:::{style="overflow-x:auto;"}
| Cluster | Cluster_1 | Cluster_2 | Cluster_3 | Cluster_4 | Cluster_5 | Cluster_6 | Cluster_7 | Cluster_8 | Cluster_9 | Cluster_10 |
|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| Flowers | Yellow | Vinous | Unspecific | Daisies (white) | White | Purple | Yellow other blossoms | Pink | Bright-Red | Orange |
| Flowers_predict | Pink | Orange | Unspecific | White/bright-pink | Yellow | Daisy from side | Daisy-ish from above | Yellow-Violet | White-Pink | Bright-Yellow other blossoms |
| Weapons | Whitehouse Camera footage | Weapon, close range in profile | landscape | Pistols from side in profile | Shooting range | Soldiers | Toys and Noise | Movie footage | mil equipment in landscape | camera footage |
| Weapons_predict | Unspecific with movie/camera footage | unspecific | Soldiers with weapon | Soldiers with weapon | mil equipment in landscape | movie footage | Unspecific with toys | Shooting range | Shooting range - aiming | weapon in profile |
: Label-suggestions for Clusters
:::
Regarding both models of the flower images one can speak of success. The dataset description suggested 10 clusters via the labeled species. Comparing the bar plots for flowers and flowers_predict shows that flowers_predict has a very high number of pictures in Cluster_1, while in flowers the images are spread more evenly over the clusters.
For the weapon images, 10 clusters were used as well, but this should be reviewed more thoroughly. The weapons model shows a more evenly spread number of images per cluster. The weapons_predict model shows an evenly spread number of images over clusters 2 to 5 and 7 to 10, but clusters 1 and 6 are rather large and contain a wide variety of images.\
The clusters do not distinguish between pistols and rifles, but they enable the distinction of images with people aiming a rifle, camera footage, groups of people, or other scenic photographs. This can be seen as a success for sorting out irrelevant images before labeling.\
However, in comparison to the relatively clean flower dataset, the weapon dataset was very indistinct. Therefore, more clusters or other feature extraction approaches might be useful in the model. Last but not least, a different number of principal components could be reviewed for both datasets.
# Outlook
This project could be improved or developed further. There is potential in implementing MLflow to track improvements in PCA, k-means or the feature extraction with keras.
- Using an earlier output layer of the pre-trained keras model could result in more distinct features for helmets, pistols, rifles, and similar items.
- Improvements of PCA or k-means might be tracked with MLflow.
- Uploaded images could be stored in a bucket and used for new clustering training (introduce retrain triggers).
- Batch processing for multiple images in the user upload.
- Load image data from a bucket or Google Drive, for example with the googledrive library, which is part of the tidyverse.
Finally, minor technical issues could be addressed. The Shiny app uses selectInput in the UI, which performs poorly with the large number of clusters. A better approach might be the selectizeInput function, which is already used on the server side of the app.\
Currently, the app generates warnings when a set of trained clusters is loaded while pictures are displayed for a cluster that is not included in the loaded model, for example when one result has only 8 clusters and another 10. Those warnings do not crash the app, but should be avoided. Fixing this might also enable more clusters when needed due to tuning.