---
title: "Image Clustering in R"
subtitle: "with Keras, PCA and k-means"
author: "Markus Armbrecht"
date: "`r Sys.Date()`"
format:
  html:
    page-layout: article
    toc: true
    link-external-newwindow: true
    toc-title: "Image Clustering in R"
    toc-location: left
editor: source
callout-appearance: simple
execute:
  warning: false
---
```{r libs}
#| include: false
library(reticulate)
library(keras)
library(tensorflow)
library(tidyverse)
library(tidymodels)
library(tidyclust)
library(recipes)
library(jpeg)
library(butcher)
library(ggplot2)
#library(factoextra)
#library(cluster)
```
```{r helper-functions}
#| include: false
# Helper function to use Sys.time() in filenames on Windows and Mac.
# Windows cannot handle ":" in filenames.
get_time <- function(){
  myTime <- Sys.time()
  myTime <- gsub(":", "_", myTime)
  myTime <- gsub(" ", "_", myTime)
  myTime <- gsub("-", "_", myTime)
  return(myTime)
}
```
# Introduction
The need for self-labeled image data became obvious while I was researching a project for my master's thesis. I was looking for a way to automatically sort images by their content and found a project by Gabo Flomo [(towardsdatascience.com: 2020)](https://towardsdatascience.com/how-to-cluster-images-based-on-visual-similarity-cd6e7209fe34), in which flower images are clustered by similarity.
In this R learning project the Python code from Gabo Flomo is used as a blueprint. It is translated to R and incorporated into a Data Science Life Cycle.
:::: column-margin
::: {.callout-note}
## Link
[GitHub Repo](https://github.com/TheArmbreaker/clustering-with-keras){target="_blank"}
:::
::::
The following wording is used to support easier reading and unique notation. When speaking of the flower-images dataset originally used by Gabo Flomo, the phrase *flower data* is used. For the later introduced dataset of weapons in images, the phrase *weapons data* is used. Additionally, the resulting clusters and models are named *flower model* and *weapon model*. This is especially important because two different clustering approaches will be compared, and for both the flower model and the weapon model there will be a prediction model. These prediction models are named *flower prediction model* and *weapon prediction model*.
Overview of names for **flower** and **weapon**
- *data* refers to the dataset.
- *model* refers to the model of the clustering approach in Base R.
- *prediction model* refers to the model of the clustering approach in tidyclust with a recipe.
## Datasets
The images can be found here:
- Weapons in Images on [kaggle.com](https://www.kaggle.com/datasets/jubaerad/weapons-in-images-segmented-videos){target="_blank"}
- Flower Color Images on [kaggle.com](https://www.kaggle.com/datasets/olgabelitskaya/flower-color-images){target="_blank"}
To execute the code in a locally run Shiny web app, the data should be copied into the folders "flowers" and "weapons" of the R project folder. From the dataset downloads, the subfolders *flower_images* and *Weapons-in-Images* are required.
## Environment
From experience in past projects it is known that my daily working computer cannot perform image processing in Keras without crashing. Therefore, solutions like Amazon SageMaker Studio Lab, Databricks Community Edition and Amazon SageMaker (without Studio Lab) were explored.
While SageMaker Studio Lab only supported Python, the Amazon SageMaker solution was promising, despite generating a small amount of cost within the free usage limits of AWS. Unfortunately, loading a pre-trained Keras model within a JupyterLab environment on Amazon SageMaker returned twisted input shapes, and the code therefore crashed when trying to feed in images. Interestingly, R instances on local machines provide a model with correct input shapes. Thus, a Stack Overflow question was posted to clarify whether this is a bug. Please follow the links in the margin for more details.
:::: column-margin
::: {.callout-note}
## Stackoverflow
[1. my Question and my Answer](https://stackoverflow.com/questions/75988301/vgg16-different-shape-between-r-and-python-how-to-deal-with-that){target="_blank"}\
[2. Example Images InputLayer](https://imgur.com/a/5dOaJWf){target="_blank"}
:::
::::
The Databricks environment was registered via the Community Edition link. However, it generated massive AWS costs within one night, so that approach had to be canceled.
Finally, this project is implemented on a Windows gaming PC for modeling activities and a MacBook for writing code and text. This setup is supported by a GitHub repository to exchange files and results. Files that exceed GitHub's size limit are exchanged via Google Drive.
The following code chunk sets the Python environment to be used by the Keras and TensorFlow functions in R.
```{r prepare-env}
use_condaenv("condatascience") # environment on my Windows Machine
myPy <- py_config()$python
myPy
```
The following code confirms successful loading of Tensorflow.
```{r confirm-tf}
tf$constant("Hello Tensorflow!")
```
# Plan
## Use Case
The thesis subject is going to be object detection on military personnel. For such topics almost no dataset is publicly available. Thus, images from datasets have to be examined and labeled manually.\
The vast number and variety of unlabeled images should not be evaluated picture by picture. Instead, unsupervised learning with clustering shall be used to support the understanding of the image contents. This might help to sort out irrelevant images before labeling and also to create image categories.
## Procedure
The flower data is used to write working code and navigate differences between the coding languages. In an early prototype, Base R code was used for clustering; it was later transferred to tidyclust and recipes to generate a model that can predict on new images. However, during this transformation, differences in the results were noticed, and both code versions and their results were kept for evaluation.
The flower and weapon data hold information in the form of features which have to be extracted before clustering. This task is done by the pre-trained imagenet VGG16 model available in the keras library. In accordance with the flower-data blueprint from Gabo Flomo, the final output layer is bypassed so that the model returns image features instead of class probabilities. Those features are forwarded to PCA and k-means clustering to support the user's activities of understanding the data contents. All resulting clusters and models are deployed in a Shiny web app. It would have been possible to evaluate all models before deployment, but the user shall decide which clusters are better and therefore needs access to all results.
![Procedure Image](procedure_example.png){style="display: flex; justify-content: center;text-align:center;"}
## Anticipated Outcomes
The anticipated outcome of exploratory activities on the flower data is separate clusters for yellow and white flowers. Regarding the weapon data, the clusters might separate images of pistols and rifles, although that would probably require a large amount of tinkering. More realistic results are clusters of close-up weapons, people with weapons, weapons in front of a landscape, and camera angles.
The flower data suggests 10 different clusters through the named species in the dataset description. For the weapon data the number of clusters is unknown and will be selected based on an elbow curve.
The following example images show that all flower images are taken at close range, while the weapon data contains a large variety of objects, like soldiers, toys, persons aiming, persons not aiming, vehicles and landscapes.
::: {#fig-flowers layout-nrow="2" style="text-align:center;"}
![Example 1](flowers/0001.png)
![Example 2](flowers/0009.png)
![Example 3](flowers/0002.png)
![Example 4](flowers/0003.png)
Example Images Flowers
:::
::: {#fig-weapons style="text-align:center;" layout="[[1,1], [1],[1]]"}
![Example 1](weapons/ee619b2e6f861f1e.jpg){width="224"}
![Example 2](weapons/e9552b04f79630ee.jpg){width="224"}
![Example 3](weapons/beef33684f8a177e.jpg){width="480"}
![Example 4](weapons/c1ac27d27c37d962.jpg){width="480"}
Example Images Weapons
:::
# Data
## Data Ingestion
To prevent conflicts between the loaded libraries, the above-mentioned Base R approach is provided in a separate document, which is linked in the margin column on the right. This document focuses on the approach with tidyverse, tidyclust and recipes.
:::: column-margin
::: {.callout-note}
## Link
[Base R Code](clustering_code.html){target="_blank"}
:::
::::
The following approach results in a pipeline that can be used for deployment and prediction. The Base R code does not use a pipeline or a prediction function and only provides the cluster results.
The code below enables switching between the flower and weapon data without changing other code chunks later in the process.
```{r}
# Switch to toggle between the prototype flower images and the weapon images to be clustered
flower_data <- FALSE
if (flower_data){
  myPath <- "flowers"
  print("Flower Data loaded.")
} else {
  myPath <- "weapons"
  print("Weapon Data loaded.")
}
# Actual loading of the file names in the path into a list
myFiles <- list.files(myPath)
```
## Data Split
The data will not be split for this unsupervised model training with subjective evaluation by the user.
Splitting the data at any point in this project would prevent overfitting in the model. However, in this rare use case and for the given data, such overfitting is wanted to obtain maximally distinct clusters. It is, however, not appropriate for the prediction on new images.
By *any point* it is meant that splits are performed neither on images, nor on feature arrays, nor on principal components.
## Data Analysis
Before the features are extracted, a feeling for the files themselves shall be developed. Therefore, the dimensions of each image are plotted. Images with less than 840px in each dimension are labeled as small, other images as large.
The following plot shows width and height in an xy-plot with colored labels. The dashed black lines show the median value of the respective dimension. The blue line shows a ratio of 1 between x and y.\
The plot shows that many pictures have 1024px on at least one dimension and a smaller size on the other. The median value of x is larger than 1024, which is obscured by overlapping points in the plot: compared to all pictures, a high number of pictures has a width of 1280px.
```{r data-exploration}
#| cache: true
get_num_pixels <- function(filepath) {
  img <- readJPEG(filepath)
  width <- dim(img)[2]
  height <- dim(img)[1]
  return(list(width = width, height = height))
}
width_size <- list()
height_size <- list()
size_cat <- list()
for (file in myFiles) {
  path <- paste(myPath, "/", file, sep = "")
  dimensions <- get_num_pixels(path)
  width <- dimensions$width
  height <- dimensions$height
  if (width <= 840 && height <= 840) {
    size_cat <- c(size_cat, 'small')
  } else {
    size_cat <- c(size_cat, 'large')
  }
  width_size <- c(width_size, width)
  height_size <- c(height_size, height)
}
myDF <- tibble(x = as.numeric(width_size), y = as.numeric(height_size), z = as.character(size_cat))
ggplot(myDF, aes(x, y, colour = z)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, colour = "blue") +
  geom_vline(xintercept = median(myDF$x), linetype = "dashed", colour = 'black', linewidth = 1) +
  geom_hline(yintercept = median(myDF$y), linetype = "dashed", colour = 'black', linewidth = 1) +
  labs(x = "Width", y = "Height") +
  xlim(0, 1300) +
  ylim(0, 1300)
```
The following statements support the discussed overlap for images having a width of exactly 1280px: there are more than two thousand images with width 1280. Note that only the x-axis is counted, because a fixed width of 1280 still comes with different heights.
```{r file-size-counts}
myDF |> count(x) |> arrange(desc(n)) |> top_n(3,n)
```
Finally, a quick overview of the summary statistics. These confirm the plotted median values and also show the mean being a bit smaller than the median.
```{r file-size-summary}
summary(myDF)
```
## Feature Extraction
To extract features for clustering, the following code chunk uses keras functions to load an image, transform it into an array suitable for the pre-trained model, and run a prediction for that image. The code returns the features of the image.
```{r feature-extract}
get_features <- function(image_file, myModel){
  # takes: image file and model
  # returns: feature vector for the provided image
  # description:
  # the image is loaded in color and scaled to 224x224 px.
  # The image is then converted to an array, and this array is
  # prepared for prediction with the provided model.
  img <- image_load(image_file, grayscale = FALSE, target_size = c(224,224))
  img_array <- image_to_array(img)
  reshaped_image_array <- array_reshape(img_array, c(1, dim(img_array)))
  prepro_img <- imagenet_preprocess_input(reshaped_image_array)
  features <- myModel |> predict(prepro_img)
  return(features)
}
```
:::: column-margin
::: {.callout-tip}
## Tip keras
Note that in R the function is image_load(), while in Python it is load_img().
:::
::::
In the code chunk above the pre-trained model is provided as an argument. This enables changing the model for later improvement and for testing other models; target_size has to be adjusted manually or programmed as an additional argument. The following code chunk loads the pre-trained model VGG16 from the keras library. Note that the output layer is bypassed, so the second-to-last layer provides features as a tensor instead of a classification result.
```{r load-keras-vgg16}
# Load model
model <- application_vgg16(weights="imagenet",include_top=TRUE)
# Change Output to second last layer to access the feature map instead of classification result.
output <- model$layers[[length(model$layers)-1]]$output
model <- keras_model(inputs=model$input, outputs=output)
```
In the following code chunk every image file is processed with the function described above to populate a list of features, which is then transformed into a matrix with 4096 features per row.
```{r extract-features}
#| cache: true
# empty list to hold extracted features per image
myFeatures <- list()
# fill the list with features by looping through the file list and calling get_features
for (img_file in myFiles){
  # create path and call function
  path <- paste(myPath, "/", img_file, sep = "")
  feat <- get_features(path, model)
  # append list with features
  len <- length(myFeatures)
  myFeatures[[len + 1]] <- feat
}
myFeatArray <- matrix(unlist(myFeatures), ncol = 4096, byrow = TRUE)
dim(myFeatArray)
```
Finally, the extracted features are transformed into a dataframe.
```{r prepare-features}
# transform the myFiles list into a single-column dataframe
file_names <- as.data.frame(myFiles)
# create a dataframe with one feature vector per row
file_features <- as.data.frame(myFeatArray)
# append both dataframes into one (no join due to a missing key column)
df_data <- bind_cols(file_names, file_features)
```
## Feature Engineering
The 4096 features per image are reduced with principal component analysis. The blueprint value of 100 components is used and might be adjusted for model improvements.
The code below creates a recipe with a single step_pca(). Normalisation was not applied, because some feature vectors sum to zero and normalisation would therefore divide by zero. Omitting normalisation is supported by the logic of image representation in arrays, which already uses a common scale for the RGB values.
```{r pca-code}
#| cache: true
# Create recipe without a label column via the tilde symbol
img_recipe <- recipe(~ ., data = file_features) |>
  step_pca(all_numeric(), num_comp = 100)
```
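The division-by-zero issue mentioned above can be reproduced in base R: standardising a constant (zero-variance) column, as a normalisation step would, divides by a standard deviation of zero. This toy sketch is for illustration only and is not part of the pipeline:

```{r toy-normalisation}
# a constant feature column, like VGG16 units that never activate
constant_col <- c(0, 0, 0, 0)
# centering and scaling computes (x - mean) / sd, and sd is 0 here
scaled <- scale(constant_col)
all(is.nan(scaled))  # TRUE: 0 / 0 is undefined
# PCA itself only needs centering, so it handles such columns fine
pca <- prcomp(cbind(constant_col, c(1, 3, 2, 4)), scale. = FALSE)
```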
# Model
## Optimal K
Before clustering on the PCA results, the number of clusters has to be set. There is an old rule of thumb, the square-root rule, for estimating the number of clusters. It is not very precise for the large datasets data scientists observe today, but it can be used to set a range for the elbow curve. Thus, this value could be used as the maximum for the following tuning function.
```{r old-optimal-k}
sqrt(nrow(df_data))
```
The tidymodels and tidyclust libraries provide a tuning function that works with resampling methods like bootstrapping or CV folds.\
CV folds are not used, in order to achieve the wanted overfit on the training data. Bootstrapping is used as the resampling method. It generates a sample the size of the original data by drawing observations with replacement; the times argument decides how often this happens. However, using the standard settings for bootstrapping requires a very long runtime. Therefore, values of *times* that are not necessarily optimal are used for this project.
Furthermore, the workflow() function from the workflows library is introduced, which can be used instead of manual preparation and baking. The recipe and the model are combined in this workflow and provided to tune_cluster().
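The bootstrap mechanics can be illustrated with base R alone: one resample is as large as the original data but is drawn with replacement, so some rows repeat while others stay out-of-bag. Toy row indices, for illustration only:

```{r toy-bootstrap}
set.seed(1)
n <- 10
# one bootstrap resample: n draws from the row indices 1..n, with replacement
boot_idx <- sample(n, size = n, replace = TRUE)
length(boot_idx) == n          # same size as the original data
length(unique(boot_idx)) < n   # but with duplicates, leaving some rows out-of-bag
```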
```{r kmean-tuning}
#| cache: true
# generate k_means model specification with tuning for optimal k
optK <- k_means() |>
  set_args(num_clusters = tune())
# generate workflow with preprocessing recipe and model
optK_wf <- workflow(img_recipe, optK)
# set bootstraps or cv-folds
myBoots <- bootstraps(df_data, times = 1)
# myCVfolds <- vfold_cv(df_data, v = 5)
start <- Sys.time()
# execute cluster tuning and store results
tune_res <- tune_cluster(
  optK_wf,
  resamples = myBoots,
  metrics = cluster_metric_set(sse_within_total, sse_total, sse_ratio),
  grid = expand.grid(num_clusters = 1:20)
)
end <- Sys.time()
# retrieve tuning results
myMetrics <- collect_metrics(tune_res)
print(end - start) # no additional text required for output
```
:::: column-margin
::: {.callout-note}
## Info
tune_cluster() with cvfold and bootstrapping is shown for completeness of the R learning project.
:::
::::
```{r kmean-elbow}
myMetrics |>
  filter(.metric == "sse_within_total") |>
  ggplot(aes(x = num_clusters, y = mean)) +
  geom_point() +
  geom_line() +
  theme_minimal() +
  ylab("mean WSS") +
  xlab("Number of clusters") +
  scale_x_continuous(limits = c(0, 20))
```
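For comparison, the elbow logic can also be sketched without the tidyclust stack, using base R's kmeans() on toy data (two artificial blobs; purely illustrative, not part of the project pipeline):

```{r toy-elbow}
set.seed(42)
# toy data: two well-separated 2-D blobs
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2))
# total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)
# the drop from k = 1 to k = 2 dominates, i.e. the elbow sits at k = 2
plot(1:6, wss, type = "b", xlab = "Number of clusters", ylab = "WSS")
```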
## K-Means - Fit
After the optimal k has been established, the model function can be provided with the number of clusters to generate. It is placed into a new workflow, which is then used for the fit.
```{r kmean-fit}
#| cache: true
# generate k_means model specification with the optimal k value
km <- k_means() |>
  set_args(num_clusters = 10)
# generate workflow with preprocessing recipe and model
km_wf <- workflow(img_recipe, km)
# fit the model to the data
km_fit <- fit(km_wf, data = df_data)
```
# Deployment
## Extraction
Usually model accuracy is reviewed before deployment, but this model is meant to support the exploration of a large number of images. Therefore, the model results are reviewed and discussed in the deployment chapter of this work.
Before the review, the cluster assignments are mapped to the image files. Additionally, the result table is stored in a CSV file. This enables the user of the deployed app to look at a cluster's contents.
```{r extract-cluster}
# string for the file path
myfilepath <- paste("recipe_", myPath, "_", get_time(), "_cluster", sep = "")
# extract the cluster assignments and bind them to the filename dataframe
df_clustered <- bind_cols(file_names, extract_cluster_assignment(km_fit))
# store results in a csv file
write_csv(df_clustered, paste(myfilepath, ".csv", sep = ""))
# return results
df_clustered |>
  head(3)
```
The model shall also be used for clustering new images with the predict() function. Therefore, the model is stored in an RDS file.
```{r store-model}
saveRDS(km_fit, paste(myfilepath,".rds",sep=""))
```
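The persistence mechanism itself is a plain serialisation round trip; a toy sketch with a hypothetical object and a temporary file (not the real km_fit) shows that readRDS() restores the object unchanged:

```{r toy-rds}
# any R object can be serialised to RDS and restored lossless
toy_model <- list(centers = matrix(1:4, ncol = 2), k = 2)
path <- tempfile(fileext = ".rds")
saveRDS(toy_model, path)
restored <- readRDS(path)
identical(restored, toy_model)  # TRUE: the round trip is lossless
```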
The stored model has a very large file size, especially for the weapons model. This file size conflicts with GitHub's push limits. Other options like h5 files and deployment with APIs were investigated: the h5 files did not work properly, and the API solution with vetiver did not ease the file-size problem.\
However, the butcher library can be used to reduce the fitted model to its essentials. The following code chunks show that vital parts of the model reside in *pre...recipe.steps.res.rotation* with 135 MiB. Unfortunately, given their size, those components are not removed by butcher. Therefore, the saved RDS file is exchanged via Google Drive.
:::: column-margin
::: {.callout-note}
## RDS Files
[GoogleDrive](https://drive.google.com/drive/folders/1LQ2Ixx4rwxcAW8TQPocOXQnUKcDCvBSE?usp=sharing){target="_blank"}
:::
::::
```{r butcher-model-one}
weigh(km_fit)
```
```{r butcher-model-two}
stripped_model <- butcher(km_fit)
weigh(stripped_model)
```
## Shiny Web App
This part covers the deployment as a Shiny app. For details on the code behind the app, please click the User Guide link in the right margin and also see the comments in the source code of *app.R*. Additionally, the app is deployed on shinyapps.io, where the User Guide can also be accessed.
:::: column-margin
::: {.callout-note}
## Shinyapps.io
[myApp](https://thearmbreaker.shinyapps.io/clustering-with-keras/){target="_blank"}\
[myApp: UserGuide](TecDoc_Shiny.html){target="_blank"}
:::
::::
The following code can be used for the deployment. To start, a list of relevant files is created to avoid bundling and publishing unnecessary files.
```{r create-bundle}
myBundle <- c()
for (i in list.files()) {
  file_format <- strsplit(i, "\\.")[[1]][2]
  if (file_format %in% c("rds", "csv", "html", "R") |
      (is.na(file_format) & (i != "old_results" & i != "rsconnect"))) {
    myBundle <- c(myBundle, i)
  }
}
myBundle
```
```{r shinyapps-deployment}
#| eval: false
# install.packages('rsconnect')
library(rsconnect)
# rsconnect::setAccountInfo(name='thearmbreaker', token='TOKEN', secret='SECRET')
deployApp(
appFiles = myBundle,
appName = "clustering-with-keras",
upload = TRUE,
logLevel = "verbose"
)
#configureApp("app_ImageClustering",size="xlarge")
```
## Description of App
The deployed app was recorded to preserve the results in case of crashes. A YouTube video is embedded in case the upload to Moodle has a size limit.
::: {style="display: flex; justify-content: center;"}
{{< video https://youtu.be/TlOUXbdlIDg width='480' height='360' >}}
:::
In the section **Show Clusters** of the app, a bar graph shows the number of images per cluster for one of the four models. To evaluate the results (similarity of images in clusters) in a heuristic manner, four random images are displayed when a specific cluster is selected.\
The app shows an example of naming the clusters with a text input field. The required code is not implemented, but it might look like the snippet below, which writes the names to an SQL table.
```{r sql-example}
#| eval: false
library(DBI)
cluster_name <- input$Textfield
selected_cluster <- input$var_clus
myDataframe <- tibble(cluster_names=c(cluster_name), clusters=c(selected_cluster))
dbWriteTable(connection,cluster_table,myDataframe, append=TRUE)
```
In the section **Cluster New Image** of the app, any image can be uploaded, and the predict() function sorts it into one of the extracted clusters. Finally, four random images of the predicted cluster are shown for comparison by the human eye.
## Interpreting Results with App
With the deployed app, the results of the unsupervised clustering can conveniently be evaluated. The following table shows an interpretation of the results shown in the app. Please consider that the shown images might differ in another run, because they represent a sample of four from all images in the cluster.
:::{style="overflow-x:auto;"}
| Cluster | Cluster_1 | Cluster_2 | Cluster_3 | Cluster_4 | Cluster_5 | Cluster_6 | Cluster_7 | Cluster_8 | Cluster_9 | Cluster_10 |
|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| Flowers | Yellow | Vinous | Unspecific | Daisies (white) | White | Purple | Yellow other blossoms | Pink | Bright-Red | Orange |
| Flowers_predict | Pink | Orange | Unspecific | White/bright-pink | Yellow | Daisy from side | Daisy-ish from above | Yellow-Violet | White-Pink | Bright-Yellow other blossoms |
| Weapons | Whitehouse Camera footage | Weapon, close range in profile | landscape | Pistols from side in profile | Shooting range | Soldiers | Toys and Noise | Movie footage | mil equipment in landscape | camera footage |
| Weapons_predict | Unspecific with movie/camera footage | unspecific | Soldiers with weapon | Soldiers with weapon | mil equipment in landscape | movie footage | Unspecific with toys | Shooting range | Shooting range - aiming | weapon in profile |
: Label-suggestions for Clusters
:::
Regarding both models of the flower images one can speak of success. The dataset description suggested 10 clusters via the labeled species. Comparing the bar plots for flowers and flowers_predict shows that flowers_predict has a very high number of pictures in Cluster_1, while in flowers the images are spread more evenly over the clusters.
For the weapon images, 10 clusters were used as well, but this should be reviewed more thoroughly. The weapons model shows a more evenly spread number of images per cluster. The weapons_predict model shows an evenly spread number of images over clusters 2 to 5 and 7 to 10, but clusters 1 and 6 are rather large and contain a wide variety of images.\
The clusters do not distinguish between pistols and rifles, but they enable the distinction of images with people aiming a rifle, camera footage, groups of people, or other scenic photographs. This can be seen as a success for sorting out irrelevant images before labeling.\
However, in comparison to the relatively clean flower dataset, the weapon dataset was very indistinct. Therefore, more clusters or other feature extraction approaches might be useful in the model. Last but not least, a different number of principal components could be reviewed for both datasets.
# Outlook
This project could be improved or developed further. There is potential in implementing MLflow to track improvements in PCA, k-means or the feature extraction with keras.
- Using an earlier output layer of the pre-trained keras model could result in more distinct features for helmets, pistols, rifles, and similar items.
- Improvements of PCA or k-means might be tracked with MLflow.
- Uploaded images could be stored in a bucket and used for new clustering training (introduce retrain triggers).
- Batch processing for multiple images in the user upload.
- Load image data from a bucket or Google Drive, for example with the googledrive library, which is part of the tidyverse.
Finally, minor technical issues could be addressed. The Shiny app uses selectInput in the UI, which performs poorly with the large number of clusters. A better approach might be the selectizeInput function, which is already used on the server side of the app.\
Currently, the app generates warnings when a set of trained clusters is loaded while pictures are displayed for a cluster that is not included in the loaded model, for example when one result has only 8 clusters and another 10. Those warnings do not crash the app, but should be avoided. Fixing this might also enable more clusters when needed due to tuning.