hacetin/kmeans-and-agglomerative
This repo includes the code I used in a Machine Learning course for K-Means and Agglomerative Clustering. To see results quickly, use sample2.jpg instead of sample.jpg.

Results

| K  | K-Means Clustering     | Agglomerative Clustering |
|----|------------------------|--------------------------|
| 2  | kmeans_2_clusters.png  | agg_2_clusters.png       |
| 4  | kmeans_4_clusters.png  | agg_4_clusters.png       |
| 8  | kmeans_8_clusters.png  | agg_8_clusters.png       |
| 16 | kmeans_16_clusters.png | agg_16_clusters.png      |

Original image: sample.jpg

K-Means Clustering (part1.py)

Implementation

Here are the steps of the k-means algorithm (a minimal sketch follows this list):

  1. Find the mean and the standard deviation (std) of the given data (im2 in the code).
  2. Generate K random points. For each random point r, choose a random initial centroid as r * std + mean.
  3. Compute the distances between all data points and all cluster centroids, and assign each point to the cluster with the closest centroid.
  4. Using the labeled points, compute the mean of each cluster. Keep the old centroids in a temporary variable and assign these new means to the centroid list.
  5. Calculate the error as the norm of the difference vector between the new and old centroids. If the error is zero, stop and return the label array and the centroid array; otherwise, go back to step 3.
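A minimal NumPy sketch of these steps, assuming the data arrives as rows of an (n, d) array and that the random points r in step 2 are standard-normal draws (function and variable names here are illustrative, not taken from part1.py):

```python
import numpy as np

def kmeans(data, k, rng=np.random.default_rng(0)):
    # Step 1: mean and std of the data (rows are points)
    mean, std = data.mean(axis=0), data.std(axis=0)
    # Step 2: K random initial centroids as r * std + mean
    centroids = rng.standard_normal((k, data.shape[1])) * std + mean
    while True:
        # Step 3: assign each point to its closest centroid
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: keep old centroids, recompute each centroid as its cluster mean
        old = centroids.copy()
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = data[labels == j].mean(axis=0)
        # Step 5: stop when the centroids no longer move
        if np.linalg.norm(centroids - old) == 0:
            return labels, centroids
```

For the image data, im2 would be reshaped to (pixels, channels) before being passed in, e.g. `labels, centroids = kmeans(im2.reshape(-1, 3).astype(float), k=4)`.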

Agglomerative Clustering (part2.py)

Implementation

This is centroid-linkage agglomerative clustering: for each cluster it stores only the centroid and the number of items. Here are the steps of the algorithm (a sketch follows this list):

  1. Define each data point as its own cluster, with the point itself as the centroid, an item count of 1, and a label array mapping each point to its index in the centroid array.
  2. Find the closest two centroids (minimum distance, maximum similarity).
  3. Merge the higher-indexed cluster into the lower-indexed one: combine the centroids (averaging, weighted by item counts) and sum the item counts.
  4. In the label array, replace all labels of the higher-indexed cluster with the label of the lower-indexed cluster.
  5. If the length of the centroid array equals K, stop and return the label array and the centroid array; otherwise, go back to step 2.
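A sketch of this merge loop under the same assumptions; the brute-force distance matrix and the count-weighted centroid merge are one reading of steps 2 and 3, and the names are illustrative rather than from part2.py:

```python
import numpy as np

def agglomerative(data, k):
    # Step 1: every point starts as its own cluster and centroid
    centroids = data.astype(float).copy()
    counts = np.ones(len(data), dtype=int)
    labels = np.arange(len(data))            # labels[i] = index into centroids
    while len(centroids) > k:                # Step 5: stop at K clusters
        # Step 2: closest pair of centroids (brute-force distance matrix)
        d = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=2)
        np.fill_diagonal(d, np.inf)
        i, j = np.unravel_index(d.argmin(), d.shape)
        lo, hi = min(i, j), max(i, j)
        # Step 3: merge the higher-indexed cluster into the lower-indexed one
        total = counts[lo] + counts[hi]
        centroids[lo] = (centroids[lo] * counts[lo] + centroids[hi] * counts[hi]) / total
        counts[lo] = total
        # Step 4: relabel the merged points, then drop the removed cluster
        labels[labels == hi] = lo
        labels[labels > hi] -= 1             # indices above hi shift down by one
        centroids = np.delete(centroids, hi, axis=0)
        counts = np.delete(counts, hi)
    return labels, centroids
```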

Speeding Up the Algorithm

To speed up the algorithm, it does not use all data points from the beginning. Instead, it processes a fixed-size batch of points at each iteration and runs the agglomerative algorithm until the batch collapses to K clusters; I use a batch size of 100 in this project. This acts like a mini-batch size: a large batch slows down each agglomerative pass, while a small batch increases the number of iterations. Here are the steps of the speed-up (a sketch follows this list):

  1. Run the agglomerative clustering algorithm on the first 100 data points until it produces K clusters.
  2. Keeping the cluster information from the previous step, run the agglomerative clustering algorithm on (100 - K) new data points together with the existing K centroids.
  3. If the data point array is exhausted, stop; otherwise, go back to step 2.
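A sketch of the batching loop, reusing the agglomerative function from the previous sketch. Seeding each round with the bare K centroids is a simplification of what the text describes (it drops the item counts and per-point labels that the real implementation carries over), and batch=100 matches the text:

```python
import numpy as np

def batched_agglomerative(data, k, batch=100):
    # Step 1: first round on the first `batch` points
    _, centroids = agglomerative(data[:batch], k)
    pos = batch
    # Step 2: feed (batch - k) fresh points plus the K existing centroids
    while pos < len(data):                   # Step 3: stop when data is exhausted
        chunk = data[pos:pos + (batch - k)]
        pos += len(chunk)
        _, centroids = agglomerative(np.vstack([centroids, chunk]), k)
    return centroids
```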
