
adding KMeans PyTorch Implementation to cfa model #998

Merged

Conversation


@aadhamm aadhamm commented Apr 2, 2023

Description

  • Provide an implementation of the KMeans clustering algorithm using the PyTorch framework

  • Fixes #956

Changes

  • Bug fix (non-breaking change which fixes an issue)
  • Refactor (non-breaking change which refactors the code base)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Checklist

  • My code follows the pre-commit style and check guidelines of this project.
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing tests pass locally with my changes
  • I have added a summary of my changes to the CHANGELOG (not for minor changes, docs and tests).

@aadhamm aadhamm mentioned this pull request Apr 2, 2023

@samet-akcay samet-akcay left a comment


Thanks for creating this. I wanted to compare the scikit-learn and torch implementations, but had some trouble. Can you share the steps that you mentioned in this link?
#956 (comment)

@@ -78,6 +78,66 @@ def get_feature_extractor(backbone: str, return_nodes: list[str]) -> GraphModule
return feature_extractor


#Kmeans clustering algorithm implementation in PyTorch framework
class KMeans_torch:


Suggested change
class KMeans_torch:
class KMeans:


KMeans could potentially be used by other algorithms in the future. Therefore, it would be good to create a file named src/anomalib/models/components/cluster/kmeans.py and move this implementation there.
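A small sketch of the intent behind that suggestion; the import line below is hypothetical usage of the proposed module path, not existing code:

# src/anomalib/models/components/cluster/kmeans.py would hold the class,
# so other models could reuse it via a shared import:
from anomalib.models.components.cluster.kmeans import KMeans  # hypothetical path

kmeans = KMeans(n_clusters=3, max_iter=10)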

@@ -78,6 +78,66 @@ def get_feature_extractor(backbone: str, return_nodes: list[str]) -> GraphModule
return feature_extractor


#Kmeans clustering algorithm implementation in PyTorch framework
class KMeans_torch:
def __init__(self, n_clusters, max_iter=10):

We try to use type hints as much as possible, which will be useful for our new CLI that automatically handles the types of the variables.

Suggested change
def __init__(self, n_clusters, max_iter=10):
def __init__(self, n_clusters: int, max_iter:int = 10):


#thise line returns labels and centoids of the results,
#alternative to Sklearn's cluster_centers_ & labels_ attributes
return self.cluster_assignments, self.centroids

Would it be possible to return self, similar to the scikit-learn implementation?
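A minimal sketch of the scikit-learn convention being suggested here, using scikit-learn's attribute names; illustrative only, not the final implementation:

def fit(self, X):
    # ... run the k-means iterations, populating self.labels_ and self.cluster_centers_ ...
    return self  # enables chaining, e.g. KMeans(n_clusters=3).fit(X).predict(X)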

@aadhamm commented Apr 4, 2023

Dear Samet,

Thank you for putting such effort into the review and feedback.

Of course, all of your requested changes are possible. I will start working on them and commit the changes ASAP, and I will also provide you with the full code/notebook I used for the comparison between the two implementations.

In the meantime, I am looking to address some of the other TODO issues as well, either the other two issues in the cfa model or issues in other models, if that's possible and appropriate.

Thank you.

@aadhamm commented Apr 10, 2023

Dear @samet-akcay,

Please forgive me for the late reply; I have had several engagements over the last few days.

Here is my full approach to comparing the sklearn and PyTorch KMeans implementations using the silhouette score, a metric that measures the goodness of a clustering technique; its value ranges from -1 to 1.
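For reference, the per-sample silhouette value in the standard formulation (the one scikit-learn's silhouette_score implements) is

s = (b - a) / max(a, b)

where a is the mean distance from a sample to the other points in its own cluster and b is the mean distance to the points in the nearest other cluster; the reported score is the mean of s over all samples.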

It contains the modifications you asked for. This is a preview; I will commit the changes as required, and I may add a running notebook with the following comparison code. If it's not necessary, I will delete it.

Thank you

# Import necessary libraries
import torch
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
# sklearn's KMeans is imported later (just before it is used), so its class
# is not confused with the PyTorch KMeans class defined below

class KMeans:
    def __init__(self, n_clusters: int, max_iter:int = 10):
        
        """
        Initializes the KMeans object.

        Parameters:
            n_clusters: The number of clusters to create.
            max_iter: The maximum number of iterations to run the algorithm for.
        """
        self.n_clusters = n_clusters
        self.max_iter = max_iter

    def fit(self, X):
        """
        Runs the k-means algorithm on input data X.

        Parameters:
            X: A tensor of shape (N, D) containing the input data.
            N is the number of data points 
            D is the dimensionality of the data points.
        """
        N, D = X.shape

        # Initialize centroids randomly from the data points
        centroid_indices = torch.randint(0, N, (self.n_clusters,))
        self.cluster_centers_ = X[centroid_indices]

        # Run the k-means algorithm for max_iter iterations
        for i in range(self.max_iter):
            # Compute the distance between each data point and each centroid
            distances = torch.cdist(X, self.cluster_centers_)

            # Assign each data point to the closest centroid
            self.labels_ = torch.argmin(distances, dim=1)

            # Update the centroids to be the mean of the data points assigned to them
            for j in range(self.n_clusters):
                mask = self.labels_ == j
                if mask.any():
                    self.cluster_centers_[j] = X[mask].mean(dim=0)
                    
        #thise line returns labels and centoids of the results,          
        return self.labels_, self.cluster_centers_

    def predict(self, X):
        """
        Assigns each data point in X to its closest centroid.

        Parameters:
            X: A tensor of shape (N, D) containing the input data.

        Returns:
            A tensor of shape (N,) containing the index of the closest centroid for each data point.
        """
        distances = torch.cdist(X, self.cluster_centers_)
        return torch.argmin(distances, dim=1)



# Generate sample data
X, _ = make_blobs(n_samples=1000, centers=5, n_features=10, random_state=42)

X_torch = torch.tensor(X, dtype=torch.float32)


# Set parameters
n_clusters = 3
max_iter = 3000

# Run KMeans using Torch implementation
kmeans_torch = KMeans(n_clusters=n_clusters, max_iter=max_iter)
kmeans_torch.fit(X_torch)


from sklearn.cluster import KMeans
# Run KMeans using scikit-learn implementation
kmeans_sklearn = KMeans(n_clusters=n_clusters, max_iter=max_iter)
kmeans_sklearn.fit(X)



sklearn_labels = kmeans_sklearn.labels_
pytorch_labels = kmeans_torch.labels_

sklearn_centers = kmeans_sklearn.cluster_centers_
pytorch_centers = kmeans_torch.cluster_centers_



# Print both sets of centroids and labels for a visual comparison
print("Scikit-learn centroids:\n", sklearn_centers)
print("Torch centroids:\n", pytorch_centers)

print("Scikit-learn labels:\n", sklearn_labels)
print("Torch labels:\n", pytorch_labels) 




# Calculate the silhouette score for the scikit-learn labels
silhouette_sklearn = silhouette_score(X, sklearn_labels)

# Calculate the silhouette score for the PyTorch labels
silhouette_torch = silhouette_score(X_torch, pytorch_labels)


# Print the comparison results
print(f"Silhouette score for Sklearn: {silhouette_sklearn}")
print(f"Silhouette score for Pytorch: {silhouette_torch}")

@review-notebook-app

Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks.

@aadhamm commented Apr 24, 2023

@samet-akcay, I wanted to ask whether this TODO task, "TODO: Replace this with the new torchfx feature extractor", has already been done, because I am willing to continue with the rest of them in the same CFA implementation.

As far as I know, create_feature_extractor() is the new fx feature extractor.
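For context, a minimal sketch of how torchvision's FX-based extractor is typically used; the backbone and node names here are arbitrary examples, not the cfa code:

import torch
from torchvision.models import resnet18
from torchvision.models.feature_extraction import create_feature_extractor

# Wrap a backbone so the forward pass returns the requested intermediate nodes
extractor = create_feature_extractor(resnet18(), return_nodes=["layer2", "layer3"])
features = extractor(torch.rand(1, 3, 224, 224))  # dict: node name -> feature tensor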

@aadhamm commented Apr 24, 2023

@samet-akcay, if it's already been done, I may contribute to this task: # TODO: Automatically infer the number of dims.

@samet-akcay

@samet-akcay, I wanted to ask whether this TODO task, "TODO: Replace this with the new torchfx feature extractor", has already been done, because I am willing to continue with the rest of them in the same CFA implementation.

As far as I know, create_feature_extractor() is the new fx feature extractor.

@ashwinvaidya17, can you help here to show how @aadhamm can utilize the new feature extractor for the cfa model?

@ashwinvaidya17

@samet-akcay, I wanted to ask whether this TODO task, "TODO: Replace this with the new torchfx feature extractor", has already been done, because I am willing to continue with the rest of them in the same CFA implementation.

As far as I know, create_feature_extractor() is the new fx feature extractor.

You can have a look here to see how it is done. You can adapt your code to call this class.

self.feature_extractor = TorchFXFeatureExtractor(
    backbone="efficientnet_b5", weights=EfficientNet_B5_Weights.DEFAULT, return_nodes=["features.6.8"]
)

If you still have any questions, feel free to ask.
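To fill in the surrounding pieces, a sketch of how that snippet might be exercised; the TorchFXFeatureExtractor import path, the input size, and the dict-shaped output are assumptions about the anomalib version in use:

import torch
from torchvision.models import EfficientNet_B5_Weights
from anomalib.models.components import TorchFXFeatureExtractor  # assumed import path

feature_extractor = TorchFXFeatureExtractor(
    backbone="efficientnet_b5", weights=EfficientNet_B5_Weights.DEFAULT, return_nodes=["features.6.8"]
)
features = feature_extractor(torch.rand(1, 3, 256, 256))  # assumed: dict keyed by node name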


@ashwinvaidya17 ashwinvaidya17 left a comment


Thanks for the efforts. I have a few comments.

  • Can you remove the Jupyter notebook from the components folder? If you want to confirm the outputs, you can move this code into a file and turn it into a unit test.
  • Can you also address the issues raised by codacy?
  • The Sphinx parser is configured to follow the Google docstring format. Can you update the docstrings to conform to this format (a sketch follows after this list)? Here is an example: https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html
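For instance, the __init__ docstring from the KMeans class above might look like this in Google style; this is a sketch of the format only, not the final wording:

def __init__(self, n_clusters: int, max_iter: int = 10) -> None:
    """Initialize the KMeans object.

    Args:
        n_clusters (int): Number of clusters to create.
        max_iter (int, optional): Maximum number of iterations to run the algorithm for. Defaults to 10.
    """
    self.n_clusters = n_clusters
    self.max_iter = max_iter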

Comment on lines 21 to 24
Parameters:
X: A tensor of shape (N, D) containing the input data.
N is the number of data points
D is the dimensionality of the data points.

Are you sure these are the right parameters?


According to more than one source, N is the number of samples and D is the number of features. D is used to initialize the centroids randomly with the same dimensionality as the input data, but it is not actually used afterward in the code, which created issues with the codacy tests.

I resolved this later using your suggestion.

K-means clustering - PyTorch API
Fast Pytorch Kmeans

Comment on lines 10 to 12
Parameters:
n_clusters: The number of clusters to create.
max_iter: The maximum number of iterations to run the algorithm for.

We follow the Google docstring format, so this should be Args. Can you also add the types of the variables to the docstrings as well?

N is the number of data points
D is the dimensionality of the data points.
"""
N, D = X.shape

Maybe batch_size, _ = inputs.shape?

mask = self.labels_ == j
if mask.any():
self.cluster_centers_[j] = X[mask].mean(dim=0)
#thise line returns labels and centoids of the results,

Small typo: "this line".

"""
Assigns each data point in X to its closest centroid.

Parameters:

This should be Args as well

@aadhamm commented Apr 24, 2023

@ashwinvaidya17 Thank you for your detailed review. I will work on all the required changes and update you as soon as possible. And yes, you are right, there are variables that could be reduced, which will also help with the codacy tests.

@aadhamm commented Apr 29, 2023

@samet-akcay @ashwinvaidya17 Hello everyone, I have made the required changes and addressed the issues raised by codacy.

If you could take a look and give me your feedback, I would appreciate it a lot.

Thank you.

@aadhamm commented Aug 30, 2023

@samet-akcay @ashwinvaidya17 Hello everyone, I have made the required changes and all the checks have passed. Could you take a look and see whether it is ready to merge?

Thank you.


@ashwinvaidya17 ashwinvaidya17 left a comment


Sorry for not getting back to this sooner. Thanks for all the efforts.

@ashwinvaidya17 ashwinvaidya17 enabled auto-merge (squash) September 11, 2023 12:02
@ashwinvaidya17 ashwinvaidya17 merged commit ed4d1a1 into openvinotoolkit:main Sep 11, 2023
7 checks passed