
Implementation of "AutoML4Clust: Efficient AutoML for Clustering Analyses", published at EDBT 2021.


AutoML4Clust

Repository for the prototypical implementation of AutoML4Clust. It implements different instantiations of our proposed AutoML4Clust approach. To this end, we use different state-of-the-art optimizers from existing AutoML systems and apply them to the unsupervised task of clustering. Furthermore, we implemented meta-learning for clustering to warmstart the optimizers. In this prototype, we focus on k-center algorithms; however, other clustering algorithms, e.g., from other clustering families, can be seamlessly added.

Prerequisites

To use the AutoML4Clust API, you need Python 3.6 and a Linux environment with Ubuntu >= 16.04. The required libraries are listed in the requirements.txt file and can be installed from there.

Optimizers

We used four state-of-the-art optimizers from existing AutoML systems: random search, Bayesian optimization (SMAC), Hyperband, and BOHB.

For warmstarting Hyperband and BOHB, we used the code provided by the BOHB authors.
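To give an intuition for how budget-based optimizers like Hyperband allocate their budget, the following is a rough, self-contained sketch of successive halving (Hyperband's core subroutine) over random candidate values of k. This is an illustration only, not the repository's or the BOHB authors' code; the toy objective and all names are assumptions.

```python
import random

def successive_halving(evaluate, configs, budget_per_round=1, eta=2):
    """Keep the best 1/eta configurations each round, multiplying the
    per-configuration budget by eta, until one configuration remains."""
    budget = budget_per_round
    while len(configs) > 1:
        # Evaluate every surviving configuration with the current budget
        scored = sorted(configs, key=lambda c: evaluate(c, budget))
        # Discard the worse half (for eta=2) and raise the budget
        configs = scored[: max(1, len(configs) // eta)]
        budget *= eta
    return configs[0]

# Toy objective: pretend the best number of clusters is k = 5;
# evaluation noise shrinks as the budget grows (lower score is better)
def evaluate(k, budget):
    return abs(k - 5) + random.random() / budget

random.seed(0)
candidates = random.sample(range(2, 20), 8)
best_k = successive_halving(evaluate, candidates)
```

Cheap, low-budget evaluations weed out poor configurations early, so the full budget is spent only on the most promising candidates.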

Clustering algorithms and metrics

For the clustering algorithms and metrics, we rely on the prominent ML library scikit-learn. To this end, we used the k-Means, Mini-Batch k-Means, GMM, and k-Medoids algorithms, which rely on different objective functions and thus achieve different clustering results. For the k-Medoids algorithm, we used the scikit-learn-extra implementation.

We also used the three internal metrics for clustering that are implemented in scikit-learn, i.e., the Calinski-Harabasz index, the Davies-Bouldin index, and the Silhouette coefficient.
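For reference, the three internal metrics can be computed directly with scikit-learn on any labeling, as in this minimal sketch (the synthetic dataset and k-Means parameters are illustrative choices, not part of the AutoML4Clust API):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic dataset with a known cluster structure
X, _ = make_blobs(n_samples=500, n_features=10, centers=5, random_state=0)

# Cluster the data, then score the resulting labeling with each metric
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

ch = calinski_harabasz_score(X, labels)   # higher is better
db = davies_bouldin_score(X, labels)      # lower is better
sil = silhouette_score(X, labels)         # in [-1, 1], higher is better
```

Note that the metrics disagree on their direction of improvement, which an optimizer has to account for when using them as the objective.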

Simple API Example

The API is very simple to use. In the following, we show an example of how to run the API on a synthetically generated dataset. By default, our API uses the Calinski-Harabasz metric, a budget of n_loops=60, the four above-mentioned clustering algorithms, and a range of (2, n/10) for the hyperparameter k, where n is the number of entities in the dataset. In the Examples directory, you can find examples of how to use another (i) configuration space, (ii) clustering metric, and (iii) budget, and (iv) how to use warmstarting.

```python
from sklearn.datasets import make_blobs

from Optimizer.Optimizer import SMACOptimizer, RandomOptimizer, BOHBOptimizer, HyperBandOptimizer

# Create a synthetic dataset for all examples
X, y = make_blobs(n_samples=1000, n_features=10)

# Optimizers that can be used in our implementation
optimizers = [RandomOptimizer, SMACOptimizer, HyperBandOptimizer, BOHBOptimizer]

# We use Hyperband in this example
optimizer = HyperBandOptimizer

# Instantiate AutoML4Clust on the dataset, run the optimization,
# and retrieve the best found configuration
automl_four_clust_instance = optimizer(dataset=X)
result = automl_four_clust_instance.optimize()
best_configuration = automl_four_clust_instance.get_best_configuration()
```

Thanks to this simple API, AutoML4Clust can easily be integrated into existing analysis pipelines.
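As a hedged sketch of such an integration: once the best configuration is known, a final model can be fitted with plain scikit-learn. The dictionary below stands in for the result of `get_best_configuration()`; its keys and values are assumptions for illustration, and the actual return format depends on the AutoML4Clust implementation.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, n_features=10, random_state=0)

# Hypothetical best configuration found by the optimizer;
# the real keys/format may differ in AutoML4Clust
best_configuration = {"algorithm": "KMeans", "n_clusters": 3}

# Fit the final clustering model with the best found hyperparameter
final_model = KMeans(
    n_clusters=best_configuration["n_clusters"], n_init=10, random_state=0
).fit(X)

# Downstream pipeline steps can now consume the cluster labels
labels = final_model.labels_
```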
