(Year 1 Data Science (HVE) course assignment, 2022)
(Refactoring in progress: replacing outdated logic to improve efficiency, breaking up the god object, removing inefficient loops in data processing, improving modularity, etc.)
- Takes in pre-processed data (no NaNs, encoded);
- Scales the data;
- Builds a 2D UMAP embedding;
- Tunes DBSCAN and agglomerative clustering hyperparameters (for-loops);
- Runs DBSCAN and agglomerative clustering on the data and appends the resulting cluster labels to the original dataframe;
- Plots the results in a basic Plotly Dash dashboard:
  - 3 exploratory scatterplots (UMAP embedding of the data, DBSCAN clustering results on the embedding, agglomerative clustering results on the embedding), with clustering evaluation metrics displayed (Silhouette, Davies-Bouldin, Calinski-Harabasz)
  - 2 callback-driven plots, a bar chart and a donut chart, showing the feature distribution for the chosen algorithm and the chosen cluster
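
The scaling, for-loop tuning, and evaluation steps listed above could look roughly like the sketch below. It is not the project's actual code: the function names (`tune_dbscan`, `tune_agglomerative`, `evaluate`), the parameter grids, the synthetic blob data, and the choice of silhouette score as the tuning criterion are all assumptions made for illustration. The UMAP step is left out to keep the example dependent on scikit-learn only, so clustering here runs directly on the scaled 2D data.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.preprocessing import StandardScaler


def tune_dbscan(X, eps_values, min_samples_values):
    """Hypothetical helper: for-loop grid search over DBSCAN parameters."""
    best_score, best_params, best_labels = -1.0, None, None
    for eps in eps_values:
        for min_samples in min_samples_values:
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
            mask = labels != -1  # DBSCAN labels noise points as -1
            n_clusters = len(set(labels[mask]))
            # Silhouette needs at least 2 clusters and fewer clusters than points.
            if n_clusters < 2 or n_clusters >= mask.sum():
                continue
            score = silhouette_score(X[mask], labels[mask])
            if score > best_score:
                best_score = score
                best_params = {"eps": eps, "min_samples": min_samples}
                best_labels = labels
    return best_score, best_params, best_labels


def tune_agglomerative(X, n_clusters_values, linkages=("ward", "average")):
    """Hypothetical helper: for-loop grid search over agglomerative parameters."""
    best_score, best_params, best_labels = -1.0, None, None
    for n_clusters in n_clusters_values:
        for linkage in linkages:
            model = AgglomerativeClustering(n_clusters=n_clusters, linkage=linkage)
            labels = model.fit_predict(X)
            score = silhouette_score(X, labels)
            if score > best_score:
                best_score = score
                best_params = {"n_clusters": n_clusters, "linkage": linkage}
                best_labels = labels
    return best_score, best_params, best_labels


def evaluate(X, labels):
    """Compute the three metrics shown in the dashboard for one labelling."""
    return {
        "silhouette": silhouette_score(X, labels),
        "davies_bouldin": davies_bouldin_score(X, labels),
        "calinski_harabasz": calinski_harabasz_score(X, labels),
    }


# Synthetic stand-in for the pre-processed data (assumption: three 2D blobs).
X_raw, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.5,
    random_state=0,
)
X = StandardScaler().fit_transform(X_raw)  # the "scales the data" step

db_score, db_params, db_labels = tune_dbscan(
    X, eps_values=[0.1, 0.2, 0.3, 0.5], min_samples_values=[5, 10]
)
agg_score, agg_params, agg_labels = tune_agglomerative(X, n_clusters_values=range(2, 6))
agg_metrics = evaluate(X, agg_labels)

print("DBSCAN:", db_params, round(db_score, 3))
print("Agglomerative:", agg_params, agg_metrics)
```

In the real pipeline the label arrays would be appended as new columns to the original dataframe, and the 2D coordinates for plotting would come from the UMAP embedding rather than from the scaled data itself. Scoring DBSCAN only on its non-noise points is a design choice of this sketch; it tends to favour parameter settings that discard many points as noise, so the noise fraction is worth checking alongside the score.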