Add UMAP embedding visualisation script #55

Merged · merged 10 commits into main from umap-viz on Nov 3, 2023
Conversation

@amorehead (Collaborator) commented on Oct 27, 2023

  • Adds a visualise.py script that loads any pre-trained encoder checkpoint, embeds an entire dataset as a collection of graph embeddings, and plots the collected embeddings using a Gaussian flavour of UMAP (a minimal sketch of this flow follows the list below).
  • Lists the top 20 most common label string values in the legend of the resulting figure. For example, running the script as `python proteinworkshop/visualise.py ckpt_path=$MY_PT_CA_BB_CKPT_PATH plot_filepath=output_visualisations/pt_encoder_fold_superfamily.png dataset=fold_superfamily encoder=gcpnet features=ca_bb task=multiclass_graph_classification` identifies the top 20 most common superfamilies in the figure's legend.
  • Plots may be adjusted by the user as desired. The core logic for acquiring dataset graph embeddings is provided (i.e., batteries included).
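For orientation, the sketch below shows the general shape of such a pipeline: collect one embedding per graph from a frozen encoder, project the embeddings to 2-D with UMAP, and colour only the top 20 most common labels in the legend. The helper names, the encoder's call signature, and the use of `batch.y` for labels are illustrative assumptions, not the script's actual API.

```python
# Illustrative sketch only (not the actual visualise.py implementation).
from collections import Counter

import matplotlib.pyplot as plt
import numpy as np
import torch
import umap  # provided by the umap-learn package


@torch.no_grad()
def collect_graph_embeddings(encoder, dataloader):
    """Run the frozen encoder over every batch and gather per-graph embeddings."""
    encoder.eval()
    embeddings, labels = [], []
    for batch in dataloader:
        embeddings.append(encoder(batch))  # assumed to return one embedding per graph
        labels.extend(batch.y.tolist())    # assumed graph-level integer labels
    return torch.cat(embeddings).numpy(), np.asarray(labels)


def plot_umap(embeddings, labels, plot_filepath, top_k=20):
    """Project embeddings to 2-D with UMAP and label the top-k most common classes."""
    coords = umap.UMAP(n_components=2, random_state=42).fit_transform(embeddings)
    top_labels = [lbl for lbl, _ in Counter(labels.tolist()).most_common(top_k)]
    fig, ax = plt.subplots(figsize=(10, 8))
    for lbl in top_labels:
        mask = labels == lbl
        ax.scatter(coords[mask, 0], coords[mask, 1], s=5, label=str(lbl))
    ax.legend(title=f"Top {top_k} labels", fontsize="small", markerscale=2)
    fig.savefig(plot_filepath, dpi=300)
```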

@amorehead (Collaborator, Author)

I am currently facing an odd indexing error when running on the full fold_superfamily dataset. On a subset of the dataset, however, I am seeing clusterings such as the following, for reference.

[Image: structure_denoising_pretrained_gcpnet_fold_superfamily clustering plot]

@amorehead (Collaborator, Author) commented on Oct 27, 2023

A note for later: One idea for quantifying each encoder's clustering ability would be to report each method's Dunn Index and Silhouette value.
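For concreteness, here is a quick sketch of how those two metrics could be computed over the collected embeddings. `silhouette_score` is a real scikit-learn function; the Dunn index is written out by hand since scikit-learn has no built-in for it. The `embeddings`/`labels` inputs and the function names are assumptions rather than anything in this PR.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist
from sklearn.metrics import silhouette_score


def dunn_index(embeddings: np.ndarray, labels: np.ndarray) -> float:
    """Ratio of the smallest inter-cluster distance to the largest cluster diameter."""
    clusters = [embeddings[labels == c] for c in np.unique(labels)]
    min_intercluster = min(
        cdist(a, b).min()
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
    )
    max_diameter = max(pdist(c).max() for c in clusters if len(c) > 1)
    return float(min_intercluster / max_diameter)


def clustering_report(embeddings: np.ndarray, labels: np.ndarray) -> dict:
    return {
        "silhouette": float(silhouette_score(embeddings, labels)),
        "dunn_index": dunn_index(embeddings, labels),
    }
```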

@a-r-j (Owner) commented on Oct 27, 2023

LGTM so far. I'll add a property to the datamodules for mapping labels to their corresponding text where possible.
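Purely as an illustration of the idea (neither the class nor the property name below is from the codebase), such a property might look like:

```python
from typing import Dict

import lightning as L


class FoldClassificationDataModule(L.LightningDataModule):
    """Hypothetical datamodule sketching one possible shape for the proposed property."""

    @property
    def label_to_text(self) -> Dict[int, str]:
        # Hypothetical mapping from integer class ids to readable names,
        # e.g. SCOP superfamily identifiers for fold_superfamily.
        return {0: "b.1", 1: "c.2", 2: "d.58"}
```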

@a-r-j (Owner) commented on Oct 27, 2023

How are devices handled? Normally the L.Trainer would handle this but we're not using that here. I suppose this is all happening on the CPU?

@amorehead (Collaborator, Author) commented on Oct 27, 2023

Great point. I completely missed that the trainer wasn't being instantiated here, which explains why my initial runs of the script were so slow. In this commit, I've added GPU support to both embed.py and visualise.py. After timing visualise.py with GCPNet on the fold_superfamily dataset, I was able to create the plot below in around 10 minutes on a GPU.

[Image: structure_denoising_pretrained_gcpnet_fold_superfamily clustering plot]

Overall, it looks like the b.1 superfamily (i.e., the most common superfamily in the dataset) is clustered pretty well.
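In outline, the manual device handling amounts to moving the encoder and each batch onto the GPU yourself, since no L.Trainer is orchestrating it; the names below are illustrative rather than the scripts' actual API.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


@torch.no_grad()
def embed_on_device(encoder, dataloader):
    """Embed every batch on the selected device, returning CPU tensors."""
    encoder = encoder.to(device).eval()
    outputs = []
    for batch in dataloader:
        batch = batch.to(device)              # PyG batches support .to(device)
        outputs.append(encoder(batch).cpu())  # move results back off the GPU
    return torch.cat(outputs)
```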

@a-r-j merged commit f0fe4e7 into main on Nov 3, 2023 · 1 of 10 checks passed
@amorehead deleted the umap-viz branch on Nov 3, 2023, 15:55