Add UMAP embedding visualisation script #55

Merged · merged 10 commits into main from umap-viz on Nov 3, 2023
Conversation

@amorehead (Collaborator) commented on Oct 27, 2023

  • Adds a visualise.py script that loads any pre-trained encoder checkpoint, embeds an entire dataset as a collection of graph embeddings, and plots the collected embeddings using a Gaussian flavour of UMAP (a minimal sketch of this flow follows the list below).
  • Lists the top 20 most common label string values in the legend of the resulting figure. For example, running the script as `python proteinworkshop/visualise.py ckpt_path=$MY_PT_CA_BB_CKPT_PATH plot_filepath=output_visualisations/pt_encoder_fold_superfamily.png dataset=fold_superfamily encoder=gcpnet features=ca_bb task=multiclass_graph_classification` identifies the top 20 most common superfamilies in the figure's legend.
  • Plots may be adjusted by the user as desired. The core logic for acquiring dataset graph embeddings is provided (i.e., batteries included).
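For orientation, the sketch below shows the general shape of such a pipeline: collect one embedding per graph from a frozen encoder, project the embeddings to 2-D with UMAP, and colour only the top 20 most common labels in the legend. The helper names, the encoder's call signature, and the use of `batch.y` for labels are illustrative assumptions, not the script's actual API.

```python
# Illustrative sketch only (not the actual visualise.py implementation).
from collections import Counter

import matplotlib.pyplot as plt
import numpy as np
import torch
import umap  # provided by the umap-learn package


@torch.no_grad()
def collect_graph_embeddings(encoder, dataloader):
    """Run the frozen encoder over every batch and gather per-graph embeddings."""
    encoder.eval()
    embeddings, labels = [], []
    for batch in dataloader:
        embeddings.append(encoder(batch))  # assumed to return one embedding per graph
        labels.extend(batch.y.tolist())    # assumed graph-level integer labels
    return torch.cat(embeddings).numpy(), np.asarray(labels)


def plot_umap(embeddings, labels, plot_filepath, top_k=20):
    """Project embeddings to 2-D with UMAP and label the top-k most common classes."""
    coords = umap.UMAP(n_components=2, random_state=42).fit_transform(embeddings)
    top_labels = [lbl for lbl, _ in Counter(labels.tolist()).most_common(top_k)]
    fig, ax = plt.subplots(figsize=(10, 8))
    for lbl in top_labels:
        mask = labels == lbl
        ax.scatter(coords[mask, 0], coords[mask, 1], s=5, label=str(lbl))
    ax.legend(title=f"Top {top_k} labels", fontsize="small", markerscale=2)
    fig.savefig(plot_filepath, dpi=300)
```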

@amorehead (Collaborator, Author)

I am currently facing an odd indexing error when running on the full fold_superfamily dataset. On a subset of the dataset, however, I am seeing clusterings such as the following, for reference.

[Image: structure_denoising_pretrained_gcpnet_fold_superfamily clustering plot]

@amorehead (Collaborator, Author) commented on Oct 27, 2023

A note for later: One idea for quantifying each encoder's clustering ability would be to report each method's Dunn Index and Silhouette value.
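For concreteness, here is a quick sketch of how those two metrics could be computed over the collected embeddings. `silhouette_score` is a real scikit-learn function; the Dunn index is written out by hand since scikit-learn has no built-in for it. The `embeddings`/`labels` inputs and the function names are assumptions rather than anything in this PR.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist
from sklearn.metrics import silhouette_score


def dunn_index(embeddings: np.ndarray, labels: np.ndarray) -> float:
    """Ratio of the smallest inter-cluster distance to the largest cluster diameter."""
    clusters = [embeddings[labels == c] for c in np.unique(labels)]
    min_intercluster = min(
        cdist(a, b).min()
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
    )
    max_diameter = max(pdist(c).max() for c in clusters if len(c) > 1)
    return float(min_intercluster / max_diameter)


def clustering_report(embeddings: np.ndarray, labels: np.ndarray) -> dict:
    return {
        "silhouette": float(silhouette_score(embeddings, labels)),
        "dunn_index": dunn_index(embeddings, labels),
    }
```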

@a-r-j (Owner) commented on Oct 27, 2023

LGTM so far. I'll add a property to the datamodules for mapping labels to their corresponding text where possible.
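Purely as an illustration of the idea (neither the class nor the property name below is from the codebase), such a property might look like:

```python
from typing import Dict

import lightning as L


class FoldClassificationDataModule(L.LightningDataModule):
    """Hypothetical datamodule sketching one possible shape for the proposed property."""

    @property
    def label_to_text(self) -> Dict[int, str]:
        # Hypothetical mapping from integer class ids to readable names,
        # e.g. SCOP superfamily identifiers for fold_superfamily.
        return {0: "b.1", 1: "c.2", 2: "d.58"}
```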

@a-r-j (Owner) commented on Oct 27, 2023

How are devices handled? Normally the L.Trainer would handle this but we're not using that here. I suppose this is all happening on the CPU?

@amorehead (Collaborator, Author) commented on Oct 27, 2023

Great point. I completely missed that the trainer wasn't being instantiated here, which explains why my initial runs of the script were so slow. In this commit, I've added GPU support to both embed.py and visualise.py. After timing visualise.py with GCPNet on the fold_superfamily dataset, I was able to create the plot below in around 10 minutes on a GPU.

[Image: structure_denoising_pretrained_gcpnet_fold_superfamily clustering plot]

Overall, it looks like the b.1 superfamily (i.e., the most common superfamily in the dataset) is clustered pretty well.
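In outline, the manual device handling amounts to moving the encoder and each batch onto the GPU yourself, since no L.Trainer is orchestrating it; the names below are illustrative rather than the scripts' actual API.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


@torch.no_grad()
def embed_on_device(encoder, dataloader):
    """Embed every batch on the selected device, returning CPU tensors."""
    encoder = encoder.to(device).eval()
    outputs = []
    for batch in dataloader:
        batch = batch.to(device)              # PyG batches support .to(device)
        outputs.append(encoder(batch).cpu())  # move results back off the GPU
    return torch.cat(outputs)
```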

@a-r-j merged commit f0fe4e7 into main on Nov 3, 2023 · 1 of 10 checks passed
@amorehead deleted the umap-viz branch on Nov 3, 2023, 15:55