single-cell RNA seq Gaussian Mixture Model analysis pipeline. Accepts as input a single cell dataset (rows:cells x columns:gene expression) and provides their trajectories based on fitted Gaussian Mixture Models.

KyriakosPsa/single-cell-ML-pipeline

scGmix: a Pipeline for Single Cell Gaussian Mixture Models

scGmix is a Python tool for discovering cell states in scRNA-seq datasets. The pipeline integrates data preprocessing (quality control and normalization), dimensionality reduction via principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), or uniform manifold approximation and projection (UMAP), and cell clustering with Gaussian Mixture Models (GMMs). GMM clustering can use component means obtained in one of two ways: computed by a pre-clustering step run with the tools scGmix offers, or precomputed automatically with the integrated Optuna optimization library. Some components of the pipeline still require further tuning, but scGmix balances automation, user preferences, and interpretability, and we believe it is a valuable tool for users who want to identify cell states according to their own requirements.

A full technical report of the pipeline tested on synthetic single-cell datasets is available in Technical_Report.pdf.

Library Dependencies:

  • numpy
  • scanpy
  • anndata
  • matplotlib
  • seaborn
  • scikit-learn
  • kneed
  • pickle (part of the Python standard library, no installation needed)
  • optuna

File Dependencies:

  • ./utils/plotting.py
  • ./utils/optimization.py

Please make sure you have these dependencies installed before running the pipeline.

Pipeline overview

[Figure: scGmix pipeline overview diagram]

Usage

To use the scgmix pipeline, follow the steps below:

Import the necessary libraries:

import numpy as np
import scanpy as sc
import anndata as ad  # aliased as "ad" so the name "adata" stays free for the dataset object
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import SpectralClustering
from sklearn.mixture import GaussianMixture
from kneed import KneeLocator
import warnings
warnings.filterwarnings("ignore")
import pickle

Import the required file dependencies:

from utils.optimization import optimizeGMM, optimizeSpectral
from utils.plotting import plot_bic, make_ellipses_joint, posterior_heatmap, plot_state_cellsums, plot_pca

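Instantiate an scgmix object and provide the required inputs (adata below is assumed to be your single-cell dataset as an AnnData object, rows: cells x columns: genes):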

pipeline = scgmix(adata, method="PCA", rand_seed=42)

Preprocess the data:

pipeline.preprocess(mads_away=5, feature_selection=False, min_mean=0.0125, max_mean=3, min_disp=0.5)
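
The mads_away argument suggests that quality control drops cells whose QC metrics lie more than a given number of median absolute deviations (MADs) from the median. Below is a minimal sketch of that idea using NumPy only. The helper mad_outlier_mask and the toy total_counts array are hypothetical, and the actual filtering inside scGmix may differ.

```python
import numpy as np


def mad_outlier_mask(values, mads_away=5.0):
    """Return a boolean mask, True for values within `mads_away` MADs of the median.

    Hypothetical helper for illustration; not part of scGmix.
    """
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        # The spread cannot be estimated, so no value is flagged as an outlier.
        return np.ones(values.shape, dtype=bool)
    return np.abs(values - median) <= mads_away * mad


# Toy per-cell total counts: nine ordinary cells and one extreme outlier.
total_counts = np.array([980, 1010, 1005, 995, 1020, 990, 1000, 1015, 985, 25000])
keep = mad_outlier_mask(total_counts, mads_away=5)
filtered = total_counts[keep]
print(f"kept {keep.sum()} of {keep.size} cells")
```

A larger mads_away keeps more cells; a smaller value filters more aggressively.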

Perform dimensionality reduction:

pipeline.dimreduction(n_pcs=100, pc_selection_method="screeplot", n_neighbors=15, min_dist=0.1,
                      use_highly_variable=False, variance_threshold=90, verbose=True, plot_result=False)
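
One plausible reading of variance_threshold=90 is that the number of principal components is chosen as the smallest number whose cumulative explained variance reaches 90%. The sketch below illustrates that selection rule with scikit-learn on synthetic data. The function n_pcs_for_variance and the toy matrix are hypothetical, and scGmix's own selection logic (including the "screeplot" option) may work differently.

```python
import numpy as np
from sklearn.decomposition import PCA


def n_pcs_for_variance(X, variance_threshold=90.0, max_pcs=100):
    """Smallest number of PCs whose cumulative explained variance (%) reaches the threshold.

    Hypothetical helper for illustration; not part of scGmix.
    """
    n_components = min(max_pcs, X.shape[0], X.shape[1])
    pca = PCA(n_components=n_components, random_state=42).fit(X)
    cumulative = np.cumsum(pca.explained_variance_ratio_) * 100.0
    reached = np.nonzero(cumulative >= variance_threshold)[0]
    # Fall back to all fitted components if the threshold is never reached.
    n_selected = int(reached[0]) + 1 if reached.size else n_components
    return n_selected, cumulative


# Toy "cells x genes" matrix: three strong latent factors plus weak noise.
rng = np.random.default_rng(42)
latent = rng.normal(size=(200, 3)) * np.array([10.0, 5.0, 2.0])
loadings = rng.normal(size=(3, 50))
X = latent @ loadings + 0.1 * rng.normal(size=(200, 50))

n_selected, cumulative = n_pcs_for_variance(X, variance_threshold=90)
print(f"{n_selected} PCs explain {cumulative[n_selected - 1]:.1f}% of the variance")
```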

Perform clustering:

pipeline.mix(preclustering_method="spectral", enable_preclustering=False, leiden_resolution=1.0,
             criterion="BIC", n_trials=100, verbose=True, max_iter=1000, max_num_components=5, user_means=None, show_progress_bar=True)
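
With criterion="BIC" and max_num_components=5, the clustering step presumably compares GMMs with different numbers of components and keeps the one with the lowest Bayesian information criterion. The sketch below shows that general technique with scikit-learn's GaussianMixture on synthetic 2-D data, using a plain loop rather than the Optuna search that scGmix integrates. The function fit_gmm_by_bic and the toy data are hypothetical. The last lines show how starting means, for example from a pre-clustering step or user-supplied, can be passed through scikit-learn's means_init parameter.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def fit_gmm_by_bic(X, max_num_components=5, max_iter=1000, rand_seed=42):
    """Fit GMMs with 1..max_num_components components and return the lowest-BIC model.

    Hypothetical helper for illustration; not part of scGmix.
    """
    best_model, best_bic, scores = None, np.inf, {}
    for k in range(1, max_num_components + 1):
        gmm = GaussianMixture(n_components=k, max_iter=max_iter, random_state=rand_seed)
        gmm.fit(X)
        scores[k] = gmm.bic(X)
        if scores[k] < best_bic:
            best_model, best_bic = gmm, scores[k]
    return best_model, scores


# Toy embedding: three well-separated groups of 100 "cells" each.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])

best, bic_scores = fit_gmm_by_bic(X)
posteriors = best.predict_proba(X)  # per-cell membership probability for each component
print(f"selected {best.n_components} components")

# Supplying starting means explicitly instead of relying on the default initialization.
seeded = GaussianMixture(n_components=3, means_init=centers, max_iter=1000, random_state=42)
seeded.fit(X)
```

The posterior matrix has one row per cell and one column per component, which is the kind of information a posterior heatmap would display.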

Additional Methods

The scgmix class also provides utility methods:

pipeline.savefile(filenamepath)   # Save the processed data to a file.
pipeline.savemodel(filenamepath)  # Save the trained GMM model to a file.
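
Since pickle is listed among the dependencies, a saved model can presumably be restored with it. The sketch below is an assumption about the general approach, not a description of what savemodel does internally: it pickles a fitted scikit-learn GaussianMixture to a temporary file, loads it back, and checks that the restored model gives the same predictions. The file name gmm_model.pkl is hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.mixture import GaussianMixture

# Fit a small GMM on toy data with two separated groups.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])
model = GaussianMixture(n_components=2, random_state=42).fit(X)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "gmm_model.pkl")  # hypothetical file name
    with open(path, "wb") as fh:
        pickle.dump(model, fh)  # serialize the fitted model
    with open(path, "rb") as fh:
        restored = pickle.load(fh)  # restore it from disk

same_labels = np.array_equal(model.predict(X), restored.predict(X))
print("restored model reproduces predictions:", same_labels)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.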
