
Generate a rich feature space describing all possible mutations in a given protein sequence (+ structure)






Deriving tabular form features for each possible mutation in a protein.


A container that takes a protein sequence [.fasta] and, optionally, a structure [.pdb], and returns a table in which each row is one possible amino acid (AA) mutation and each column is a quantitative feature describing that mutation.
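As a rough illustration of what "each possible AA mutation" means, the following Python sketch enumerates every single-residue substitution for a toy sequence. It is not taken from the MutationFeatures code; the function names and the toy FASTA record are hypothetical, and it assumes only the 20 standard amino acids are considered.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues (assumption)


def read_fasta_sequence(text):
    """Return the first sequence in FASTA-formatted text as one string."""
    seq_lines = []
    seen_header = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seen_header:
                break  # stop at the second record
            seen_header = True
            continue
        seq_lines.append(line)
    return "".join(seq_lines).upper()


def enumerate_mutations(sequence):
    """Yield (position, wild_type, mutant) for every single substitution."""
    for position, wild_type in enumerate(sequence, start=1):
        for mutant in AMINO_ACIDS:
            if mutant != wild_type:
                yield position, wild_type, mutant


fasta = ">toy\nMKT\n"  # hypothetical three-residue protein
rows = list(enumerate_mutations(read_fasta_sequence(fasta)))
print(len(rows))  # 57, i.e. 19 substitutions at each of 3 positions
print(rows[0])    # (1, 'M', 'A')
```

Under this assumption a protein of length N gives 19 × N rows, which is a quick way to estimate the size of the output table.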

The tool is intended for anyone looking for patterns that distinguish, for example, resistance-conferring or disease-causing mutations.

Features include:

  • Evolutionary: residue frequencies, site conservation, site-site co-evolution
  • Structural: disorder, solvent accessibility, secondary structure
  • Physicochemical: change in charge, hydrophobicity, VDW radius
  • Ligand: probability that the residue is in a pocket, whether the residue contacts the most likely drug pocket
  • Language embedding of residue: ProtT5
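To make the physicochemical features more concrete, here is a minimal Python sketch of how a change in hydrophobicity and charge could be computed for one substitution. It assumes the Kyte-Doolittle hydropathy scale and integer side-chain charges at roughly neutral pH; the scales MutationFeatures actually uses may differ, and the function name is hypothetical.

```python
# Kyte-Doolittle hydropathy index (assumed scale, not necessarily the tool's).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Approximate side-chain charge near neutral pH; all other residues count as 0.
CHARGE = {"D": -1, "E": -1, "K": 1, "R": 1}


def physicochemical_delta(wild_type, mutant):
    """Return mutant-minus-wild-type differences for one substitution."""
    return {
        "delta_hydrophobicity": round(
            KYTE_DOOLITTLE[mutant] - KYTE_DOOLITTLE[wild_type], 2
        ),
        "delta_charge": CHARGE.get(mutant, 0) - CHARGE.get(wild_type, 0),
    }


# Aspartate to lysine: charge reversal with little change in hydropathy.
print(physicochemical_delta("D", "K"))
# {'delta_hydrophobicity': -0.4, 'delta_charge': 2}
```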

When only a sequence is provided, the structural features are limited to those predicted from the sequence.

When both a sequence and a PDB file are provided, features derived from the structure are appended (every residue must be resolved in the structure).
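Because every residue must be resolved, it can help to check a structure file before running the container. The Python sketch below rebuilds a one-letter sequence from the C-alpha ATOM records of a PDB file and compares it with the query sequence. It is a pre-flight check of my own, not part of MutationFeatures, and it assumes a single-chain structure (only the first chain encountered is read) in standard fixed-column PDB format. The helper `ca_line` exists only to build a toy example.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def sequence_from_pdb(pdb_text, chain=None):
    """Build a one-letter sequence from the C-alpha atoms of one chain."""
    residues = []
    seen = set()
    for line in pdb_text.splitlines():
        if line.startswith("ENDMDL"):
            break  # only read the first model
        if not line.startswith("ATOM") or line[12:16].strip() != "CA":
            continue
        chain_id = line[21:22]
        if chain is None:
            chain = chain_id  # default to the first chain encountered
        if chain_id != chain:
            continue
        key = (line[22:26], line[26:27])  # residue number + insertion code
        if key in seen:
            continue  # skip alternate locations of the same residue
        seen.add(key)
        residues.append(THREE_TO_ONE.get(line[17:20], "X"))
    return "".join(residues)


def pdb_covers_sequence(pdb_text, sequence):
    """True if the structure resolves exactly the residues in `sequence`."""
    return sequence_from_pdb(pdb_text) == sequence.upper()


def ca_line(serial, res_name, res_seq, chain="A"):
    """Format a minimal C-alpha ATOM record (toy data, coordinates omitted)."""
    return f"ATOM  {serial:5d}  CA  {res_name} {chain}{res_seq:4d}    "


toy_pdb = "\n".join(
    ca_line(i, name, i) for i, name in enumerate(["MET", "LYS", "THR"], start=1)
)
print(pdb_covers_sequence(toy_pdb, "MKT"))   # True
print(pdb_covers_sequence(toy_pdb, "MKTA"))  # False, residue 4 is missing
```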


For all intents and purposes one can replace "podman" with "docker" below.

To run the program you need a few things:

  • A linux environment with podman and ncbi-blast installed
  • The MutationFeatures code, with that directory as your working directory
    git clone
    cd MutationFeatures
  • Generate the subfolders ./db, ./query, ./temp
  • A BLAST database to mount in the container. MutationFeatures currently requires uniref50
    mkdir -p ./db
    wget -P ./db
    gunzip ./db/uniref50.fasta.gz
    makeblastdb -in ./db/uniref50.fasta -parse_seqids -dbtype prot

Then build the container image by running: podman build . -t mf

To run MutationFeatures against a query protein, place a FASTA file, say my.fasta, in ./query, and optionally a structure file with the same basename, such as my.pdb. Ensure the PDB file contains every residue in the protein primary sequence, as AlphaFold models do. Then run the following command:

podman run -e NVIDIA_VISIBLE_DEVICES=1 --rm -it --name mf \
    -v ./db:/db \
    -v ./lib:/mflibs \
    -v ./query:/query \
    -v ./temp:/tmp \
    mf /bin/bash \
    -c "Rscript /scripts/mf.R /query/my.fasta uniref50.fasta 32 1e-7" # query_fasta blast_db_name threads psiblast_eval

"-e NVIDIA_VISIBLE_DEVICES=1" is optional; it selects which GPU is exposed to the container.
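For batch runs it may be convenient to assemble the command above programmatically. The Python sketch below builds the same argument list from the four positional parameters (query FASTA, BLAST database name, threads, PSI-BLAST e-value). The function name, its defaults and the `engine` and `gpu` parameters are my own; the mounts, image name and script path are copied from the command above.

```python
import shlex


def build_run_command(fasta_name, blast_db="uniref50.fasta", threads=32,
                      psiblast_eval="1e-7", engine="podman", gpu=None):
    """Return the container invocation as an argument list."""
    # Positional arguments: query_fasta blast_db_name threads psiblast_eval
    script_args = ["Rscript", "/scripts/mf.R", f"/query/{fasta_name}",
                   blast_db, str(threads), str(psiblast_eval)]
    inner = " ".join(shlex.quote(arg) for arg in script_args)

    cmd = [engine, "run"]
    if gpu is not None:
        cmd += ["-e", f"NVIDIA_VISIBLE_DEVICES={gpu}"]  # optional GPU choice
    cmd += ["--rm", "-it", "--name", "mf"]
    for host_dir, target in [("./db", "/db"), ("./lib", "/mflibs"),
                             ("./query", "/query"), ("./temp", "/tmp")]:
        cmd += ["-v", f"{host_dir}:{target}"]
    cmd += ["mf", "/bin/bash", "-c", inner]
    return cmd


# Print a copy-pasteable shell command; swap engine="docker" if preferred.
print(shlex.join(build_run_command("my.fasta", gpu=1)))
```

The list can also be handed to `subprocess.run`, although the `-it` flags expect an interactive terminal and would likely need to be dropped for unattended use.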

The resulting CSV file is written to the same directory as your query FASTA file. It contains a row for every possible mutation and columns that form a feature space suitable for machine learning.
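A quick way to sanity-check the output is to read the header and count the rows. The Python sketch below does this with the standard library only. The output file name and column names are not documented here, so the toy file and its columns are purely hypothetical stand-ins.

```python
import csv
import os
import tempfile


def summarize_feature_table(path):
    """Return (column_names, number_of_data_rows) for a CSV file."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        n_rows = sum(1 for _ in reader)
    return header, n_rows


# Toy file with hypothetical column names, standing in for the real output.
toy_path = os.path.join(tempfile.mkdtemp(), "toy_features.csv")
with open(toy_path, "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["position", "wild_type", "mutant", "delta_charge"])
    writer.writerow([1, "M", "A", 0])
    writer.writerow([1, "M", "K", 1])

columns, n_rows = summarize_feature_table(toy_path)
print(columns)  # ['position', 'wild_type', 'mutant', 'delta_charge']
print(n_rows)   # 2
```

If the table covers all 19 substitutions at every position (an assumption, not something stated above), the row count should equal 19 times the sequence length.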

Oscar J Charles 2022