Sequence models for protein classification

final course report: IDL_Project.pdf

Introduction

Data

Protein sequences and their labels are available in a Kaggle dataset.

dataset: https://www.kaggle.com/shahir/protein-data-set

Objective

Inferring the properties of a protein from its amino acid sequence is one of the key problems in bioinformatics.

We propose a method to classify the 10 most common proteins in the dataset directly from their sequences, which are written in an alphabet of 20 amino acids.
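Before a sequence can be fed to a model, each amino acid has to be mapped to an integer index. The sketch below shows one way this could be done. The index layout, the reserved padding and unknown indices, and the `encode` helper are assumptions made for illustration and are not taken from the project notebook.

```python
# One-letter codes of the 20 standard amino acids.
# Alphabetical order is an arbitrary choice made for this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical reserved indices: 0 for padding, 1 for any unknown residue.
PAD, UNK = 0, 1
AA_TO_INDEX = {aa: i + 2 for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence, max_len):
    """Map a protein sequence to a list of exactly max_len integer indices."""
    # Truncate to max_len, then look up each residue (unknown letters map to UNK).
    ids = [AA_TO_INDEX.get(aa, UNK) for aa in sequence.upper()[:max_len]]
    # Right-pad short sequences so every example has the same length.
    return ids + [PAD] * (max_len - len(ids))


print(encode("MKVLA", 8))  # [12, 10, 19, 11, 2, 0, 0, 0]
```

With this layout the vocabulary has 22 entries (20 amino acids plus the two reserved indices), which is the size an embedding layer would need.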

Model

We used a CNN and BiLSTM model, similar to architectures used in sentiment analysis. The CNN extracts spatial features from the embedded protein sequence. A bidirectional LSTM is a powerful tool for sequence prediction and classification. A protein sequence has no predefined reading direction, which is why a bidirectional LSTM is preferred here. The CNN and LSTM outputs are concatenated and passed through 2 fully connected layers to produce the final classification. Dropout is applied between the sub-model blocks.
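The architecture described above could look roughly like the following. This is a minimal sketch, not the notebook's implementation: the use of PyTorch, the parallel arrangement of the two branches, the pooling choice, and every layer size and the dropout rate are assumptions, and the class name `CnnBiLstmClassifier` is hypothetical.

```python
import torch
import torch.nn as nn


class CnnBiLstmClassifier(nn.Module):
    """Parallel CNN and BiLSTM branches over a shared embedding (all sizes assumed)."""

    def __init__(self, vocab_size=22, embed_dim=32, conv_channels=64,
                 kernel_size=5, lstm_hidden=64, fc_hidden=128,
                 num_classes=10, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.conv = nn.Conv1d(embed_dim, conv_channels, kernel_size,
                              padding=kernel_size // 2)
        self.lstm = nn.LSTM(embed_dim, lstm_hidden, batch_first=True,
                            bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        # Two fully connected layers on top of the concatenated features.
        self.fc1 = nn.Linear(conv_channels + 2 * lstm_hidden, fc_hidden)
        self.fc2 = nn.Linear(fc_hidden, num_classes)

    def forward(self, token_ids):
        x = self.embedding(token_ids)                  # (batch, length, embed_dim)
        # CNN branch: convolve along the sequence, then max-pool over positions.
        c = torch.relu(self.conv(x.transpose(1, 2)))   # (batch, channels, length)
        c = c.max(dim=2).values                        # (batch, channels)
        # BiLSTM branch: final hidden state of the forward and backward passes.
        _, (h_n, _) = self.lstm(x)                     # h_n: (2, batch, hidden)
        r = torch.cat([h_n[0], h_n[1]], dim=1)         # (batch, 2 * hidden)
        # Concatenate both branches, with dropout between the blocks.
        features = self.dropout(torch.cat([c, r], dim=1))
        hidden = self.dropout(torch.relu(self.fc1(features)))
        return self.fc2(hidden)                        # unnormalized class scores


model = CnnBiLstmClassifier()
batch = torch.randint(0, 22, (4, 50))   # 4 random sequences of length 50
logits = model(batch)
print(logits.shape)  # torch.Size([4, 10])
```

For simplicity the LSTM here also reads padded positions. A fuller version might pack the sequences by length so that padding does not influence the final hidden states.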

Training

We trained the model on the 10 most common proteins. Only proteins of length < 2000 were kept to limit computation time. Gradient clipping was used to prevent exploding gradients, a common problem for LSTM models.
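The two training choices above, data selection and gradient clipping, can be illustrated in plain Python. This is a sketch under stated assumptions: the record format, the helper names `select_training_records` and `clip_by_global_norm`, the toy labels, and the decision to count label frequencies before applying the length filter are all hypothetical. Clipping by global L2 norm is one common variant, and the source does not say which variant or threshold was used.

```python
import math
from collections import Counter


def select_training_records(records, n_classes=10, max_len=2000):
    """Keep (sequence, label) pairs from the n most frequent labels with length < max_len."""
    counts = Counter(label for _, label in records)
    top = {label for label, _ in counts.most_common(n_classes)}
    return [(seq, label) for seq, label in records
            if label in top and len(seq) < max_len]


def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradient values so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)          # already small enough, leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]


# Toy data with made-up sequences and labels, for illustration only.
records = [
    ("MKV", "HYDROLASE"), ("ACD", "HYDROLASE"), ("A" * 2500, "HYDROLASE"),
    ("GGH", "TRANSFERASE"), ("PQR", "TRANSFERASE"), ("WYL", "LYASE"),
]
kept = select_training_records(records, n_classes=2)
print(len(kept))  # 4

clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(math.sqrt(sum(g * g for g in clipped)))
```

Clipping only bounds the size of an update and keeps its direction. In PyTorch, the same operation is available as `torch.nn.utils.clip_grad_norm_`.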

GPU Used: Tesla P100 PCIe 16GB

Results

The model was evaluated on a test set of 3000 sequences.

Total test accuracy: 83%

