Currently, only Unix-based systems (including macOS) are supported.
- Instructions for setting up Clustal Omega can be found here.
- Supported Clustal Omega version: 1.2.4
wget http://www.clustal.org/omega/clustal-omega-1.2.4.tar.gz
tar zxf clustal-omega-1.2.4.tar.gz
cd clustal-omega-1.2.4
./configure --prefix /your/install/path
make
make check # optional: run automated tests
make install # optional: install Clustal Omega programs, man pages
# or, on Debian/Ubuntu, install the packaged version
sudo apt-get install clustalo
- NOTE: Emboss is very slow. Unless you are experimenting, we don't recommend using it; Clustal Omega should be sufficient for most use cases.
- For setting up Emboss, please read here.
- Supported Emboss version: 6.6.0
- For Infernal, follow the instructions here.
- Supported Infernal version: 1.1.5
wget http://eddylab.org/software/infernal/infernal.tar.gz
tar zxf infernal.tar.gz
cd infernal-1.1.5
./configure --prefix /your/install/path
make
make check # optional: run automated tests
make install # optional: install Infernal programs, man pages
# or, on Debian/Ubuntu, install the packaged version
sudo apt-get install infernal infernal-doc
- To use this tool, you need to provide the Rfam covariance model, which is available for download at https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz. The model must also be compressed and indexed with the cmpress command from the Infernal tool (mentioned above). If you don't have the model yet, use the commands below:
cd nucleoseeker
mkdir -p rfam
# download into the rfam directory created above
wget https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz -O rfam/Rfam.cm.gz
gunzip rfam/Rfam.cm.gz
cmpress rfam/Rfam.cm
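To confirm that cmpress has finished, you can check for the four index files it writes next to the model (.i1m, .i1i, .i1f and .i1p). The snippet below is a minimal sketch of such a check; the helper name missing_cmpress_files is hypothetical and not part of nucleoseeker.

```python
from pathlib import Path

# Suffixes of the four binary index files that cmpress writes next to the CM file.
CMPRESS_SUFFIXES = (".i1m", ".i1i", ".i1f", ".i1p")


def missing_cmpress_files(cm_path):
    """Return the expected cmpress index files that are not present on disk.

    Hypothetical helper for illustration; an empty list means the model
    at cm_path appears to have been pressed already.
    """
    cm_path = Path(cm_path)
    expected = [Path(str(cm_path) + suffix) for suffix in CMPRESS_SUFFIXES]
    return [p for p in expected if not p.is_file()]
```

Pass the path of your uncompressed Rfam.cm file; if the returned list is not empty, run cmpress on that file again.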
# We recommend using a virtual environment when using this tool
git clone https://github.com/theuutkarsh/nucleoseeker.git
cd nucleoseeker
pip install -r requirements.txt
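Before generating a dataset, it can help to verify that the external programs are on your PATH. This is a small sketch under a few assumptions: it looks for the clustalo, cmpress and cmscan executables, and the function name find_missing_tools is hypothetical. Emboss is left out because it is optional.

```python
import shutil

# Executables looked up on PATH: clustalo (Clustal Omega), cmpress and cmscan (Infernal).
REQUIRED_TOOLS = ("clustalo", "cmpress", "cmscan")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH.

    Hypothetical helper for illustration, not part of nucleoseeker.
    """
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All external tools were found.")
```

If you installed the tools with a custom --prefix, add the bin directory under that prefix to PATH first.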
After you have prepared the environment, you can generate datasets using the following command:
export DATA_PATH=/your/desired/path/to/save/the/dataset
python3 src/dataset_creator.py \
--dataset_name test \
--rfam_cm_path your/rfam/path \
--exptl_method "X-RAY DIFFRACTION" \
--resolution 3.6 \
--year_range 2019 \
--save 1
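If you prefer to launch the same run from Python (for example, to loop over several settings), the command can be assembled as an argument list. This is only a sketch: it uses the flags shown above and passes their values through unchanged, and the helper names build_dataset_command and run_dataset_creator are hypothetical.

```python
import os
import subprocess
import sys


def build_dataset_command(dataset_name, rfam_cm_path, exptl_method,
                          resolution, year_range, save=1):
    """Build the argument list for src/dataset_creator.py (hypothetical helper).

    Only the flags shown in the command above are used; values are
    converted to strings and passed through without interpretation.
    """
    return [
        sys.executable, "src/dataset_creator.py",
        "--dataset_name", str(dataset_name),
        "--rfam_cm_path", str(rfam_cm_path),
        "--exptl_method", str(exptl_method),
        "--resolution", str(resolution),
        "--year_range", str(year_range),
        "--save", str(save),
    ]


def run_dataset_creator(data_path, **options):
    """Run the script from the repository root with DATA_PATH set."""
    env = dict(os.environ, DATA_PATH=str(data_path))
    subprocess.run(build_dataset_command(**options), check=True, env=env)
```

Calling run_dataset_creator("/your/desired/path/to/save/the/dataset", dataset_name="test", rfam_cm_path="your/rfam/path", exptl_method="X-RAY DIFFRACTION", resolution=3.6, year_range=2019) is intended to be equivalent to the shell command. Because the arguments are passed as a list, the space in "X-RAY DIFFRACTION" needs no extra quoting.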
After running this command, a directory named test will be created in the DATA_PATH directory with the following structure:
DATA_PATH
├── test_dataset
│   ├── files
│   └── sequences
├── clean_tblout.tblout
├── cmscan.out
├── combined.fasta
├── fam_pdb_chain.csv
├── final.fasta
├── raw_experimental_RNA_0_500.csv
├── sequence_identity_mat_clustal.csv
└── tblout.tblout
- raw_experimental_RNA_0_500.csv: Raw data from the PDB database.
- combined.fasta: Sequences used in sequence identity calculation by Clustal Omega and Emboss, obtained after applying StructureLevelFilter and PDBFilter on the raw data.
- sequence_identity_mat_clustal.csv: Sequence identity matrix obtained from Clustal Omega and Emboss tools.
- final.fasta: Final sequences in FASTA format; the output if family analysis is not required.
- cmscan.out, tblout.tblout, clean_tblout.tblout: Output files of the Infernal tool.
- fam_pdb_chain.csv: Mapping of family and PDB chain, obtained after family search by Infernal; the final output for family analysis.
- test_dataset/files: Directory containing dataframes and lists for structures at each filter level.
- test_dataset/sequences: Directory containing sequences for all final structures in individual FASTA files.
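The FASTA outputs can be loaded without extra dependencies. Below is a minimal sketch of a reader, assuming a standard FASTA layout (a header line starting with ">" followed by one or more sequence lines). The function name read_fasta is hypothetical, and the exact header format used in final.fasta is not assumed.

```python
def read_fasta(path):
    """Read a FASTA file into a dict mapping header text to sequence.

    Hypothetical helper for illustration. Multi-line sequences are joined;
    if a header occurs more than once, the last record wins.
    """
    records = {}
    header = None
    chunks = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                if header is not None:
                    records[header] = "".join(chunks)
                header = line[1:]
                chunks = []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records
```

For example, read_fasta("final.fasta") returns every final sequence keyed by its header, which makes it easy to count sequences or inspect their lengths.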
For some simple examples, please take a look at this notebook.