NucleoSeeker - Precision filtering of RNA databases to curate high-quality datasets

Currently, only Unix-based systems (including macOS) are supported.

Setup Steps

Get Clustal Omega ready

Instructions for setting up Clustal Omega can be found here.
Supported Clustal Omega version: 1.2.4

  wget http://www.clustal.org/omega/clustal-omega-1.2.4.tar.gz
  tar zxf clustal-omega-1.2.4.tar.gz
  cd clustal-omega-1.2.4
  ./configure --prefix /your/install/path
  make
  make check                 # optional: run automated tests
  make install               # optional: install the Clustal Omega programs

  # or, on Debian/Ubuntu, install the packaged version

  sudo apt-get install clustalo

Get Emboss ready (Optional)

NOTE - Emboss is very slow; we don't recommend using it unless you are experimenting. Clustal Omega should be sufficient for most use cases.
For instructions on setting up Emboss, please read here.
Supported Emboss version: 6.6.0

Get Infernal ready

For Infernal, follow the instructions here.
Supported Infernal version: 1.1.5

  wget http://eddylab.org/software/infernal/infernal.tar.gz
  tar zxf infernal.tar.gz
  cd infernal-1.1.5
  ./configure --prefix /your/install/path
  make
  make check                 # optional: run automated tests
  make install               # optional: install Infernal programs, man pages

  # or, on Debian/Ubuntu, install the packaged version

  sudo apt-get install infernal infernal-doc
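
Once Clustal Omega and Infernal are installed, it can help to confirm that their executables are visible on PATH before generating a dataset. The snippet below is a minimal sketch of such a check and is not part of NucleoSeeker; the helper name missing_tools is hypothetical, and the list assumes the standard executable names clustalo, cmscan and cmpress.

```python
import shutil

# Executable names installed by Clustal Omega (clustalo) and Infernal
# (cmscan, cmpress). Assumed to be the tools needed at run time.
REQUIRED_TOOLS = ("clustalo", "cmscan", "cmpress")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

If you installed with ./configure --prefix, a tool reported as missing usually means that the bin directory under your install path has not been added to PATH.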

Get Rfam.cm file ready

To use this tool, you need to provide the Rfam covariance model file, which is available for download at https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz. The file must also be prepared with the cmpress command from Infernal (installed above), which builds the binary index files needed for searching. If you don't have the file yet, use the commands below:

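  cd nucleoseeker            # the cloned repository (see the clone step below)
  mkdir -p rfam
  wget https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz -O rfam/Rfam.cm.gz
  gunzip rfam/Rfam.cm.gz     # produces rfam/Rfam.cm
  cmpress rfam/Rfam.cm       # writes the index files next to Rfam.cm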
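
To confirm that cmpress has been run, you can check for the four index files it writes next to the covariance model (.i1f, .i1i, .i1m and .i1p). The following is a minimal sketch, not part of NucleoSeeker; the function name is_pressed is hypothetical.

```python
from pathlib import Path

# Suffixes of the four index files that cmpress writes next to the CM file.
CMPRESS_SUFFIXES = (".i1f", ".i1i", ".i1m", ".i1p")


def is_pressed(cm_path):
    """Return True if all cmpress index files exist for `cm_path`."""
    cm_path = Path(cm_path)
    return all(
        Path(str(cm_path) + suffix).is_file() for suffix in CMPRESS_SUFFIXES
    )
```

For example, is_pressed("Rfam.cm") returns False when any of the index files is missing, in which case cmpress needs to be run again on that file.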

Generating new dataset

# We recommend using a virtual environment when using this tool

git clone https://github.com/theuutkarsh/nucleoseeker.git
cd nucleoseeker
pip install -r requirements.txt

After preparing the environment, you can generate datasets with the following command:

export DATA_PATH=/your/desired/path/to/save/the/dataset
python3 src/dataset_creator.py \
        --dataset_name test \
        --rfam_cm_path your/rfam/path \
        --exptl_method "X-RAY DIFFRACTION" \
        --resolution 3.6 \
        --year_range 2019 \
        --save 1
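
If you prefer to launch the same run from a Python script (for example, to loop over several resolutions), the call can be sketched as below. This is only an illustration under stated assumptions: the helper names build_command and run are hypothetical, the flags are exactly those shown in the command above, and the script is assumed to be started from the repository root.

```python
import os
import subprocess
import sys


def build_command(dataset_name, rfam_cm_path, exptl_method, resolution, year_range):
    """Build the argument list for src/dataset_creator.py (flags as shown above)."""
    return [
        sys.executable, "src/dataset_creator.py",
        "--dataset_name", dataset_name,
        "--rfam_cm_path", rfam_cm_path,
        "--exptl_method", exptl_method,
        "--resolution", str(resolution),
        "--year_range", str(year_range),
        "--save", "1",
    ]


def run(command, data_path):
    """Run the command with DATA_PATH set; raises CalledProcessError on failure."""
    env = dict(os.environ, DATA_PATH=data_path)
    subprocess.run(command, env=env, check=True)
```

Passing the arguments as a list avoids shell quoting problems with values that contain spaces, such as "X-RAY DIFFRACTION".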

After running the dataset_creator.py command, the DATA_PATH directory will contain a test_dataset directory and a set of output files, with the following structure:

      DATA_PATH
      ├── test_dataset
      │   ├── files
      │   └── sequences
      ├── clean_tblout.tblout
      ├── cmscan.out
      ├── combined.fasta
      ├── fam_pdb_chain.csv
      ├── final.fasta
      ├── raw_experimental_RNA_0_500.csv
      ├── sequence_identity_mat_clustal.csv
      └── tblout.tblout

raw_experimental_RNA_0_500.csv: Raw data from the PDB database.
combined.fasta: Sequences used in sequence identity calculation by Clustal Omega and Emboss, obtained after applying StructureLevelFilter and PDBFilter on the raw data.
sequence_identity_mat_clustal.csv: Sequence identity matrix obtained from Clustal Omega and Emboss tools.
final.fasta: Final sequences in fasta format; the output if family analysis is not required.
cmscan.out, tblout.tblout, clean_tblout.tblout: Output files of the Infernal tool.
fam_pdb_chain.csv: Mapping of family and PDB chain, obtained after family search by Infernal; the final output for family analysis.
test_dataset/files: Directory containing dataframes and lists for structures at each filter level.
test_dataset/sequences: Directory containing sequences for all final structures in individual fasta files.
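
The FASTA outputs listed above (final.fasta, combined.fasta and the per-structure files in test_dataset/sequences) can be loaded with a few lines of plain Python. The reader below is a minimal sketch, not part of NucleoSeeker; the function name read_fasta is hypothetical, and it assumes standard FASTA files with unique header lines.

```python
def read_fasta(path):
    """Parse a FASTA file into a dict mapping header text to sequence."""
    records = {}
    header = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                header = line[1:]  # drop the leading ">"
                records[header] = []
            elif header is not None:
                records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}
```

For example, len(read_fasta("final.fasta")) gives the number of sequences in the final dataset when called from the directory that contains the file. If two records share the same header, only the last one is kept.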

For some simple examples, please take a look at this notebook.
