NucleoSeeker - Precision filtering of RNA databases to curate high-quality datasets

Currently, only Unix-based systems (including macOS) are supported.

Setup Steps

Get Clustal Omega ready

  • Instructions to set up Clustal Omega can be found here.
  • Supported Clustal Omega version: 1.2.4
  wget http://www.clustal.org/omega/clustal-omega-1.2.4.tar.gz
  tar zxf clustal-omega-1.2.4.tar.gz
  cd clustal-omega-1.2.4
  ./configure --prefix /your/install/path
  make
  make check                 # optional: run automated tests
  make install               # optional: install Clustal Omega under the chosen prefix

  # or, on Debian/Ubuntu, install the packaged version

  sudo apt-get install clustalo

Get Emboss ready (Optional)

  • NOTE - Emboss is very slow, so we don't recommend it unless you are experimenting. Clustal Omega should be sufficient for most use cases.
  • To set up Emboss, please read the instructions here.
  • Supported Emboss version: 6.6.0

Get Infernal ready

  • To set up Infernal, follow the instructions here.
  • Supported Infernal version: 1.1.5
  wget http://eddylab.org/software/infernal/infernal.tar.gz
  tar zxf infernal.tar.gz
  cd infernal-1.1.5
  ./configure --prefix /your/install/path
  make
  make check                 # optional: run automated tests
  make install               # optional: install Infernal programs, man pages

  # or, on Debian/Ubuntu, install the packaged version

  sudo apt-get install infernal infernal-doc
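
Before moving on, it can help to confirm that the executables installed above are visible on PATH. The following is a minimal sketch, not part of NucleoSeeker. It assumes the binaries are named clustalo (Clustal Omega) and cmscan and cmpress (Infernal); the helper name missing_tools is hypothetical.

```python
import shutil

# Executables assumed to be needed: clustalo (Clustal Omega),
# cmscan and cmpress (Infernal). Adjust if your installation differs.
REQUIRED_TOOLS = ["clustalo", "cmscan", "cmpress"]


def missing_tools(names):
    """Return the names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

If a tool was built with a custom --prefix, its bin directory has to be added to PATH before this check can find it.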

Get Rfam.cm file ready

  • To use this tool, you need the Rfam covariance model file, which is available for download at https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz. The file must also be indexed with the cmpress command from Infernal (see above) before it can be used. If you don't have it yet, use the commands below:
  cd nucleoseeker
  mkdir -p rfam
  # download into the rfam directory created above
  wget https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz -O rfam/Rfam.cm.gz
  gunzip rfam/Rfam.cm.gz
  cmpress rfam/Rfam.cm
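
cmpress writes four binary index files next to the covariance model, with the suffixes .i1m, .i1i, .i1f and .i1p. A small check such as the sketch below can confirm that the step completed. The function name missing_cmpress_files is hypothetical and is not part of NucleoSeeker or Infernal.

```python
from pathlib import Path

# Index files that cmpress writes next to the covariance model file.
CMPRESS_SUFFIXES = (".i1m", ".i1i", ".i1f", ".i1p")


def missing_cmpress_files(cm_path):
    """Return the expected cmpress index files that do not exist yet."""
    cm_path = Path(cm_path)
    expected = [Path(str(cm_path) + suffix) for suffix in CMPRESS_SUFFIXES]
    return [p for p in expected if not p.is_file()]
```

Calling missing_cmpress_files with the path to your Rfam.cm file returns an empty list once all four index files are present; any paths it returns point to a cmpress run that did not finish.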
  

Generating new dataset

# We recommend using a virtual environment when using this tool

git clone https://github.com/theuutkarsh/nucleoseeker.git
cd nucleoseeker
pip install -r requirements.txt

Once the environment is ready, you can generate a dataset with the following command:

export DATA_PATH=/your/desired/path/to/save/the/dataset
python3 src/dataset_creator.py \
        --dataset_name test \
        --rfam_cm_path your/rfam/path \
        --exptl_method "X-RAY DIFFRACTION" \
        --resolution 3.6 \
        --year_range 2019 \
        --save 1
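
If you prefer to launch the same run from Python (for example, to loop over several resolutions), the command can be assembled as an argument list. This is a sketch under assumptions: it uses only the flags shown above, it must be run from the repository root, and the helper names build_command and run are hypothetical.

```python
import os
import subprocess


def build_command(dataset_name, rfam_cm_path, exptl_method,
                  resolution, year_range, save=1):
    """Assemble the dataset_creator.py call using only the flags shown above."""
    return [
        "python3", "src/dataset_creator.py",
        "--dataset_name", str(dataset_name),
        "--rfam_cm_path", str(rfam_cm_path),
        "--exptl_method", str(exptl_method),
        "--resolution", str(resolution),
        "--year_range", str(year_range),
        "--save", str(save),
    ]


def run(command, data_path):
    """Run the command with DATA_PATH set, raising if it exits with an error."""
    env = dict(os.environ, DATA_PATH=str(data_path))
    return subprocess.run(command, env=env, check=True)
```

Passing the arguments as a list means a value containing a space, such as "X-RAY DIFFRACTION", reaches the script as a single argument without any shell quoting.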

After running this command, the output for the dataset named test is written to the DATA_PATH directory with the following structure:

      DATA_PATH
      ├── test_dataset
      │   ├── files
      │   └── sequences
      ├── clean_tblout.tblout
      ├── cmscan.out
      ├── combined.fasta
      ├── fam_pdb_chain.csv
      ├── final.fasta
      ├── raw_experimental_RNA_0_500.csv
      ├── sequence_identity_mat_clustal.csv
      └── tblout.tblout

  • raw_experimental_RNA_0_500.csv: Raw data from the PDB database.

  • combined.fasta: Sequences used in sequence identity calculation by Clustal Omega and Emboss, obtained after applying StructureLevelFilter and PDBFilter on the raw data.

  • sequence_identity_mat_clustal.csv: Sequence identity matrix obtained from Clustal Omega and Emboss tools.

  • final.fasta: Final sequences in fasta format; the output if family analysis is not required.

  • cmscan.out, tblout.tblout, clean_tblout.tblout: Output files of the Infernal tool.

  • fam_pdb_chain.csv: Mapping of family and PDB chain, obtained after family search by Infernal; the final output for family analysis.

  • test_dataset/files: Directory containing dataframes and lists for structures at each filter level.

  • test_dataset/sequences: Directory containing sequences for all final structures in individual fasta files.
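
To inspect the final sequences without extra dependencies, a FASTA file such as final.fasta can be read with a few lines of Python. This is a minimal sketch rather than NucleoSeeker code. It assumes standard FASTA formatting (header lines starting with ">"), and the function name read_fasta is hypothetical.

```python
def read_fasta(path):
    """Parse a FASTA file into a {header: sequence} dict (minimal parser)."""
    records = {}
    header = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                header = line[1:]
                records[header] = ""
            elif header is not None:
                # sequence lines may be wrapped, so join them
                records[header] += line
    return records
```

Point read_fasta at wherever final.fasta was written for your run. If two records share the same header, the later one replaces the earlier one, so use a full FASTA library if that matters for your data.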

For some simple examples, please take a look at this notebook.
