Manual for using the Benchmark

Authors: Brian Luczak, Ben James, and Hani Girgis
The Bioinformatics Toolsmith Laboratory, the University of Tulsa, OK, USA

This Alignment-Free Benchmarking Tool is designed to run on Mac and Linux systems.

Requirements

GNU g++ 7.1 or higher
MATLAB with Statistics and Machine Learning Toolbox R2017a or higher
Perl
BioPerl

After downloading and unzipping the Benchmark directory, you will find 2 primary folders: Cpp (C++ code) and MATLAB.

The C++ code creates the k-mer histograms for the sequences and aligns the sequences using the Needleman-Wunsch alignment algorithm.
The MATLAB code loads the alignment and histogram files and evaluates the statistics.
There are 3 Perl scripts.
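
To make the term "k-mer histogram" concrete, here is a minimal Python sketch. It is not the C++ implementation shipped in Cpp/Hist; the function name `kmer_histogram`, the DNA alphabet, and the choice to skip windows containing other symbols are assumptions made for illustration.

```python
from itertools import product


def kmer_histogram(sequence, k, alphabet="ACGT"):
    """Count every k-mer over `alphabet` in `sequence` using a sliding window."""
    # Start every possible k-mer at zero so the histogram always has
    # len(alphabet) ** k entries (4 ** k for DNA).
    counts = {"".join(p): 0 for p in product(alphabet, repeat=k)}
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        # Assumption: windows with symbols outside the alphabet (e.g. N) are skipped.
        if window in counts:
            counts[window] += 1
    return counts


hist = kmer_histogram("ACGTACGT", 2)
print(hist["AC"], hist["CG"], hist["GT"], hist["TA"])  # prints "2 2 2 1"
```

A sequence of length n contributes n - k + 1 windows, so the counts of a clean DNA sequence always sum to that value.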

Compilation and installation

git clone https://github.com/TulsaBioinformaticsToolsmith/Alignment-Free-Kmer-Statistics/
cd Alignment-Free-Kmer-Statistics
./install.pl
cd Cpp/Align
CXX=g++-7 make    # or just "make" if GNU g++ is the default compiler
cd ../Hist
CXX=g++-7 make    # or just "make" if GNU g++ is the default compiler

Testing the Benchmark with a given folder of sequences

Once you have a set of sequences in FASTA format, the Perl script alignmentFree.pl lets you run the same experiments on your own data. It takes two arguments: the folder of FASTA sequences and the output directory.
Example: alignmentFree.pl ~/user/myFASTASeqfolder ~/user/Results
Notes:
- MATLAB must be on the system search path, i.e. it can be executed from any directory without the full path.
- The output directory (second command line argument) will be deleted if it already exists and a new one will be created.
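
Both notes can be checked before a long run. The following Python sketch is not part of the Benchmark; the helper name `preflight` and the executable name `matlab` are assumptions, and the path is the one from the example above.

```python
import shutil
from pathlib import Path


def preflight(output_dir, executable="matlab"):
    """Return a list of warnings to review before launching alignmentFree.pl."""
    warnings = []
    # shutil.which searches PATH the same way the shell does.
    if shutil.which(executable) is None:
        warnings.append(f"'{executable}' was not found on PATH")
    # The script deletes an existing output directory, so flag it first.
    if Path(output_dir).expanduser().exists():
        warnings.append(f"'{output_dir}' already exists and would be deleted")
    return warnings


for message in preflight("~/user/Results"):
    print("Warning:", message)
```

An empty list means neither problem was detected; the check does not verify the MATLAB version or the installed toolboxes.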
Steps of the Process:
1. Each of the sequences is aligned pairwise with every other sequence (no repetitions)
  - This establishes the benchmark for comparison
  - Because alignment is relatively inefficient, this step can take a long time, depending on the number of sequences and their lengths
    - The program reports its progress in 0.2% intervals
2. K-mer histograms are generated for k=1 through k=10; this range covers sequences up to one million base pairs long
3. MATLAB is opened and the EVALUATE function is called
  - This function picks an appropriate k-value based on the average sequence length
  - The files are converted into matrices using dlmread with ! as the delimiter; this delimiter should prevent collisions with sequence names in the FASTA format
  - Warning: at most 1 million sequence comparisons are allowed
4. The REVIEW_PAPER function is called
  - Three primary experiments from the paper are run on the histogram and alignment data
    - The K Nearest Neighbor experiment is repeated with the k-mer size minus 1 and minus 2 (as long as the initial k-mer size is large enough)
5. The results can be found in the output directory provided by the user. There are 3 folders in the results:
  - Align: contains the file with the alignment identity scores for the sequences provided
  - Hist: contains the histogram files, k-mer size 1 through 10 are provided
  - Matlab_EVAL: contains the complete MATLAB evaluation with figures
    - exp1 is the Sensitivity/Specificity experiment; it contains PDF versions of the figures as well as the distribution of alignment identity scores in 5% bins
    - exp2 is the Linear Correlation experiment; the text files list the specific R^2 values, and Fig holds the figures for the top-performing statistics (in each figure name, the first number is the threshold percentage and 1-5 is the ranking)
    - exp3 is the K Nearest Neighbor experiment
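
Step 1 compares every sequence with every other sequence once, so n sequences produce n(n-1)/2 alignments. The Python sketch below enumerates those pairs and relates the count to the 1 million comparison limit from step 3. It is an illustration only: the function names are hypothetical, sequences are numbered from 1 as in MATLAB, and it assumes the limit counts unordered pairs.

```python
from itertools import combinations

MAX_COMPARISONS = 1_000_000  # limit quoted in the step 3 warning


def unique_pairs(n):
    """Yield each 1-based pair (i, j) with i < j exactly once."""
    return combinations(range(1, n + 1), 2)


def pair_count(n):
    """Number of unordered pairs among n sequences: n * (n - 1) / 2."""
    return n * (n - 1) // 2


print(list(unique_pairs(4)))
# prints [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
print(pair_count(1414), pair_count(1415))
# prints "998991 1000405"
```

Under that assumption, a folder of about 1414 sequences is the largest that stays within the limit.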

Replicating the Review Paper Results
- After opening the MATLAB code in MATLAB (R2017a or higher), run the following command in the command window:
```
> Review_Paper_Results('output_dir')
```
- Here, output_dir is the path to the output directory where the results will be created.
- This will load the data from the workspace Example_Data.mat and run each of the experiments with the same parameters used in the paper.

MATLAB Conventions
- EVALUATE is the driver MATLAB Function that alignmentFree.pl calls
- REVIEW_PAPER is the primary MATLAB function that runs each of the experiments from the review paper
- All alignment identity score matrices have the numbers of the sequences being compared in columns 2 and 3.
- Example (see build_all_features): the row 91.8976 2 4 in an alignment matrix means that sequences 2 and 4 (rows 2 and 4 in the histogram matrix) have an identity score of 91.8976%.
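
The column convention can be illustrated with a short Python sketch that reads one row of an alignment matrix. This is not the MATLAB code from the repository; `parse_alignment_row` is a hypothetical helper, and the delimiter is left as a parameter because the separator used in a given file is an assumption here (whitespace by default, "!" if the file follows the dlmread convention).

```python
def parse_alignment_row(line, delimiter=None):
    """Split one row into (identity, i, j); delimiter=None splits on whitespace."""
    fields = line.strip().split(delimiter)
    identity = float(fields[0])            # column 1: alignment identity score
    i, j = int(fields[1]), int(fields[2])  # columns 2 and 3: sequence numbers
    return identity, i, j


identity, i, j = parse_alignment_row("91.8976 2 4")
print(f"sequences {i} and {j}: {identity}% identity")
# prints "sequences 2 and 4: 91.8976% identity"
```

The sequence numbers returned are the row indices to look up in the histogram matrix.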

Folders and files

S1_SourceCode
README.org
S10_p27_K=5,s.txt
S11_p27_K=5,p.txt
S12_p27_K=10,s.txt
S13_p27_K=10,p.txt
S14_micorbiome_K=1,s.txt
S15_microbiome_K=1,p.txt
S16_microbiome_K=5,s.txt
S17_microbiome_K=5,p.txt
S18_microbiome_K=10,s.txt
S19_microbiome_K=10,p.txt
S20_synthetic_K=1,s.txt
S21_synthetic_K=1,p.txt
S22_synthetic_K=5,s.txt
S23_synthetic_K=5,p.txt
S24_synthetic_K=10,s.txt
S25_synthetic_K=10,p.txt
S26_pacbio_K=1,s.txt
S27_pacbio_K=1,p.txt
S28_pacbio_K=5,s.txt
S29_pacbio_K=5,p.txt
S2_Specificity.pdf
S30_pacbio_K=10,s.txt
S31_pacbio_K=10,p.txt
S32_microbiome_kmer=3,K=5,p.txt
S33_microbiome_kmer=3,K=10,s.txt
S34_microbiome_kmer=3,K=10,p.txt
S35_microbiome_kmer=3,K=1,s.txt
S36_microbiome_kmer=3,K=1,p.txt
S37_microbiome_kmer=3,K=5,s.txt
S38_microbiome_kmer=2,K=1,s.txt
S39_microbiome_kmer=2,K=1,p.txt
S3_Rsquared_all.txt
S40_microbiome_kmer=2,K=5,s.txt
S41_microbiome_kmer=2,K=5,p.txt
S42_microbiome_kmer=2,K=10,s.txt
S43_microbiome_kmer=2,K=10,p.txt
S44_microbiome_kmer=5,K=1,s.txt
S45_microbiome_kmer=5,K=1,p.txt
S46_microbiome_kmer=5,K=5,s.txt
S47_microbiome_kmer=5,K=5,p.txt
S48_microbiome_kmer=5,K=10,s.txt
S49_microbiome_kmer=5,K=10,p.txt
S4_Rsquared_60.txt
S50_microbiome_kmer=6,K=1,s.txt
S51_microbiome_kmer=6,K=1,p.txt
S52_microbiome_kmer=6,K=5,s.txt
S53_microbiome_kmer=6,K=5,p.txt
S54_microbiome_kmer=6,K=10,s.txt
S55_microbiome_kmer=6,K=10,p.txt
S56_local_K=1,s.txt
S57_local_K=1,p.txt
S58_local_K=5,s.txt
S59_local_K=5,p.txt
S5_Rsquared_70.txt
S60_local_K=10,s.txt
S61_local_K=10,p.txt
S6_Rsquared_80.txt
S7_Rsquared_90.txt
S8_p27_K=1,s.txt
S9_p27_K=1,p.txt
