Skip to content

Latest commit

 

History

History
93 lines (63 loc) · 5.6 KB

4.divergence_time_inference.md

File metadata and controls

93 lines (63 loc) · 5.6 KB

Bayesian divergence time estimation with MCMCtree

Here, we will estimate divergence times for the Lepus species tree using MCMCtree as implemented in PAML. The input for this analysis will be a concatenated alignment of the 1st, 2nd and 3rd codon positions (i.e., the alignment has three partitions) of all coding sequences contained in the capture alignment.

First, I retrieved all gene annotations overlapping with our targets. Convert a 0-based bed file with the coordinates of all the targets to a 1-based coordinates file.

awk '{OFS=":"; print $1,$2+1,$3}' 141216_Lepus_Ex1_MJ_EZ_HX1_capture_targets.bed > 141216_Lepus_Ex1_MJ_EZ_HX1_capture_targets.regions

Then, I used R package biomaRt to retrieve the list of transcript IDs that correspond to the longest transcript per gene in our capture design. This list will be written to the file Lepus_Ex1_MJ_EZ_HX1_capture_targets.longest_transcript.list.

Rscript retrieve_longest_transcript.R 

Using the list of transcripts, I extracted their gff3 information from the European rabbit reference. I downloaded the gff3 rabbit file from ENSEMBL.

while read -r LINE; do grep $LINE ../Oryctolagus_cuniculus.OryCun2.0.94_CDS.gff3 > "$LINE".gff3; done < ../Lepus_Ex1_MJ_EZ_HX1_capture_targets.longest_transcript.list

Now, let's create a folder for each transcript:

parallel 'mkdir {1}' ::: Lepus_Ex1_MJ_EZ_HX1_capture_targets.longest_transcript.list

... And find the scripts that are in reverse or forward strand.

for i in $(ls *.gff3); do echo ${i/.gff3/} $(cut -f7 $i | head -n1 -q ); done | grep '-' > reverse_strand_gff3.txt

for i in $(ls *.gff3); do echo ${i/.gff3/} $(cut -f7 $i | head -n1 -q ); done | grep '+' > forward_strand_gff3.txt

Then run the script create_CDS_alignment.sh twice, one time for transcripts in the reverse strang and another time for transcripts in forward strands. To run the script on reverse transcripts, hashtag out the section in the script that calls reverse_concatenate_reverse_CDS.py.

create_CDS_alignment.sh requires:

  1. Whole chromosome consensus fasta sequences for each individual generated previously. Alter the paths to access them and maintain a samples.txt file with sample IDs in your folder. Here I only used one individual per species.
  2. reverse_concatenate_reverse_CDS.py (concatenates exons in reverse order);
  3. bedtools to obtain fasta sequences from a gff3 file;
  4. AMAS to convert fasta alignments to phylip format, as required by MCMCtree;
  5. faidx to retrieve the rabbit chromosome as outgroup.

Then run create_CDS_alignment.sh. You need to provide the trancript ID and chromosome:

parallel -j 10 --colsep="\t" 'create_CDS_alignments.sh {1} $(head -n1 {1}.gff3 | cut -f1 )' :::: ../Lepus_Ex1_MJ_EZ_HX1_capture_targets.longest_transcript.list &

Run AMAS to obtain various kinds of information about each alignment.

parallel -j 10 'i={1}; AMAS summary -i $i -f phylip-int -d dna -o ${i/.phy/.summary.txt}' ::: $(ls *.CDS_algn.phy)

Based on the report from AMAS, I picked alignments with less than 20% missing data for the analysis. So, from these alignments only, I extracted the 1st, 2nd and 3rd codons using a custom script extract_123_CDS_algn.phy.

parallel -j 10 'i={1}; ./extract_123_CDS_algn.phy $i ${i/.CDS_algn.phy/}' ::: $(ls *.CDS_algn.phy)  &

Concatenate all the alignments:

AMAS concat --in-files *.CDS_1st_algn.fa -f fasta -d dna -u fasta -y nexus -t 9k_CDS_1st_alignments.fa -p 9k_CDS_1st_alignments_partitions.txt
AMAS concat --in-files *.CDS_2nd_algn.fa -f fasta -d dna -u fasta -y nexus -t 9k_CDS_2nd_alignments.fa -p 9k_CDS_2nd_alignments_partitions.txt
AMAS concat --in-files *.CDS_3rd_algn.fa -f fasta -d dna -u fasta -y nexus -t 9k_CDS_3rd_alignments.fa -p 9k_CDS_3rd_alignments_partitions.txt

Convert them to phylip with AMAS:

AMAS convert -i 9k_CDS_1st_alignments.fa -f fasta -u phylip -d dna
AMAS convert -i 9k_CDS_2nd_alignments.fa -f fasta -u phylip -d dna
AMAS convert -i 9k_CDS_3rd_alignments.fa -f fasta -u phylip -d dna

And then also concatenate all three partitions.

cat 9k_CDS_1st_alignments.fa-out.phy 9k_CDS_2nd_alignments.fa-out.phy 9k_CDS_3rd_alignments.fa-out.phy > 9k_CDS_123_alignments.phy 

Used 9k_CDS_123_alignments.phy as input for MCMCtree.

I ran MCMCtree using an Approximate Maximum Likelihood method (dos Reis and Yang 2019) as described in this tutorial. As described in the tutorial, there are two steps:

"In the first step the branch lengths are estimated by maximum likelihood, together with the gradient and Hessian (i.e. vector of first derivatives and matrix of second derivatives) of the likelihood function at the maximum likelihood estimates. The gradient and Hessian contain information about the curvature of the likelihood surface. In the second step, estimation of divergence times proceeds using MCMC, but using the gradient and Hessian to construct an approximation to the likelihood function by Taylor expansion."

You can find in this folder an example control file for the Hessian analysis mcmc_Hessian.ctl and for the second step mcmc_Approx.ctl. We ran the second step twice and checked for convergence using code in mcmc_convergence.R.

Go back to main page