Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mcorr-bam - Very large phi_pool values from simulated data #10

Open
sid-krish opened this issue Nov 24, 2021 · 8 comments
Open

mcorr-bam - Very large phi_pool values from simulated data #10

sid-krish opened this issue Nov 24, 2021 · 8 comments

Comments

@sid-krish
Copy link

I am trying to analyse some simulated datasets but am getting very large results for phi_pool (i.e. 1e50).

I have some simulated datasets that I generated using the program fastsimbac that I would like to try with mcorr-bam. With the simulated datasets only the bam and reference fasta is available. For the gff I tried to simply point to the entire reference sequence that was used for alignment.

I have attached an example simulated bam, fasta and gff file where the phi_pool value is really large. The bam has been simulated with a population recombination rate (rho) 0.5, 50 genomes of size 10Kbp and population mutation rate (theta) of 0.1.

I used the following commands for mcorr-bam and mcorr-fit and have attached the bootstrapping_report file along with the other files generated by mcorr.

mcorr-bam test.gff Aligned.bam test_bam
mcorr-fit test_bam.csv test_bam

I was also wondering how phi_pool differs from the population recombination rate (rho)?
data.zip

@nhuanmc
Copy link

nhuanmc commented May 5, 2022

Same problem for me. However, my data is based on the real experiment with recombination of H. pylori. I used whole genome sequencing mapping to reference genome. I have 5 samples, one of them have phi_pool of value = 1.67191e+06. I ma not sure the reason why.

@mingleiR
Copy link

mingleiR commented Aug 8, 2022

Same problem for me. I use the mcorr to analyze the metagenomic reads using the custom pipeline (including align the reference genome against the read, filter the bam based on the criteria (NM < 5 & MapQ > 30) and sort the bam). The result from *_bootstrapping_report.txt" is as following:
"
[phi_pool]
value = 9.8242e+31
bootstrapping mean = 1.00873e+38
bootstrapping standard deviation = 2.39554e+38
bootstrapping size = 1000
"
Any suggestion will be appreciated.

Best,
Minglei

@jaresoles
Copy link

Hi,

Did you figure out the reason behind this?

@apsteinberg
Copy link
Collaborator

Hi folks,

First off, apologies to @sid-krish and @nhuanmc for our slow response to your questions. Looking at the data provided by @sid-krish (thanks for that!), it looks like the issue here is the same as that in issue #16 -- the data is being poorly fit by our coalescent model with recombination, which is resulting in physically unrealistic values for phi_pool.

For experimental data like @nhuanmc and @mingleiR, this means that there's either a lack of recombination or the data is too noisy to determine if recombination has taken place. I would recommend re-fitting it using "mcorrFitCompare" (see our ReadMe subsection 2-ii). This will fit the data both with the model with recombination and a model without recombination, and you can determine using AIC values whether or not your data is best fit with the recombination model (my guess is it's best fit by a coalescent model without recombination).

For more details, you can see the methods section of our new paper: https://elifesciences.org/articles/78533

For the simulations used by @sid-krish it looks like there just hasn't been sufficient time for enough diversity to build up for you to detect recombination in these simulation data. If you either increase the mutation rates, the simulation time, or the population size, you should see correlation profiles which are well fit by our recombination model.

Lastly, regarding @sid-krish about the population recombination rate (rho) vs phi_pool, you can find all the definitions of these terms in the supplement of our lab's nature methods paper describing mcorr (see here). phi_pool is: phi_pool = 2Tgamma, where T is the average coalescence time of a pair of sequences in the pool, and gamma is the recombination rate per basepair per generation. So it corresponds to the number of recombination events which have occurred since coalescence of the pool.

Happy to answer additional questions folks might have about this, and I hope that we can respond quicker to future queries!

Best,
Asher

@mingleiR
Copy link

For experimental data like @nhuanmc and @mingleiR, this means that there's either a lack of recombination or the data is too noisy to determine if recombination has taken place. I would recommend re-fitting it using "mcorrFitCompare" (see our ReadMe subsection 2-ii). This will fit the data both with the model with recombination and a model without recombination, and you can determine using AIC values whether or not your data is best fit with the recombination model (my guess is it's best fit by a coalescent model without recombination).

Hi Asher,

Thanks for your detailed explantion and helpful suggestion. I've checked the results of "mcorrFitCompare", and indeed found the AIC values of zero recombination model is a bit smaller than that of the recombination model. Because I have no prior knowledge about the recombination of the target species in the question, and not sure about my analytic pipeline. I'll scrutinize the pipeline using the published metagenomic datasets (in your work or other work citing your article) to validate it.

Thanks again.

Best,
Minglei

@apsteinberg
Copy link
Collaborator

Hi Minglei,

Sure thing! That sounds like a good plan. Keep us posted, I'm happy to help out in any way!

Best,
Asher

@mingleiR
Copy link

Hi Asher,

I re-analyze the gut microbiome metagenome in the paper (Lin et al, 2019), and found the statistic, such as gamma/mu is different from the result shown in the supplmentary (Table S6). The phi_pool value is relative smaller than my environmental samples (and not very large), but its bootstrapping mean and sd value is large, shown as below.

"
[phi_pool]
value = 0.147312
bootstrapping mean = 4.803e+15
bootstrapping standard deviation = 1.16911e+17
bootstrapping size = 1000
bootstrapping median = 0.142244
bootstrapping lower bound (5%) = 0.0577862
bootstrapping upper bound (95%) = 5.36008
"
My analyzing procedure was described as below
(1) downloading the reference genome and metagenome datasets
According to Table S6, I downloaded the genomic fasta sequence and GFF file of the reference genome (NC002695), and all metagenomic samples of infant 10131 (SRR605582, SRR605618, SRR924851, SRR925005, SRR925011, SRR925088). Note that the detailed accession number was not shown in the Table, and these SRR was retrieved from the original reference paper according to the ID of infant. The raw reads were trimmed the using trimmomatic with the minimum length 50 bp and a phred score threshold of 25 within a 4-bases sliding window.

(2) aligning the reads, merge the bam, and filter the alignment.
The paired reads are aligned against the reference using bowtie2 with default parameters. Then I filtered the mapped reads using "samtools view -F4", merged the bams from multiple runs together. Finally, the merged bam was converted into sam, and filter the reads with edit distance greater than 5 (i:NM:5) and the mapping quality (mapq) lower than 30 according to the criterion in the online method of the paper.

(3) estimating the relative recombination rates
Based on the filtered bam and the GFF of reference, the correlation profiles was calcuated using mcorr-bam, and the statistics, including the relative recombination rate was estimated using mcorr-fit.

Hope you could point the potential problem in the procedure above, which caused the difference in the summary statistics between both analyes.

Best,
Minglei

@apsteinberg
Copy link
Collaborator

Hi Minglei,

Apologies for the slow reply.

Regarding the bootstrapping, the reason that the mean and standard deviation are so high has to do with the boostrapping procedure. The way it works is it creates bootstrap replicates by randomly sampling with replacement from all the genes analyzed, then creates an average correlation profile over all those genes. Occasionally, one of the replicates is composed of many genes with very noisy correlation profiles. So what ends up happening is the model fits them poorly and yields physically unrealistic values for those parameters. Therefore, we typically take the median and the 95% CI from the bootstraps, but not the mean and standard deviation, which are often overweighted by these poor fits. In fact in the data you shared, both the median and bootstrap CI are a reasonable match for the parameter inferred from the actual data.

Regarding your pipeline, I have a few suggestions. It looks like from Mingzhi paper that he did two analyses from that patient (there are two rows listed as patient 10131). One was using this dataset and the other was using this one. The main differences I see in what you've done and what Mingzhi did to obtain his data are that it does not appear that he trimmed the reads for the infant data, I think this was just for the ancient samples. Then for step 2 (aligning the reads, merge the bam, and filter the alignment), he used the following steps:

bowtie2 --very-sensitive --no-unal -x $db -1 $sample1 -2 $sample2 -p $threads -S $samfile 
samtools view -h -o $bamfile $samfile 
rm $samfile
bamfilter --max-edit-distance 0 $bamfile $filteredbamfile 
samtools sort -n -o $sortnamefile $filteredbamfile 
bedtools bamtofastq -i $sortnamefile -fq $fastqfile1 -fq2 $fastqfile2

$db references the PanPhlAn database used (in this case it's for E coli). bamfilter is an in-house script which you can find here. bedtools can be found here.

After this step, the reads were mapped with smalt using the following commands:

smalt map -n $threads ${reference} $fastqfile1 $fastqfile2 | samtools view -h -F 4 - | samtools sort -@ $threads -o $sortedbamfile -

I am not sure where he specified mapping quality.

Step 3 looks to be the same.

I hope this helps and happy to keep trying to troubleshoot.

Best,
Asher

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants