Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to deal with ambiguous bases in reads #294

Closed
mbhall88 opened this issue Sep 6, 2022 · 6 comments · Fixed by #295
Closed

How to deal with ambiguous bases in reads #294

mbhall88 opened this issue Sep 6, 2022 · 6 comments · Fixed by #295

Comments

@mbhall88
Copy link
Member

mbhall88 commented Sep 6, 2022

I have a fastq that gets zero coverage on all PRGs - which is very weird as there are lots of reads.

Looking at the log, pretty much every read has this warning

bad letter - found a non AGCT base in read so skipping read SRR1723506.87

Looking at the fastq, indeed, nearly every read has at least one N in it.

Is there a way we can deal with these?

For context, I came across this while testing drprg. mykrobe and tbprofiler make the correct predictions for this sample, but pandora doesn't even produce a genotyped VCF as it filters out all PRGs as they all fail filtering.

How does cortex deal with this sort of thing @iqbal-lab ?

The fastq is at /hps/nobackup/iqbal/mbhall/drprg/paper/results/filtered/illumina/PRJNA268900/SAMN03246262/SRR1723506/SRR1723506.filtered.fq.gz

@iqbal-lab
Copy link
Collaborator

Cortex effectively cuts a read at Ns. So here it would basically trim the N off the read, but still work on the rest. Pandora ought to be able to do this really

@leoisl
Copy link
Collaborator

leoisl commented Sep 6, 2022

One option is the one Zam mentioned: we simply trim the N off. It is likely the easiest solution to implement: we always process all reads, but we skip minimisers containing N. Another option is to translate the minimisers until a certain number of Ns, let's say 2. If the minimiser looks like this:

ACCCGNGTTTCNA

We could translate it into the 8 possible other minimisers, and map into the graph, generating one true hit and 7 other possible FP hits (FP only if they hit a PRG, otherwise no hit at all). With very high likelihood only the true minimizer will be kept after filtering the hits, as I think the spurious hits will be by random chance and very likely isolated hits. I propose we limit the number of Ns we translate to a small amount (e.g. 2 or 3) because a minimiser with lots of Ns, e.g.: NNNNNAANNNNNN does not bring any information really, just noise, and we can safely discard it, and it also improves performance.

So I think we have two options we could choose from, or maybe a third if you propose. Implementation-wise, I think both options are easy to implement. The second one has this translation step, but it is not hard to implement. We actually do it in make_prg already, but we translate only RYKMSW.

@mbhall88
Copy link
Member Author

mbhall88 commented Sep 6, 2022

I don't think the translate into all possible bases is a good idea, as this might confuse counts on "fake" alternate alleles. And if you have a file like this one where nearly all reads have N's you might get a weird scenario where you get an equal number of hits across all bases?

Maybe the minimap2 approach is best? We just skip minimizers with N in them. Rather than skipping the whole read?

@leoisl
Copy link
Collaborator

leoisl commented Sep 7, 2022

True, agree with skipping minimizers with N in them as the best strategy!

@iqbal-lab
Copy link
Collaborator

yes, skip minimizers with Ns.

@mbhall88 mbhall88 linked a pull request Sep 12, 2022 that will close this issue
@leoisl
Copy link
Collaborator

leoisl commented Sep 27, 2022

Closed via #295

@leoisl leoisl closed this as completed Sep 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants