-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add an option for maximum average error rate (--max-aer) #725
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I’ll let this sit until you’re back to decide whether adding this is ok, but it looks useful to me.
If this is added, this also needs a documentation update and a changelog entry.
All done. Technically this is my day off, but I wanted to get some coding done for fun, so here you go. I don't know if I will respond to any re-review comments until tuesday. |
I am back. |
Ok, thinking about this, I’m in favor of the feature, but at the moment somewhat reluctant to add yet another command-line option. I’m thinking this should somehow reuse the Maybe you have a better idea? If not, I guess we’ll need to go with the extra option. |
Well it is new functionality. I do understand the reluctance. The truth is that cutadapt is very versatile and useful and an unmanageable amount of command line options is just a manifestation of that reality. See also GATK.
Well, I could make my own mean and lean tool that does just this and proliferate bioinformatics tools rather than command line options 😉. in fact I already have a tool. I am going to retire that once this feature is merged. I think the bioinformatics scene will benefit from having less tools rather than more. So I vote for making a tool obsolete if the cost is an extra command line option. I think cutadapt can survive an extra command line option, but it is your decision. |
Review comments have been addressed. |
I meant "how can this become part of Cutadapt without adding another command-line option to it". But it’s fine, let’s do this. |
Yes, I know. I also couldn't think of any better options. I should have stated that explicitly. Sorry. Thanks for merging! |
I did some tests on GM24385_1.fastq.gz which is one of the nanopore GIAB samples. It has about 3 million records ranging from 3bp to over a million bp.
--max-ee
is unsuitable for this use case. A max-ee of 3 would be absolute garbage for the 3bp reads and stellar never seen before quality for the 1 million bp reads.This pr adds a
--max-average-error-rate
/--max-aer
option to alleviate this problem. Using a--max-aer 0.1
option for instance to filter out the very worst long read sequencing reads seems like a decent practice.I will be off until coming tuesday, so I won't be able to respond as promptly as I usually do. I hope you will consider this PR as it will make using cutadapt for long-read sequencing much more viable.