-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Using --revcomb on paired-end data #794
Comments
I recommend against treating R1 and R2 as individual single-end reads, especially when using
The swapping itself is mentioned in the section in the documentation, but maybe this should be clarified.
If no primer is present in a single-end read, it should be left unchanged by
Maybe my explanation above that
Yes, that happens because R1 and R2 are swapped.
That is a bug, thanks! I’ll look into this. |
Hello, I have a few questions and observations about the behaviour of the --revcomb flag for when cutting off primers from paired-end data.
Using a single-read approach on F and R files separately: If a read in which no primer sequence is present, not even a fragment, is returned as its reverse complement in the output - what is this decision based on, is the biological part of the sequence compared with the others in the file?
Using the --revcomp flag, different decisions are made about the same reads when using the single-read approach for F and R sequences separately and when using the paired-end-approach under the same -e and -O settings. I mean, in one case the read is reverse-complemented and marked as such, in the other case it isn't. What logic in how the algorithm works is this behaviour based on?
If a read is reverse-complemented in the output and flagged with "rc", sometimes the fasta header is additionally renamed such that in for example the F reads file, which consistently has the part "1:N:" in all of its headers, this part is changed to "2:N:" as it would be in the headers of the reverse reads - but sometimes this does not happen and only the "rc" flag is attached to the header, leaving the rest unchanged.
And a side note: when using the --revcomp flag on paired-end read data, the report that is printed doesn't give all the useful details that it gives for single reads, like how many times each primer fragment of each length exactly was removed and so on. It is reduced to only Total read pairs processed/written and Total basepairs processed/written. It would be very helpful to have the same statistics for paired-end data also :)
Our analysis doesn't depend on solving these questions, they are just observations we made while playing around with the many possibilities of cutadapt and we would like to share it with the community. I attach sample sequences with which all of the above can be reproduced (after renaming them to .fastq, which isn't allowed to be uploaded here). Below are the commands I used. Thank you very much for your consideration!!
This is cutadapt 4.8 with Python 3.11.3, installed with pip
paired-end approach:
cutadapt-venv/bin/cutadapt -g XGCGGTAATTCCAGCTCC -G XACTTTCGTTCTTGATYRR --revcomp -e 0.2 -O 6 -o sample_F_out_paired.fastq -p sample_R_out_paired.fastq sample_F.fastq sample_R.fastq
single-end apporoach:
cutadapt-venv/bin/cutadapt -g XGCGGTAATTCCAGCTCC --revcomp -e 0.2 -O 6 -o sample_F_out_single.fastq sample_F.fastq (to reproduce point 3, change to -e 0.3 and run on the F file)
cutadapt-venv/bin/cutadapt -g XACTTTCGTTCTTGATYRR --revcomp -e 0.2 -O 6 -o sample_R_out_single.fastq sample_R.fastq
sample_F.txt
sample_R.txt
The text was updated successfully, but these errors were encountered: