Using --revcomb on paired-end data #794

ThomePauline · 2024-07-11T07:18:28Z

Hello, I have a few questions and observations about the behaviour of the --revcomb flag for when cutting off primers from paired-end data.

Using a single-read approach on F and R files separately: If a read in which no primer sequence is present, not even a fragment, is returned as its reverse complement in the output - what is this decision based on, is the biological part of the sequence compared with the others in the file?
Using the --revcomp flag, different decisions are made about the same reads when using the single-read approach for F and R sequences separately and when using the paired-end-approach under the same -e and -O settings. I mean, in one case the read is reverse-complemented and marked as such, in the other case it isn't. What logic in how the algorithm works is this behaviour based on?
If a read is reverse-complemented in the output and flagged with "rc", sometimes the fasta header is additionally renamed such that in for example the F reads file, which consistently has the part "1:N:" in all of its headers, this part is changed to "2:N:" as it would be in the headers of the reverse reads - but sometimes this does not happen and only the "rc" flag is attached to the header, leaving the rest unchanged.
And a side note: when using the --revcomp flag on paired-end read data, the report that is printed doesn't give all the useful details that it gives for single reads, like how many times each primer fragment of each length exactly was removed and so on. It is reduced to only Total read pairs processed/written and Total basepairs processed/written. It would be very helpful to have the same statistics for paired-end data also :)

Our analysis doesn't depend on solving these questions, they are just observations we made while playing around with the many possibilities of cutadapt and we would like to share it with the community. I attach sample sequences with which all of the above can be reproduced (after renaming them to .fastq, which isn't allowed to be uploaded here). Below are the commands I used. Thank you very much for your consideration!!

This is cutadapt 4.8 with Python 3.11.3, installed with pip

paired-end approach:
cutadapt-venv/bin/cutadapt -g XGCGGTAATTCCAGCTCC -G XACTTTCGTTCTTGATYRR --revcomp -e 0.2 -O 6 -o sample_F_out_paired.fastq -p sample_R_out_paired.fastq sample_F.fastq sample_R.fastq

single-end apporoach:
cutadapt-venv/bin/cutadapt -g XGCGGTAATTCCAGCTCC --revcomp -e 0.2 -O 6 -o sample_F_out_single.fastq sample_F.fastq (to reproduce point 3, change to -e 0.3 and run on the F file)
cutadapt-venv/bin/cutadapt -g XACTTTCGTTCTTGATYRR --revcomp -e 0.2 -O 6 -o sample_R_out_single.fastq sample_R.fastq

sample_F.txt
sample_R.txt

marcelm · 2024-08-10T06:36:48Z

Hello, I have a few questions and observations about the behaviour of the --revcomb flag for when cutting off primers from paired-end data.

Using a single-read approach on F and R files separately:

I recommend against treating R1 and R2 as individual single-end reads, especially when using --revcomp as the behavior is totally different:

For single-end reads, --revcomp reverse-complements the sequence.
For paired-end reads, --revcomp swaps R1 and R2. The idea is that swapping the two ends reverse-complements the sequenced fragment (the one whose ends were observed in R1 and R2).

The swapping itself is mentioned in the section in the documentation, but maybe this should be clarified.

If a read in which no primer sequence is present, not even a fragment, is returned as its reverse complement in the output [...]

If no primer is present in a single-end read, it should be left unchanged by --revcomp. If not, that is a bug. I was not able to reproduce this with your examples.

Using the --revcomp flag, different decisions are made about the same reads when using the single-read approach for F and R sequences separately and when using the paired-end-approach under the same -e and -O settings. I mean, in one case the read is reverse-complemented and marked as such, in the other case it isn't. What logic in how the algorithm works is this behaviour based on?

Maybe my explanation above that --revcomp only swaps R1/R2 explains what you are seeing?

If a read is reverse-complemented in the output and flagged with "rc", sometimes the fasta header is additionally renamed such that in for example the F reads file, which consistently has the part "1:N:" in all of its headers, this part is changed to "2:N:" as it would be in the headers of the reverse reads - but sometimes this does not happen and only the "rc" flag is attached to the header, leaving the rest unchanged.

Yes, that happens because R1 and R2 are swapped.

And a side note: when using the --revcomp flag on paired-end read data, the report that is printed doesn't give all the useful details that it gives for single reads, like how many times each primer fragment of each length exactly was removed and so on. It is reduced to only Total read pairs processed/written and Total basepairs processed/written. It would be very helpful to have the same statistics for paired-end data also :)

That is a bug, thanks! I’ll look into this.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Using --revcomb on paired-end data #794

Using --revcomb on paired-end data #794

ThomePauline commented Jul 11, 2024

marcelm commented Aug 10, 2024

Using --revcomb on paired-end data #794

Using --revcomb on paired-end data #794

Comments

ThomePauline commented Jul 11, 2024

marcelm commented Aug 10, 2024