Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using --revcomb on paired-end data #794

Open
ThomePauline opened this issue Jul 11, 2024 · 1 comment
Open

Using --revcomb on paired-end data #794

ThomePauline opened this issue Jul 11, 2024 · 1 comment

Comments

@ThomePauline
Copy link

Hello, I have a few questions and observations about the behaviour of the --revcomb flag for when cutting off primers from paired-end data.

  1. Using a single-read approach on F and R files separately: If a read in which no primer sequence is present, not even a fragment, is returned as its reverse complement in the output - what is this decision based on, is the biological part of the sequence compared with the others in the file?

  2. Using the --revcomp flag, different decisions are made about the same reads when using the single-read approach for F and R sequences separately and when using the paired-end-approach under the same -e and -O settings. I mean, in one case the read is reverse-complemented and marked as such, in the other case it isn't. What logic in how the algorithm works is this behaviour based on?

  3. If a read is reverse-complemented in the output and flagged with "rc", sometimes the fasta header is additionally renamed such that in for example the F reads file, which consistently has the part "1:N:" in all of its headers, this part is changed to "2:N:" as it would be in the headers of the reverse reads - but sometimes this does not happen and only the "rc" flag is attached to the header, leaving the rest unchanged.

  4. And a side note: when using the --revcomp flag on paired-end read data, the report that is printed doesn't give all the useful details that it gives for single reads, like how many times each primer fragment of each length exactly was removed and so on. It is reduced to only Total read pairs processed/written and Total basepairs processed/written. It would be very helpful to have the same statistics for paired-end data also :)

Our analysis doesn't depend on solving these questions, they are just observations we made while playing around with the many possibilities of cutadapt and we would like to share it with the community. I attach sample sequences with which all of the above can be reproduced (after renaming them to .fastq, which isn't allowed to be uploaded here). Below are the commands I used. Thank you very much for your consideration!!

This is cutadapt 4.8 with Python 3.11.3, installed with pip

paired-end approach:
cutadapt-venv/bin/cutadapt -g XGCGGTAATTCCAGCTCC -G XACTTTCGTTCTTGATYRR --revcomp -e 0.2 -O 6 -o sample_F_out_paired.fastq -p sample_R_out_paired.fastq sample_F.fastq sample_R.fastq

single-end apporoach:
cutadapt-venv/bin/cutadapt -g XGCGGTAATTCCAGCTCC --revcomp -e 0.2 -O 6 -o sample_F_out_single.fastq sample_F.fastq (to reproduce point 3, change to -e 0.3 and run on the F file)
cutadapt-venv/bin/cutadapt -g XACTTTCGTTCTTGATYRR --revcomp -e 0.2 -O 6 -o sample_R_out_single.fastq sample_R.fastq

sample_F.txt
sample_R.txt

@marcelm
Copy link
Owner

marcelm commented Aug 10, 2024

Hello, I have a few questions and observations about the behaviour of the --revcomb flag for when cutting off primers from paired-end data.

  1. Using a single-read approach on F and R files separately:

I recommend against treating R1 and R2 as individual single-end reads, especially when using --revcomp as the behavior is totally different:

  • For single-end reads, --revcomp reverse-complements the sequence.
  • For paired-end reads, --revcomp swaps R1 and R2. The idea is that swapping the two ends reverse-complements the sequenced fragment (the one whose ends were observed in R1 and R2).

The swapping itself is mentioned in the section in the documentation, but maybe this should be clarified.

If a read in which no primer sequence is present, not even a fragment, is returned as its reverse complement in the output [...]

If no primer is present in a single-end read, it should be left unchanged by --revcomp. If not, that is a bug. I was not able to reproduce this with your examples.

  1. Using the --revcomp flag, different decisions are made about the same reads when using the single-read approach for F and R sequences separately and when using the paired-end-approach under the same -e and -O settings. I mean, in one case the read is reverse-complemented and marked as such, in the other case it isn't. What logic in how the algorithm works is this behaviour based on?

Maybe my explanation above that --revcomp only swaps R1/R2 explains what you are seeing?

  1. If a read is reverse-complemented in the output and flagged with "rc", sometimes the fasta header is additionally renamed such that in for example the F reads file, which consistently has the part "1:N:" in all of its headers, this part is changed to "2:N:" as it would be in the headers of the reverse reads - but sometimes this does not happen and only the "rc" flag is attached to the header, leaving the rest unchanged.

Yes, that happens because R1 and R2 are swapped.

  1. And a side note: when using the --revcomp flag on paired-end read data, the report that is printed doesn't give all the useful details that it gives for single reads, like how many times each primer fragment of each length exactly was removed and so on. It is reduced to only Total read pairs processed/written and Total basepairs processed/written. It would be very helpful to have the same statistics for paired-end data also :)

That is a bug, thanks! I’ll look into this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants