fastq filename convention and sample name parsing #182

Closed
janprovaz opened this issue Nov 9, 2020 · 4 comments · Fixed by #184

@janprovaz

Hi Everyone!

The sample name is parsed by splitting the filename on _.
This breaks when the filenames differ from xxxx_L001_R{1,2}_001.fastq.gz or contain fewer _ than expected.
The documentation does not make it obvious that _L001_ should be in the filename.

I had files named like G0-3-2_R1_001.fastq.gz, where G0-3-2 is the sample name (only the numbers varied between samples, single digits in all cases).

Running with default parameters:
./nextflow run nf-core/ampliseq -profile singularity --input reads --FW_primer GTGCCAGCMGCCGCGGTAA --RV_primer GGACTACHVGGGTWTCTAAT --metadata metadata.txt

Resulted in:

Error executing process > 'multiqc (1)'
Caused by:
 Process `multiqc` input file name collision -- There are multiple input files for each of the following file names: cutadapt/logs/cutadapt_log_.txt

Notice the absence of the sample name in the file name cutadapt_log_.txt.

I renamed the input files to follow the pattern G032_L001_R1_001.fastq.gz and the run went through.
I also tested G0-3-2_L001_R1_001.fastq.gz, which worked as well.
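
The collision reported above can be reproduced with a small Python sketch. This is not the code in main.nf: naive_sample_name is a hypothetical stand-in that returns an empty name when the id contains no _, and the second id (G0-3-3) is made up for illustration. Only the log file pattern cutadapt_log_<name>.txt comes from the error message.

```python
def naive_sample_name(pair_id: str) -> str:
    """Hypothetical parser: keep the text before the first underscore.

    Returns an empty string when the id has no underscore, which is the
    failure mode described in this issue (an assumption about the cause,
    not the pipeline's actual code).
    """
    head, sep, _tail = pair_id.partition("_")
    return head if sep else ""


def cutadapt_log_name(pair_id: str) -> str:
    # File name pattern taken from the multiqc error message.
    return f"cutadapt_log_{naive_sample_name(pair_id)}.txt"


# Ids as they might look once the _R{1,2}_001.fastq.gz suffix is stripped.
failing = ["G0-3-2", "G0-3-3"]            # no underscore left in the id
working = ["G0-3-2_L001", "G0-3-3_L001"]  # renamed files

failing_logs = [cutadapt_log_name(p) for p in failing]
working_logs = [cutadapt_log_name(p) for p in working]

print(failing_logs)  # both entries are cutadapt_log_.txt, so they collide
print(working_logs)  # one distinct log name per sample
```

With an empty name, every sample writes to the same log file name, which is the situation multiqc rejects.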

Proposed change:
Either alter the documentation to reflect this, or change the way the filenames are parsed in main.nf.

Hope this helps 😄
Thank you for your amazing work!
All the best,
Jan

@erikrikarddaniel
Member

This behaviour reflects how QIIME2 parses file names. For other file name patterns, you can use the --manifest option.

Documentation in ampliseq can probably be improved as you suggest.
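
For anyone going the --manifest route, a manifest can be generated from the file names with a short script. The following is only a sketch: the header names sampleID, forwardReads and reverseReads, the tab-separated layout, and the /data/reads paths are assumptions for illustration, so check the ampliseq usage documentation for the exact format expected by your version.

```python
import csv
import io
import re
from collections import defaultdict

# Assumed read-file naming: <sample>_R1_001.fastq.gz and <sample>_R2_001.fastq.gz
READ_RE = re.compile(r"^(?P<sample>.+)_R(?P<read>[12])_001\.fastq\.gz$")


def build_manifest(paths):
    """Return a tab-separated manifest with one row per complete read pair."""
    pairs = defaultdict(dict)
    for path in paths:
        filename = path.rsplit("/", 1)[-1]
        match = READ_RE.match(filename)
        if match:
            pairs[match["sample"]][match["read"]] = path

    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    # Header names are an assumption; verify them against the pipeline docs.
    writer.writerow(["sampleID", "forwardReads", "reverseReads"])
    for sample in sorted(pairs):
        reads = pairs[sample]
        if "1" in reads and "2" in reads:  # skip samples missing a mate
            writer.writerow([sample, reads["1"], reads["2"]])
    return out.getvalue()


# Hypothetical absolute paths, for illustration only.
manifest = build_manifest([
    "/data/reads/G0-3-2_R1_001.fastq.gz",
    "/data/reads/G0-3-2_R2_001.fastq.gz",
])
print(manifest)
```

Because the sample name is taken from everything before _R1/_R2, names without any other _ (such as G0-3-2) are preserved as they are.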

@janprovaz
Author

Thank you, Daniel! Perfect :) 👍

@d4straub
Collaborator

d4straub commented Nov 12, 2020

Actually, I think it should work with --input. The problem seems to be that the name (not the read file name, but the "pair_id" determined by e.g. .fromFilePairs) is split on _ here, but no _ is found in the case you mention, and therefore no name is assigned. I am not sure why I never came across that case!

edit: _L001_ or similar is not required. The pipeline internally produces manifests to import read files into QIIME2, which makes naming more flexible. Any _ should do the job. The PR below improves that behavior and will work with the filenames you mention.
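
One way to make the name parsing tolerant of ids without _ can be sketched in Python as follows. This only illustrates the idea and is not the change made in the PR; sample_name is a hypothetical helper.

```python
def sample_name(pair_id: str) -> str:
    """Text before the first underscore, or the whole id if that would be empty."""
    head, _sep, _tail = pair_id.partition("_")
    # str.partition returns the whole string as head when "_" is absent;
    # the fallback also covers ids that start with an underscore.
    return head or pair_id


names = [sample_name(p) for p in ["G0-3-2", "G032_L001", "G0-3-3_L001"]]
print(names)  # ['G0-3-2', 'G032', 'G0-3-3']
```

The point of the fallback is that a sample never ends up with an empty name, so per-sample output files stay distinct.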

@d4straub
Collaborator

Fixed in dev.
