Pipeline fails when run with a lot of cores #763

Open
kdivilov opened this issue Jul 15, 2024 · 6 comments
Labels: bug (Something isn't working)

Comments

@kdivilov

Description of the bug

I'm analyzing a 16S dataset with ~1,200 samples spread across 3 runs (unfortunately the dataset is not public yet, so I can't provide a reproducible example). I've found a bug in the ampliseq pipeline: if I run it with 120 cores, it fails (specifically at the diversity step), but if I run it with 20 cores, it finishes without any errors. I believe the issue is that the workflow gets ahead of itself due to the number of cores available and starts a qiime2 module before another requisite qiime2 module finishes.

Command used and terminal output

nextflow run nf-core/ampliseq -r 2.10.0 -profile singularity \
--input samplesheet.tsv \
--metadata metadata.tsv \
--outdir nfcore_ampliseq_GTDB \
--min_read_counts 1000 \
--ignore_empty_input_files \
--ignore_failed_trimming \
--ignore_failed_filtering \
--skip_cutadapt \
--trunclenf 200 \
--trunclenr 150 \
--vsearch_cluster \
--filter_ssu "bac" \
--exclude_taxa "mitochondria,chloroplast,archaea" \
--metadata_category_barplot "condition" \
--tax_agglom_max 7 \
--picrust \
--ancombc \
--dada_ref_taxonomy gtdb=R09-RS220 \
--dada_taxonomy_rc

Relevant files

No response

System information

Nextflow v23.10.1
nf-core/ampliseq v2.10.0
singularity v4.1.3
slurm v23.02.1

@kdivilov added the bug label on Jul 15, 2024
@d4straub
Collaborator

Hi there, thanks for the report.
A few details are missing:
(1) How do you specify the number of cores? I don't see that in the command.
(2) What is the actual error message? The .nextflow.log of the failed run, for example, would include it.

> I believe the issue is that the workflow gets ahead of itself due to the number of cores available and starts a qiime2 module before another requisite qiime2 module finishes.

That should be impossible; if it were to happen, it would indeed be a bug. You could also use a more up-to-date Nextflow version in the future, but I somewhat doubt that this is the cause.
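For context, in Nextflow's dataflow model a downstream process cannot even be scheduled before its upstream inputs exist, regardless of how many cores are available. A generic sketch (not ampliseq code):

process A {
    output:
    path 'x.txt'

    script:
    "echo hello > x.txt"
}

process B {
    input:
    path x

    script:
    "cat $x"
}

workflow {
    B(A())   // B only starts once A has emitted x.txt, however many cores are free
}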

@d4straub
Collaborator

Closing due to lack of information.

@kdivilov
Author

Sorry for the delay. I have attached the log file. I updated Nextflow to v24.04.3 for this run. I specified the number of cores using the '-c' option in Slurm's sbatch (sketched below).

nf_log.txt
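The submission script looked roughly like this (a sketch; memory and module names are placeholders, not my exact values):

#!/bin/bash
#SBATCH -c 120            # -c / --cpus-per-task: cores available to the Nextflow head job and its local tasks
#SBATCH --mem=64G         # placeholder
module load singularity   # placeholder for however Singularity is provided on the cluster

nextflow run nf-core/ampliseq -r 2.10.0 -profile singularity ...   # remaining options exactly as in the command above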

@d4straub
Collaborator

Thanks! The error message in that log file is:

Jul-19 16:20:51.757 [TaskFinalizer-5] ERROR nextflow.processor.TaskProcessor - Error executing process > 'NFCORE_AMPLISEQ:AMPLISEQ:QIIME2_DIVERSITY:QIIME2_DIVERSITY_BETA (unweighted_unifrac_distance_matrix - HatcheryGut_Stayton_vs_HatcheryGut_Marion)'

Caused by:
  Process `NFCORE_AMPLISEQ:AMPLISEQ:QIIME2_DIVERSITY:QIIME2_DIVERSITY_BETA (unweighted_unifrac_distance_matrix - HatcheryGut_Stayton_vs_HatcheryGut_Marion)` terminated with an error exit status (1)


Command executed:

  export XDG_CONFIG_HOME="./xdgconfig"
  export MPLCONFIGDIR="./mplconfigdir"
  export NUMBA_CACHE_DIR="./numbacache"
  
  qiime diversity beta-group-significance \
      --i-distance-matrix unweighted_unifrac_distance_matrix.qza \
      --m-metadata-file metadata.tsv \
      --m-metadata-column "HatcheryGut_Stayton_vs_HatcheryGut_Marion" \
      --o-visualization unweighted_unifrac_distance_matrix-HatcheryGut_Stayton_vs_HatcheryGut_Marion.qzv \
      --p-pairwise
  qiime tools export \
      --input-path unweighted_unifrac_distance_matrix-HatcheryGut_Stayton_vs_HatcheryGut_Marion.qzv \
      --output-path beta_diversity/unweighted_unifrac_distance_matrix-HatcheryGut_Stayton_vs_HatcheryGut_Marion
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_AMPLISEQ:AMPLISEQ:QIIME2_DIVERSITY:QIIME2_DIVERSITY_BETA":
      qiime2: $( qiime --version | sed '1!d;s/.* //' )
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  QIIME is caching your current deployment for improved performance. This may take a few moments and should only happen once per deployment.
  Plugin error from diversity:
  
    [Errno 2] No such file or directory: '/tmp/qiime2/divilovk/processes/40-1721431222.49@divilovk/9b75043b-6c1f-454e-8320-ce1123afdd55.4883255971269328549/9b75043b-6c1f-454e-8320-ce1123afdd55' -> '/tmp/qiime2/divilovk/data/9b75043b-6c1f-454e-8320-ce1123afdd55'
  
  Debug info has been saved to /tmp/qiime2-q2cli-err-y03qaw_e.log

Work dir:
  /nfs6/core/scratch/divilovk/couch/microbiome/GTDB/work/ff/fcf14e1f706db6f3665c1098af479f

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
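For what it's worth, acting on that tip amounts to something like this (a sketch, using the work dir from the log above; .command.run re-creates the task environment, including the Singularity container):

cd /nfs6/core/scratch/divilovk/couch/microbiome/GTDB/work/ff/fcf14e1f706db6f3665c1098af479f
cat .command.sh     # the exact script that failed
bash .command.run   # re-execute the task in place with its original settings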

@d4straub reopened this on Jul 23, 2024
@d4straub
Collaborator

I am not sure what causes this error. It seems pretty likely to me that it's a problem with the tmp dir (maybe it was full, or was coincidentally deleted at that specific time point). My hypothesis is that it was a coincidence that the job didn't finish when you specified 120 CPUs but succeeded with 20 CPUs (QIIME2_DIVERSITY_BETA runs with only 2 cores by default). Even if you had modified the CPUs, my alternative hypothesis is that the second run simply started on a different node that had a working tmp dir.
To test that hypothesis, you would need to run the pipeline repeatedly with the high core count at different time points, and potentially contact your sysadmin about the tmp dir to make sure none of the nodes fills all available disk space. Let me know if, after several attempts (use -resume, but avoid caching by deleting the work dir of QIIME2_DIVERSITY_BETA each time, or by disabling its cache in the config as sketched below), the pipeline still fails with the high core count.
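A custom config passed to Nextflow with its -c option could look roughly like this (a sketch; cache = false forces the process to re-run even under -resume):

process {
    withName: 'QIIME2_DIVERSITY_BETA' {
        cache = false   // always re-execute this process, ignoring the resume cache
    }
}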

A possible next step: there is a discussion with some troubleshooting in the QIIME2 forum that might be related, see here.

@kdivilov
Author

Changing the tmp dir to one that has 130 TB of free space (roughly as sketched below) produces the same error, except that the /tmp paths in the error now point to the new tmp dir.
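For reference, I redirected the tmp dir with a custom config roughly like this (a sketch; the path is a placeholder for the filesystem with 130 TB free; QIIME2 puts its /tmp/qiime2-style cache under $TMPDIR):

env {
    TMPDIR = '/large/scratch/qiime2_tmp'   // placeholder; exported into every task environment
}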
