feat: deduplicate with UMIs #1358

mathiasbio · 2024-01-16T16:45:44Z

This PR:

Makes some significant changes to the TGA workflow by aiming to use the UMIs during the deduplication stage. This is a feature that only exists on the later versions of Sentieon.

The PR is tested on Sentieon version: sentieon-genomics-202308.03 which has this feature.

As this new feature was implemented it required some further changes to the workflows which ended up becoming a quite large refactoring. But in terms of effects on the quality of the analysis the only relevant changes are probably:

using UMIs in the dedup for all TGA workflows
adding adapter-trimming to the UMI workflow
updating Sentieon version

Some more info regarding using the UMIs in the dedup step. It could rescue a significant number of reads wrongfully discarded as duplicates with a purely position-based approach, and could probably replace the UMI workflow for many use-cases as error-correction with 3,1,1 approach may not be necessary to avoid false positives in the range of VAFs commonly looked for in the samples previously run in the UMI workflow (0.5 - 2% VAF)

For more info see user story: #1361

Further necessary features before merging this:

Replace Vardict with TNscope (OR verify that VarDict is performing better / same with these changes): feat: replace vardict with tnscope #1429
Rebuild TGA PONs with the new TGA UMI collapsed bams feat: update cnvkit pons #1465

Extra information:

Extra note:

Cap base quality script was added because Manta crashes if it detects bases in the reads with a higher quality than 70, which some UMI reads get after the UMI collapse
samtools collate + samtools fixmate was added to the TGA workflow because of bug in the new Sentieon dedup consensus tool
awk script and samtools calmd added to fix further issues in bamfile, documented here: [Bug] Sentieon Dedup with UMI consensus BAM issues #1472 and in the google sheet: https://docs.google.com/spreadsheets/d/1IZo1hcfYte2Vd7TjzQ-SjdX1wOqk9rHSNMQXvsDJbUM/edit?gid=573365273#gid=573365273

Blocked by:

Update Sentieon: [User Story] Update Sentieon #1384
This update requires updating Sentieon as the current version in production: sentieon-genomics-202010.02
Does not contain the consensus option for LocusCollector which is the step prior to Dedup:

Current prod:

Latest Sentieon version:

Issues to consider:

Percent duplicate metrics:

After implementing this, we are no longer able to retrieve % Duplicates and Optical Duplicates info from the ".metrics" file from dedup. The values are "0". How can this be fixed? I have notified Sentieon about this issue and they say that they will add more useful statistics to the report in the next version.

At the moment I have added standard non UMI Sentieon Dedup just to get the PERCENT DUPLICATES stats. But is it worth it?

UPDATE: In the latest version of Sentieon 202308.03 they claim to have added support for % Duplicates from the --consensus mode from Dedup, so maybe this extra rule can be removed!

Comparing the percent duplicates between using Dedup with or without the UMI consensus options shows that they are a bit different, which makes sense as we're now using UMIs to identify duplicates as well as position. With roughly 4% less duplicates seen in this example case while using UMIs. Meaning that we're should be both getting a better calculation of duplicates which will be good for future evaluations of lab-methods, and at the same time saving more reads.

	% duplicates dedup	% duplicates dedup --consensus
normal	0.540074	0.481727
tumor	0.569414	0.510155

With this new version of sentieon we can safely remove the extra Sentieon dedup!

Collapsed singleton or pairs?

Are overlapping mates in read-pairs maintained as Pairs or Singletons after UMI collapse (as in the other Sentieon UMI collapse tool)? I checked and they are singletons when the mates are overlapping.

I discussed with Sentieon and colleagues and it should be fine: #1361 But it would be good to add more SVs from TGA to the validation set and make sure we can find them.

Graph overview

TGA T+N

TGA T+N (3,1,1 UMI workflow)

TGA T+N (QC)

WGS T+N

TGA CNVkit PON

Added

UMI extraction and deduplication to TGA workflow
Adapter trimming of fastqs to UMI workflow
Cap base quality in bam for Manta input

Changed

Refactored multi workflow rule-files to separate files to decrease complexity
Refactored output files to in general comply with format {sample_type}.{sample_name}
Replaced Picard QC tools with matching Sentieon QC tools

Removed

UMI specific rules for UMI-extraction and alignment (using new TGA-rules instead)
Fastq and UMI trimming command-line options

Documentation

N/A
Updated Balsamic documentation to reflect the changes as needed for this PR.
- [Document Name]

Tests

Feature Tests

All QC metrics are still detectable in deliverables yaml for TGA T+N and WGS T+N
[ ]

Aggregate QC metric comparison to v15.0.0

Sheet with QC metric comparison:
https://docs.google.com/spreadsheets/d/1phztyRT1VS8ud6bf0JWwFo5PB2yWB0YgFXz_ZnXSF7U/edit?gid=0#gid=0

Notes on QC metrics:

In general QC metrics are improved, but at the cost of somewhat longer run-times, increased by around 50%. This is due to the need of some extra rules for extracting UMIs and post-processing the bamfiles (the cap_base_qualities script and the Sentieon UMI extract rules, but there's no multi-threading option for the UMI extract). However hopefully this will be compensated by replacing VarDict with TNscope, we'll have to see, either way the TGA workflow is already quite fast so while it's a negative development it shouldn't be that bad.

On the positive side the QC metrics show a increase in coverage of between 2-20% after using the UMIs, which should increase the quality of the analysis and reduce the amount samples failing coverage thresholds in production.

There's also a quite significant change in the number of variants called by VarDict in the TGA T/N case, of around 17%. But I will investigate this change in the associated PR where VarDict will be replaced by TNscope: #1429

QC metrics look good and we can proceed without further investigation.

Pipeline Integrity Tests

Report deliver (generation of the .hk file)
- N/A
- Verified
TGA T/O Workflow
- N/A
- Verified
TGA T/N Workflow
- N/A
- Verified
UMI T/O Workflow
- N/A
- Verified
UMI T/N Workflow
- N/A
- Verified
WGS T/O Workflow
- N/A
- Verified
WGS T/N Workflow
- N/A
- Verified
QC Workflow
- N/A
- Verified
PON Workflow
- N/A
- Verified

Clinical Genomics Stockholm

Documentation

Atlas documentation
- N/A
- Updated: [Link]
Web portal for Clinical Genomics
- N/A
- Updated: [Link]

User Changes

N/A
This PR affects the output files or results.
- User feedback is considered unnecessary because [Justification].
- Affected users have been included in the development process and given a chance to provide feedback.

Infrastructure Changes

Stored files in Housekeeper
- N/A
- Updated: [Link]
CG (CLI and delivered/uploaded files)
- N/A
- Updated: [Link]
Servers (configuration files on Hasta)
- N/A
- Updated: [Link]
Scout interface
- N/A
- Updated: [Link]

Checklist

Important

Ensure that all checkboxes below are ticked before merging.

For Developers

PR Description
- Provided a comprehensive description of the PR.
- Linked relevant user stories or issues to the PR.
Documentation
- Verified and updated documentation if necessary.
Tests
- Described and tested the functionality addressed in the PR.
- Ensured integration of the new code with existing workflows.
- Confirmed that meaningful unit tests were added for the changes introduced.
- Checked that the PR has successfully passed all relevant code smells and coverage checks.
Review
- Addressed and resolved all the feedback provided during the code review process.
- Obtained final approval from designated reviewers.

For Reviewers

Code
- Code implements the intended features or fixes the reported issue.
- Code follows the project's coding standards and style guide.
Documentation
- Pipeline changes are well-documented in the CHANGELOG and relevant documentation.
Tests
- The author provided a description of their manual testing, including consideration of edge cases and boundary
  conditions where applicable, with satisfactory results.
Review
- Confirmed that the developer has addressed all the comments during the code review.

…eadername

codecov · 2024-01-16T17:00:41Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.48%. Comparing base (7d529e6) to head (a50ff90).
Report is 13 commits behind head on develop.

Additional details and impacted files

@@           Coverage Diff            @@
##           develop    #1358   +/-   ##
========================================
  Coverage    99.48%   99.48%           
========================================
  Files           40       40           
  Lines         1932     1957   +25     
========================================
+ Hits          1922     1947   +25     
  Misses          10       10

Flag	Coverage Δ
unittests	`99.48% <100.00%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

sonarcloud · 2024-01-19T11:31:24Z

Quality Gate passed

The SonarCloud Quality Gate passed, but some issues were introduced.

8 New issues
0 Security Hotspots
No data about Coverage
0.6% Duplication on New Code

See analysis details on SonarCloud

BALSAMIC/assets/scripts/cap_base_quality_in_bam.py

BALSAMIC/snakemake_rules/align/tga_sentieon_alignment.rule

BALSAMIC/snakemake_rules/align/wgs_bam_postprocess.rule

ivadym

Codewise looks great, as well as the results! 💯 Amazing work and fantastic documentation of the process! 🤩

mathiasbio · 2024-07-11T08:05:58Z

Thanks for the reviews! As discussed I'll not merge this however until I have fully tested the replace VarDict PR, and before this PR is entirely complete I'll still need to add some documentation 🙏

sonarcloud · 2024-08-15T12:59:44Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

mathiasbio added 14 commits October 20, 2023 14:43

restructure fastp trim for TGA to extract UMIs into tags instead of h…

db91c59

…eadername

divided alignment in two

6bbaa86

add gzip to interleaved umi trimmed fastqs

33c5e8e

remove detect adapters for pe in fastp

7df5e79

add rule sleep before starting analysis

8d348cf

merge release13

d85089e

add extra rule to PON workflow

5ac29cd

merge release 13

5b3c2bd

increase time to 200 sec

f184460

add sleep requirement to concatenate umi

3ac67a5

fix qc workflow

11064da

fix merge conflict

eb6d5c1

add new rules and functionality to create consensus reads in dedup

557b064

fix bugs

8e6d29b

mathiasbio added 6 commits January 16, 2024 18:06

hardcode new version of sentieon

62f3b11

remove failing env variable assignment

ad2e0f8

remove env in umi rule

1568ff6

fix header

aa04556

fix bug

e923137

fix header command

1a38197

mathiasbio linked an issue Jan 19, 2024 that may be closed by this pull request

[User Story] Deduplication with consensus collapse #1361

Open

3 tasks

mathiasbio added 2 commits January 19, 2024 12:24

fix bugs

985f73c

fix bug

ac0af6f

mathiasbio added 2 commits April 15, 2024 12:01

merge develop

8c8d241

fix changelog

8330ef3

mathiasbio changed the base branch from develop to update_sentieon April 15, 2024 10:14

mathiasbio changed the title ~~Deduplicate with UMIs~~ feat: deduplicate with UMIs Apr 15, 2024

mathiasbio mentioned this pull request Apr 15, 2024

feat: replace vardict with tnscope #1429

Open

58 tasks

fix coverage

a79d9f9

ivadym reviewed Jul 10, 2024

View reviewed changes

BALSAMIC/assets/scripts/cap_base_quality_in_bam.py Outdated Show resolved Hide resolved

ivadym reviewed Jul 10, 2024

View reviewed changes

BALSAMIC/snakemake_rules/align/tga_sentieon_alignment.rule Show resolved Hide resolved

ivadym reviewed Jul 10, 2024

View reviewed changes

BALSAMIC/snakemake_rules/align/wgs_bam_postprocess.rule Outdated Show resolved Hide resolved

ivadym approved these changes Jul 10, 2024

View reviewed changes

fix

d609c99

mathiasbio added 4 commits July 11, 2024 13:50

update documentation

133a0dd

adjust docs

21e7032

remove header params

33f3df1

fix pon final file

0249d05

This was referenced Jul 25, 2024

[User Story] Update TIDDIT to make use of single-end reads #1467

Open

feat: extend bedfiles for CNV analysis #1469

Merged

mathiasbio added 7 commits July 30, 2024 11:53

remove duplicate rules refactoring

babb85d

fix issues with bam format

d7a1e42

fix

804795f

fix

f665d0d

add more tmp dirs

afb731b

increase speed by sorting before calmd

a6589d5

fix

154e52b

This was referenced Aug 13, 2024

[Bug] Sentieon Dedup with UMI consensus BAM issues #1472

Closed

fix: TGA optical dup detection #1292

Closed

mathiasbio linked an issue Aug 15, 2024 that may be closed by this pull request

Fix detect optical duplicates in TGA workflow #1291

Open

mathiasbio self-assigned this Aug 15, 2024

mathiasbio added this to the Release 16 milestone Aug 15, 2024

mathiasbio mentioned this pull request Aug 15, 2024

[User Story] Record optical duplicates in QC report #1282

Open

5 tasks

mathiasbio linked an issue Aug 15, 2024 that may be closed by this pull request

[User Story] Record optical duplicates in QC report #1282

Open

5 tasks

change umi edit distance to 1

a50ff90

mathiasbio mentioned this pull request Sep 2, 2024

feat: merge vardict and tnscope #1475

Open

65 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: deduplicate with UMIs #1358

feat: deduplicate with UMIs #1358

mathiasbio commented Jan 16, 2024 •

edited

Loading

codecov bot commented Jan 16, 2024 •

edited

Loading

sonarcloud bot commented Jan 19, 2024

ivadym left a comment

mathiasbio commented Jul 11, 2024

sonarcloud bot commented Aug 15, 2024

feat: deduplicate with UMIs #1358

Are you sure you want to change the base?

feat: deduplicate with UMIs #1358

Conversation

mathiasbio commented Jan 16, 2024 • edited Loading

This PR:

Further necessary features before merging this:

Extra information:

Graph overview

Added

Changed

Removed

Documentation

Tests

Feature Tests

Aggregate QC metric comparison to v15.0.0

Pipeline Integrity Tests

Clinical Genomics Stockholm

Documentation

User Changes

Infrastructure Changes

Checklist

For Developers

For Reviewers

codecov bot commented Jan 16, 2024 • edited Loading

Codecov Report

sonarcloud bot commented Jan 19, 2024

Quality Gate passed

ivadym left a comment

Choose a reason for hiding this comment

mathiasbio commented Jul 11, 2024

sonarcloud bot commented Aug 15, 2024

Quality Gate passed

mathiasbio commented Jan 16, 2024 •

edited

Loading

codecov bot commented Jan 16, 2024 •

edited

Loading