Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: deduplicate with UMIs #1358

Open
wants to merge 176 commits into
base: develop
Choose a base branch
from
Open

feat: deduplicate with UMIs #1358

wants to merge 176 commits into from

Conversation

mathiasbio
Copy link
Contributor

@mathiasbio mathiasbio commented Jan 16, 2024

This PR:

Makes some significant changes to the TGA workflow by aiming to use the UMIs during the deduplication stage. This is a feature that only exists on the later versions of Sentieon.

The PR is tested on Sentieon version: sentieon-genomics-202308.03 which has this feature.

As this new feature was implemented it required some further changes to the workflows which ended up becoming a quite large refactoring. But in terms of effects on the quality of the analysis the only relevant changes are probably:

  • using UMIs in the dedup for all TGA workflows
  • adding adapter-trimming to the UMI workflow
  • updating Sentieon version

Some more info regarding using the UMIs in the dedup step. It could rescue a significant number of reads wrongfully discarded as duplicates with a purely position-based approach, and could probably replace the UMI workflow for many use-cases as error-correction with 3,1,1 approach may not be necessary to avoid false positives in the range of VAFs commonly looked for in the samples previously run in the UMI workflow (0.5 - 2% VAF)

For more info see user story: #1361

Further necessary features before merging this:

Extra information:

Extra note:

Blocked by:

  • Update Sentieon: [User Story] Update Sentieon #1384
    This update requires updating Sentieon as the current version in production: sentieon-genomics-202010.02
    Does not contain the consensus option for LocusCollector which is the step prior to Dedup:

Current prod:
image

Latest Sentieon version:
image

Issues to consider:

  • Percent duplicate metrics:

After implementing this, we are no longer able to retrieve % Duplicates and Optical Duplicates info from the ".metrics" file from dedup. The values are "0". How can this be fixed? I have notified Sentieon about this issue and they say that they will add more useful statistics to the report in the next version.

At the moment I have added standard non UMI Sentieon Dedup just to get the PERCENT DUPLICATES stats. But is it worth it?

UPDATE: In the latest version of Sentieon 202308.03 they claim to have added support for % Duplicates from the --consensus mode from Dedup, so maybe this extra rule can be removed!

Comparing the percent duplicates between using Dedup with or without the UMI consensus options shows that they are a bit different, which makes sense as we're now using UMIs to identify duplicates as well as position. With roughly 4% less duplicates seen in this example case while using UMIs. Meaning that we're should be both getting a better calculation of duplicates which will be good for future evaluations of lab-methods, and at the same time saving more reads.

% duplicates dedup % duplicates dedup --consensus
normal 0.540074 0.481727
tumor 0.569414 0.510155

With this new version of sentieon we can safely remove the extra Sentieon dedup!

  • Collapsed singleton or pairs?

Are overlapping mates in read-pairs maintained as Pairs or Singletons after UMI collapse (as in the other Sentieon UMI collapse tool)? I checked and they are singletons when the mates are overlapping.

I discussed with Sentieon and colleagues and it should be fine: #1361 But it would be good to add more SVs from TGA to the validation set and make sure we can find them.

Graph overview

TGA T+N

image

TGA T+N (3,1,1 UMI workflow)

image

TGA T+N (QC)

image

WGS T+N

image

TGA CNVkit PON

image

Added

  • UMI extraction and deduplication to TGA workflow
  • Adapter trimming of fastqs to UMI workflow
  • Cap base quality in bam for Manta input

Changed

  • Refactored multi workflow rule-files to separate files to decrease complexity
  • Refactored output files to in general comply with format {sample_type}.{sample_name}
  • Replaced Picard QC tools with matching Sentieon QC tools

Removed

  • UMI specific rules for UMI-extraction and alignment (using new TGA-rules instead)
  • Fastq and UMI trimming command-line options

Documentation

  • N/A
  • Updated Balsamic documentation to reflect the changes as needed for this PR.
    • [Document Name]

Tests

Feature Tests

  • All QC metrics are still detectable in deliverables yaml for TGA T+N and WGS T+N
  • [ ]

Aggregate QC metric comparison to v15.0.0

Sheet with QC metric comparison:
https://docs.google.com/spreadsheets/d/1phztyRT1VS8ud6bf0JWwFo5PB2yWB0YgFXz_ZnXSF7U/edit?gid=0#gid=0

Notes on QC metrics:

In general QC metrics are improved, but at the cost of somewhat longer run-times, increased by around 50%. This is due to the need of some extra rules for extracting UMIs and post-processing the bamfiles (the cap_base_qualities script and the Sentieon UMI extract rules, but there's no multi-threading option for the UMI extract). However hopefully this will be compensated by replacing VarDict with TNscope, we'll have to see, either way the TGA workflow is already quite fast so while it's a negative development it shouldn't be that bad.

On the positive side the QC metrics show a increase in coverage of between 2-20% after using the UMIs, which should increase the quality of the analysis and reduce the amount samples failing coverage thresholds in production.

There's also a quite significant change in the number of variants called by VarDict in the TGA T/N case, of around 17%. But I will investigate this change in the associated PR where VarDict will be replaced by TNscope: #1429

  • QC metrics look good and we can proceed without further investigation.

Pipeline Integrity Tests

  • Report deliver (generation of the .hk file)
    • N/A
    • Verified
  • TGA T/O Workflow
    • N/A
    • Verified
  • TGA T/N Workflow
    • N/A
    • Verified
  • UMI T/O Workflow
    • N/A
    • Verified
  • UMI T/N Workflow
    • N/A
    • Verified
  • WGS T/O Workflow
    • N/A
    • Verified
  • WGS T/N Workflow
    • N/A
    • Verified
  • QC Workflow
    • N/A
    • Verified
  • PON Workflow
    • N/A
    • Verified

Clinical Genomics Stockholm

Documentation

  • Atlas documentation
    • N/A
    • Updated: [Link]
  • Web portal for Clinical Genomics
    • N/A
    • Updated: [Link]

User Changes

  • N/A
  • This PR affects the output files or results.
    • User feedback is considered unnecessary because [Justification].
    • Affected users have been included in the development process and given a chance to provide feedback.

Infrastructure Changes

  • Stored files in Housekeeper
    • N/A
    • Updated: [Link]
  • CG (CLI and delivered/uploaded files)
    • N/A
    • Updated: [Link]
  • Servers (configuration files on Hasta)
    • N/A
    • Updated: [Link]
  • Scout interface
    • N/A
    • Updated: [Link]

Checklist

Important

Ensure that all checkboxes below are ticked before merging.

For Developers

  • PR Description
    • Provided a comprehensive description of the PR.
    • Linked relevant user stories or issues to the PR.
  • Documentation
    • Verified and updated documentation if necessary.
  • Tests
    • Described and tested the functionality addressed in the PR.
    • Ensured integration of the new code with existing workflows.
    • Confirmed that meaningful unit tests were added for the changes introduced.
    • Checked that the PR has successfully passed all relevant code smells and coverage checks.
  • Review
    • Addressed and resolved all the feedback provided during the code review process.
    • Obtained final approval from designated reviewers.

For Reviewers

  • Code
    • Code implements the intended features or fixes the reported issue.
    • Code follows the project's coding standards and style guide.
  • Documentation
    • Pipeline changes are well-documented in the CHANGELOG and relevant documentation.
  • Tests
    • The author provided a description of their manual testing, including consideration of edge cases and boundary
      conditions where applicable, with satisfactory results.
  • Review
    • Confirmed that the developer has addressed all the comments during the code review.

Copy link

codecov bot commented Jan 16, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.48%. Comparing base (7d529e6) to head (a50ff90).
Report is 13 commits behind head on develop.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #1358   +/-   ##
========================================
  Coverage    99.48%   99.48%           
========================================
  Files           40       40           
  Lines         1932     1957   +25     
========================================
+ Hits          1922     1947   +25     
  Misses          10       10           
Flag Coverage Δ
unittests 99.48% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@mathiasbio mathiasbio linked an issue Jan 19, 2024 that may be closed by this pull request
3 tasks
Copy link

sonarcloud bot commented Jan 19, 2024

Quality Gate Passed Quality Gate passed

The SonarCloud Quality Gate passed, but some issues were introduced.

8 New issues
0 Security Hotspots
No data about Coverage
0.6% Duplication on New Code

See analysis details on SonarCloud

@mathiasbio mathiasbio changed the base branch from develop to update_sentieon April 15, 2024 10:14
@mathiasbio mathiasbio changed the title Deduplicate with UMIs feat: deduplicate with UMIs Apr 15, 2024
Copy link
Contributor

@ivadym ivadym left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codewise looks great, as well as the results! 💯 Amazing work and fantastic documentation of the process! 🤩

@mathiasbio
Copy link
Contributor Author

Thanks for the reviews! As discussed I'll not merge this however until I have fully tested the replace VarDict PR, and before this PR is entirely complete I'll still need to add some documentation 🙏

@mathiasbio mathiasbio linked an issue Aug 15, 2024 that may be closed by this pull request
@mathiasbio mathiasbio self-assigned this Aug 15, 2024
@mathiasbio mathiasbio added this to the Release 16 milestone Aug 15, 2024
@mathiasbio mathiasbio linked an issue Aug 15, 2024 that may be closed by this pull request
5 tasks
Copy link

sonarcloud bot commented Aug 15, 2024

@mathiasbio mathiasbio mentioned this pull request Sep 2, 2024
65 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Testing
3 participants