feat: merge vardict and tnscope #1475

Open
wants to merge 201 commits into
base: cnvkit_to_gens

mathiasbio
Contributor

@mathiasbio mathiasbio commented Aug 23, 2024

Description

This PR is a branch of:
cnvkit_to_gens: #1448
-->
Which is a branch of:
update_cnvkit_pons: #1465
-->
Which is a branch of:
deduplicate_with_umi #1358

Part of the development work in this PR was originally done in #1429, which was partially reviewed by Vadym.

All upstream branches affect the quality of the analysis, and the full extent of these effects will be assessed in this PR in a sort of mini-validation.

The original plan for release 16.0.0 of balsamic was to replace VarDict with TNscope, and while all the tests were passing and the evaluation showed the changes to be an improvement in the analysis overall, the testing also revealed some confusing and unstable behaviour of TNscope which led to us deciding to keep VarDict and merge the TNscope results.

This PR in particular

This PR merges TNscope and VarDict results for the TGA workflows and tidies up the snakemake rules in general, for example by removing the rules for TNhaplotyper, which has not been in use for a long time.
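To illustrate what merging two callers' results means conceptually, here is a minimal Python sketch that takes the union of two call sets and records which caller supports each variant. It is not the implementation used in this PR (the pipeline works on VCF files through its snakemake rules); the function name `merge_calls` and the toy variants are hypothetical.

```python
from collections import defaultdict


def merge_calls(calls_by_caller):
    """Union the calls of several callers and record the supporting callers.

    calls_by_caller maps a caller name to an iterable of variant keys,
    where each key is a (chrom, pos, ref, alt) tuple.
    """
    merged = defaultdict(set)
    for caller, variants in calls_by_caller.items():
        for variant in variants:
            merged[variant].add(caller)
    # Sort keys and caller names so the output is deterministic.
    return {key: sorted(merged[key]) for key in sorted(merged)}


# Toy variants, made up for illustration only.
vardict_calls = [("1", 1000, "A", "G"), ("2", 2000, "C", "T")]
tnscope_calls = [("1", 1000, "A", "G"), ("3", 3000, "G", "GA")]

merged = merge_calls({"vardict": vardict_calls, "tnscope": tnscope_calls})
for variant, callers in merged.items():
    print(variant, callers)
```

Keeping the list of supporting callers per variant makes it possible to tell afterwards which calls were unique to one caller.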

For now it also removes the ML model in TNscope, since no ML model is currently available for the new version of Sentieon. Sentieon is working on a new model.

I also added a new filter to VarDict that allows up to 30% tumor in normal contamination, to align with what we're doing for TNscope.
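As a rough sketch of how such a threshold could behave, the Python function below flags a variant when the allele frequency in the normal sample exceeds 30% of the allele frequency in the tumor sample. This reading of the 30% limit is an assumption, as are the function name `exceeds_tumor_in_normal` and the example values; the actual filter expression lives in the pipeline's filter rules.

```python
def exceeds_tumor_in_normal(tumor_af, normal_af, max_ratio=0.3):
    """Return True when normal_af is more than max_ratio of tumor_af.

    Assumption: the contamination limit is expressed as the ratio of the
    normal allele frequency to the tumor allele frequency.
    """
    if tumor_af <= 0:
        # Without tumor support, any signal in the normal exceeds the limit.
        return normal_af > 0
    return normal_af / tumor_af > max_ratio


# Hypothetical allele frequencies for illustration.
print(exceeds_tumor_in_normal(tumor_af=0.20, normal_af=0.02))  # ratio 0.1
print(exceeds_tumor_in_normal(tumor_af=0.20, normal_af=0.10))  # ratio 0.5
```

Under this reading, the first variant would be kept and the second would be filtered out.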

Up to date graph of a TGA UMI T+N workflow below:

[Image: graph of the TGA UMI T+N workflow]

Changed

  • Refactored rules for bcftools filters
  • Renamed final UMI bamfile to ensure hsmetrics are collected in multiqc json
  • Changed ranked VCF from research to clinical

Added

  • TNscope for TGA workflows, merged with VarDict results
  • New filter for VarDict for tumor in normal contamination
  • Export TMP environment variables to rules that lack them
  • Added genmod ranked VCFs to be delivered

Removed

  • ML-model for TNscope is removed due to license issue with new version of Sentieon
  • All code associated with TNhaplotyper

Documentation

  • N/A
  • Updated Balsamic documentation to reflect the changes as needed for this PR.
    • balsamic_filters
    • balsamic_methods
    • balsamic_pon

Tests

Feature Tests

Google sheet with results here: https://docs.google.com/spreadsheets/d/13qjetgWKu9rD3hxfTfL6NNv9R_KkXsIOuqntBft6JvM/edit?usp=sharing

  • Quality of TGA analysis should be the same or improved for samples (compared to v15.0.0 validation)

Summary of results in google sheet

coverage

  • Coverage is improved across all samples tested after the update to the BAM creation

number of variants

  • Number of variants has been assessed in all workflows

There are some differences in the number of variants between this workflow and the previous.

  • Fewer SNVs in tumor normal TGA analysis

Despite merging the VarDict and TNscope variant calls, the tumor normal cases have fewer variants in this release than in the previous one. The variants that are filtered out will be assessed in more detail in the validation, but for now it can be noted that in the TWIST pancancer reference samples (which are tumor normal cases) the sensitivity improved compared to the last version, suggesting that the variants that are filtered out are artefacts. One possible explanation is that the UMI consensus collapse performs some degree of error correction, but the primary reason is probably the new tumor normal filter added to the TGA analysis to match the one already implemented in WGS (removing variants if they have 30% of tumor presence in the normal).

  • More SNVs in tumor only TGA analysis

From earlier version of testing:

In the example case run for TGA tumor only, the number of SNVs increased by 125% in the new version after merging the results. This is very different from the tumor only WES case, where the number of variants actually decreased by 3.76%. The difference likely comes from the sequencing depth of these cases: the more deeply covered TGA case makes it possible to detect more low-frequency variants, where the callers' algorithms are more likely to diverge, leading to more unique variants added by TNscope than in the WES case.

I looked at a subset of the extra variants called in this TGA case in IGV to determine whether they seemed like valuable additions, and saw an issue with TNscope calling variants in soft-clipped ends. TNscope has an option for trimming soft-clipped ends, which I will try in order to fix this.
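For readers less familiar with soft clipping: soft-clipped bases are stored in the read sequence but are not part of the alignment, and they are marked with the `S` operation at the ends of the CIGAR string. The Python sketch below, which is illustrative only and not part of the pipeline, counts the soft-clipped bases at each end of a read from its CIGAR string. The function name `soft_clip_lengths` and the example CIGAR strings are hypothetical.

```python
import re

# One CIGAR element: a length followed by an operation code from the SAM format.
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clip_lengths(cigar):
    """Return (leading, trailing) soft-clipped base counts for a CIGAR string."""
    elements = [(int(n), op) for n, op in CIGAR_ELEMENT.findall(cigar)]
    # Hard clips (H) can sit outside soft clips, so ignore them here.
    core = [(n, op) for n, op in elements if op != "H"]
    leading = core[0][0] if core and core[0][1] == "S" else 0
    trailing = core[-1][0] if len(core) > 1 and core[-1][1] == "S" else 0
    return leading, trailing


for cigar in ("100M", "5S90M5S", "3H10S87M"):
    print(cigar, soft_clip_lengths(cigar))
```

Reads with long soft-clipped ends are where one would look in IGV when a call appears to be supported mainly by bases that are not aligned.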

After adding new TNscope option: --trim-soft-clip

With this option, the increase in SNVs in the same tumor only case was 98%. I verified in IGV that the variants previously called in soft-clipped ends were now gone. Still, the number of variants in this case is almost double what it was previously.
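The percentages above are relative changes against the previous version's variant count. As a small worked example in Python, the counts below are hypothetical and chosen only to reproduce the reported percentages; the real counts are in the Google sheet.

```python
def percent_change(old_count, new_count):
    """Relative change from old_count to new_count, in percent."""
    if old_count <= 0:
        raise ValueError("old_count must be positive")
    return (new_count - old_count) / old_count * 100.0


# Hypothetical baseline count, for illustration only.
baseline = 1000
print(f"{percent_change(baseline, 2250):.1f}%")  # 2.25 times the baseline
print(f"{percent_change(baseline, 1980):.1f}%")  # 1.98 times the baseline
```

An increase of 98% therefore corresponds to 1.98 times the previous count, which is why the number of variants is described as almost double.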

I looked in IGV at the extra variants called. Many of them looked like true SNVs, while others were InDels that were more dubious, as they were called in quite repetitive regions. These should, however, be less impactful for the clinicians, as they occur more frequently in introns and intergenic regions and should be filtered out by most filters in Scout.

My guess is that, given the clear increase in sensitivity in the tumor normal TWIST pancancer cases, TNscope likely adds many true positive variants in the tumor only case as well. However, the effect on sensitivity and precision in the tumor only analysis is unclear, as we do not have enough true positive variants to measure it.

I will inform the customers that commonly order myeloid analysis to get their opinion on the increase in number of variants.

horizon FLT3 variants

  • FLT3 variants are detected in all cases (Yes, but they are missing from all TNscope calls and from one Manta call)

horizon SNV and InDels

  • All variants are still detected at reasonable VAFs

TWIST pancancer reference samples

  • Sensitivity and precision are of equal or better quality (better quality across all 3 samples)
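
For reference, sensitivity and precision against a truth set can be computed as in the Python sketch below. This is a generic illustration under the assumption that variants are compared by exact (chrom, pos, ref, alt) match; it is not the evaluation script used for the results in the Google sheet, and the variants shown are made up.

```python
def sensitivity_and_precision(truth, called):
    """Compute sensitivity (recall) and precision of a call set.

    truth and called are iterables of (chrom, pos, ref, alt) tuples.
    """
    truth, called = set(truth), set(called)
    true_pos = len(truth & called)    # expected variants that were called
    false_neg = len(truth - called)   # expected variants that were missed
    false_pos = len(called - truth)   # calls that are not in the truth set
    sensitivity = true_pos / (true_pos + false_neg) if truth else float("nan")
    precision = true_pos / (true_pos + false_pos) if called else float("nan")
    return sensitivity, precision


# Toy truth set and call set, for illustration only.
truth_set = [("1", 100, "A", "G"), ("1", 200, "C", "T"),
             ("2", 300, "G", "A"), ("3", 400, "T", "C")]
call_set = [("1", 100, "A", "G"), ("1", 200, "C", "T"),
            ("2", 300, "G", "A"), ("4", 500, "A", "T")]

sens, prec = sensitivity_and_precision(truth_set, call_set)
print(f"sensitivity={sens:.2f} precision={prec:.2f}")
```

Filtering out artefacts raises precision without lowering sensitivity only if no true variants are removed along with them, which is why both numbers are tracked together.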

SeraCare variants

  • All SeraCare variants are still detected at reasonable VAFs

Myeloid variants

  • All variants are still detected

CNV profile in GENS

Is the CNV profile visualisation in GENS working?

CNV profile links here

  • gmcksolid_4.1_hg19_design.bed with known EGFR duplication observed in CNV profile
  • twistexomecomprehensive 10.2 case with clean CNV profile

CNV profile in GENS with and without using PON

5 PONs have been created in this upstream PR, which has not been fully evaluated:
update_cnvkit_pons: #1465

  • gmcksolid 4.1
  • gmcksolid 4.2
  • gmslymphoid 7.3
  • gmsmyeloid 5.3
  • gmssolid 15.2
  • twistexomecomprehensive 10.2

However, we do not possess a set of cases with known CNVs, and have not validated the CNV analysis in the past. Even with such cases (which would be very nice to have), the best way to evaluate the PONs is still to study the CNV profile by eye and determine whether the profile is easier to interpret with the PON or without.

To save time, I would suggest that we do this evaluation in the validation itself. This is possible for this feature specifically because adding a PON does not impact the code itself. If a PON is deemed to produce noisy results, we can simply not add it to the reference directory.

  • PONs are beneficial to the CNV analysis (To be evaluated in validation step)

Feature Tests

  • N/A
  • Test [Description]
    • [Screenshot]

Pipeline Integrity Tests

  • Report deliver (generation of the .hk file)
    • N/A
    • Verified
  • TGA T/O Workflow
    • N/A
    • Verified
  • TGA T/N Workflow
    • N/A
    • Verified
  • UMI T/O Workflow
    • N/A
    • Verified
  • UMI T/N Workflow
    • N/A
    • Verified
  • WGS T/O Workflow
    • N/A
    • Verified
  • WGS T/N Workflow
    • N/A
    • Verified
  • QC Workflow
    • N/A
    • Verified
  • PON Workflow
    • N/A
    • Verified

Clinical Genomics Stockholm

Documentation

Panel of Normal specific criteria

HOWEVER! We still need approval for 2 of the PoNs #1465

User Changes

  • N/A
  • This PR affects the output files or results.
    • User feedback is considered unnecessary because [Justification].
    • Affected users have been included in the development process and given a chance to provide feedback.

Infrastructure Changes

  • Stored files in Housekeeper
    • N/A
    • Updated: [Link]
  • CG (CLI and delivered/uploaded files)
    • N/A
    • Updated: [Link]
  • Servers (configuration files on Hasta)
    • N/A
    • Updated: [Link]
  • Scout interface
    • N/A
    • Updated: [Link]

Checklist

Important

Ensure that all checkboxes below are ticked before merging.

For Developers

  • PR Description
    • Provided a comprehensive description of the PR.
    • Linked relevant user stories or issues to the PR.
  • Documentation
    • Verified and updated documentation if necessary.
  • Tests
    • Described and tested the functionality addressed in the PR.
    • Ensured integration of the new code with existing workflows.
    • Confirmed that meaningful unit tests were added for the changes introduced.
    • Checked that the PR has successfully passed all relevant code smells and coverage checks.
  • Review
    • Addressed and resolved all the feedback provided during the code review process.
    • Obtained final approval from designated reviewers.

For Reviewers

  • Code
    • Code implements the intended features or fixes the reported issue.
    • Code follows the project's coding standards and style guide.
  • Documentation
    • Pipeline changes are well-documented in the CHANGELOG and relevant documentation.
  • Tests
    • The author provided a description of their manual testing, including consideration of edge cases and boundary
      conditions where applicable, with satisfactory results.
  • Review
    • Confirmed that the developer has addressed all the comments during the code review.

@mathiasbio mathiasbio added this to the Release 16 milestone Sep 2, 2024
@mathiasbio mathiasbio marked this pull request as ready for review September 9, 2024 17:22
@mathiasbio mathiasbio requested a review from a team as a code owner September 9, 2024 17:22
@mathiasbio
Contributor Author

Regarding the extra variants in the tumor only TGA cases, I have emailed our myeloid customer to ask for feedback, as they would be the ones most affected by this change.


sonarcloud bot commented Sep 17, 2024

Labels
None yet
Projects
Status: Review
1 participant