Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How can I apply downstream filtering steps to the cleaned VCF to further control the false discovery rate? #615

Open
elitefugua opened this issue Oct 30, 2023 · 1 comment
Labels
question Further information is requested

Comments

@elitefugua
Copy link

How can I apply downstream filtering steps (minGQ filtering, FilterOutlierSamples, BatchEffect, and FilterCleanupQualRecalibration) to the cleaned VCF to further control the false discovery rate?

@mwalker174
Copy link
Collaborator

MinGQ filtering is being deprecated in favor of the FilterGenotypes workflow. We are still working on documentation but here is an outline:

Main steps:

  • JoinRawCalls
  • FormatVCFForGatk (if needed)
  • SVConcordance
  • FilterGenotypes

Instructions:

Generate inputs and run JoinRawCalls from the main branch to produce a clustered VCF of raw calls for the entire cohort.

Note: the input VCFs here use svtk-style END fields, which need to be modified no matter what version of the pipeline was used, so make sure FormatVcfForGatk.formatter_args is “--fix-end”

If your “cleaned” VCF was generated prior to v0.22-beta, you must run FormatVcfForGatk to produce a GATK-formatted cleaned VCF. If you used a later version of the pipeline, you can skip this step.

IMPORTANT: The "formatter_args" parameter may need to be modified depending on exactly what version of the pipeline was used.

If your cleaned VCF has invalid END tags, you must include the "--fix-end" flag. You can check to see if there are any END fields less than the POS of the record (e.g. for BND records).

If your cleaned VCF has GQ values are scaled to 0-999, you must use the "--scale-down-gq" flag to rescale to 0-99. If you’re unsure, you should be able to tell by looking at the VCF.

Run SVConcordance with the GATK-formatted cleaned VCF as "eval_vcf" and JoinRawCalls VCF as "truth_vcf".

Run FilterGenotypes with the SVConcordance VCF

The “ploidy_table” is generated during JoinRawCalls but wasn’t included as an output for the v0.28.4-beta release. If you kept intermediate files from JoinRawCalls, you can use the output from the CreatePloidyTableFromPed task. Otherwise, you will need to generate the ploidy table manually using this script: https://github.com/broadinstitute/gatk-sv/blob/main/src/sv-pipeline/scripts/ploidy_table_from_ped.py

Recommended default "no_call_rate_cutoff" is 0.05 but can be adjusted

Modify "sl_filter_args":

High specificity (5% DNR): "--small-del-threshold 93 --medium-del-threshold 85 --small-dup-threshold -51 --medium-dup-threshold -4 --ins-threshold -13"
High sensitivity (15% DNR): "--small-del-threshold -53 --medium-del-threshold -12 --small-dup-threshold -105 --medium-dup-threshold -81 --ins-threshold -97"

Note these thresholds were generated for the first AoU-SV release, which had relatively little technical noise. You may want to iterate on these thresholds for your dataset until results are satisfactory.

Review the MainVcfQc plots generated with this workflow and compare them to the “Structural Variants QC Results” section of the AoU-SV quality report: https://support.researchallofus.org/hc/en-us/articles/4617899955092-All-of-Us-Genomic-Quality-Report.

We recommend using de novo rates as a proxy for your FDR. Plots of this are generated in MainVcfQc if you have at least one trio.

7,000-11,000 are typical counts for total (non-BND) SVs per genome.

Higher thresholds will increase precision.

For larger call sets, you can increase the amount of sharding with RecalibrateGq.recalibrate_records_per_shard and filter_vcf_records_per_shard depending on which parts are slow. Defaults for both are 20,000. It is recommended to try to keep the total number of shards under 1000.

@mwalker174 mwalker174 added the question Further information is requested label Jun 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants