
VCF preprocessing explanation

Boas Pucker edited this page Mar 28, 2019 · 4 revisions

Warning and additional information

This module is not a quality control mechanism; it is needed to remove conflicting data. Currently, entries are ranked by position: the upstream variant is kept, and any downstream variant whose position overlaps an upstream variant is dropped. In theory, a large InDel could therefore cause the SNVs that follow it to be omitted. See the section "Changes and filter mechanics" for details.

Purpose

The purpose of the "VCF preprocessing" module is to correct and simplify the VCF input file. Conflicting data in the input VCF file could lead to false predictions or program failure.

Output

The input VCF file is split into two new VCF files, one for each allele. If the alleles are phased, SnpEff can take this into account during prediction. If the input VCF contains polyploid entries, the preprocessing module should be run several times. For diploid species, the files are named 'first.vcf' and 'second.vcf'. An additional logfile is generated that lists every discarded entry and the reason for its removal. Because of VCF format conventions, positions can change when complex lines are dissolved. See "Changes and filter mechanics" for details.

Changes and filter mechanics

Filtering the VCF file involves two main steps:
The first step dissolves complex lines, which can change positions, and splits the allele entries into two new files. The second step removes all conflicting lines.

Dissolving and splitting

Entries like:
20 1234567 microsat1 GTC G,GTCT
can be problematic. The first allele entry is a deletion of TC, anchored at position 1234567, and the second allele entry is an insertion of T two bases downstream:
20 1234567 microsat1 GTC G
20 1234569 microsat1 C CT
In this case, additional entries at or between 1234567 and 1234569 are a common problem. Additional variants at 1234567 are not an issue for the second allele entry, but a different entry at 1234569 would conflict with it. Something like:
20 1234569 microsat1 C CT
and
20 1234569 microsat1 C CG
could occur. The solution here:
The allele entries are split into two files, and each file also receives every 'normal' entry for further prediction.
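
The dissolving step described above can be sketched in Python. This is a minimal illustration and not the module's actual code: the function name `dissolve` and the tuple layout are hypothetical, and the trimming rule (drop shared trailing bases, then shared leading bases, while keeping at least one base per allele) is an assumption based on common VCF normalization practice. It reproduces the position change from the microsat1 example.

```python
def dissolve(chrom, pos, vid, ref, alts):
    """Split a multi-allelic record into one trimmed record per ALT allele.

    Hypothetical helper: `alts` is the comma-separated ALT column.
    """
    records = []
    for alt in alts.split(","):
        r, a, p = ref, alt, pos
        # Drop shared trailing bases, keeping at least one base per allele.
        while len(r) > 1 and len(a) > 1 and r[-1] == a[-1]:
            r, a = r[:-1], a[:-1]
        # Drop shared leading bases; each dropped base shifts POS by one.
        while len(r) > 1 and len(a) > 1 and r[0] == a[0]:
            r, a, p = r[1:], a[1:], p + 1
        records.append((chrom, p, vid, r, a))
    return records


records = dissolve("20", 1234567, "microsat1", "GTC", "G,GTCT")
for rec in records:
    print(*rec)
```

The first allele keeps its position because the deletion is already in its shortest form, while the insertion moves to 1234569 once the shared leading bases GT are removed.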

Removing conflicting data

Entries like:
20 1234567 microsat1 GTC G
remove the reference positions 1234568 and 1234569. But sometimes there are entries at these positions:
20 1234567 microsat1 GTC G
20 1234568 microsat1 T G
Deleting and substituting the same base at the same position is a conflict. This can happen when VCF files are merged or when inappropriate filters are applied. Solution here:
Take the first entry and remove all conflicting entries on the following lines. Additional information:
Every removal is listed in the logfile.
Deletions can remove a lot of entries, even if these downstream variants have a higher quality.
When the VCF file is sorted by Chr, Pos, and Qual, the best entry comes first and is the one that is kept.
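
The "keep the first entry, drop overlapping followers" rule can be sketched as follows. This is a hedged illustration rather than the module's implementation: the function name `remove_conflicts`, the tuple layout, and the log message are hypothetical, and it assumes that a record occupies the reference span from POS to POS + len(REF) - 1 and that the input is already sorted by chromosome and position.

```python
def remove_conflicts(records):
    """Keep the first record and drop later ones that overlap its REF span.

    Hypothetical helper: `records` are (chrom, pos, ref, alt) tuples,
    already sorted by chromosome and position.
    """
    kept, log = [], []
    last_chrom, last_end = None, 0
    for rec in records:
        chrom, pos, ref, alt = rec
        # Assumption: a record covers POS .. POS + len(REF) - 1.
        if chrom == last_chrom and pos <= last_end:
            log.append((rec, f"overlaps kept entry ending at {chrom}:{last_end}"))
            continue
        kept.append(rec)
        last_chrom, last_end = chrom, pos + len(ref) - 1
    return kept, log


records = [
    ("20", 1234567, "GTC", "G"),  # deletion covering 1234567..1234569
    ("20", 1234568, "T", "G"),    # substitution inside the deleted span
    ("20", 1234570, "A", "C"),    # first position after the deletion
]
kept, log = remove_conflicts(records)
for rec, reason in log:
    print(*rec, reason)
```

Here the substitution at 1234568 is discarded and reported with a reason, which mirrors the role of the logfile, while the entry at 1234570 is kept because it lies outside the deleted span.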