Address RefSeq transcript misalignments #447

holtgrewe · 2019-05-03T14:07:03Z

RefSeq transcripts can align to the reference sequence with indels and mismatches. While mismatches could be argued to be non-critical (assuming the GenBank entries that the RefSeq transcript is based on are from healthy individuals), indels cannot.

For hg19, 884 transcripts in 501 genes are affected.

The following solution will be implemented:

The default_sources.ini file gets two new settings, "fixIndels" and "fixIndelsUcsc".
When parsing the RefSeq transcript database, the Note attribute is analyzed.
If it contains the substrings "indel" or "substitution" then this is recorded into the built TranscriptModel.
When fixIndels=true is given then the user also has to provide the path to the reference sequence.
The file at fixIndelsUcsc is used for providing the UCSC transcript alignments.
This will be used for the exon and CDS information.
The sequence will be taken from the reference.
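
The Note check described above can be sketched as follows. This is only an illustration in Python, not the Jannovar code; the function names and the example attribute strings are made up, and matching case-insensitively is an assumption.

```python
from urllib.parse import unquote


def parse_gff3_attributes(column9):
    """Split a GFF3 attribute column into a dict; values are percent-decoded."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = unquote(value.strip())
    return attrs


def note_flags_misalignment(column9):
    """Return True if the Note attribute mentions "indel" or "substitution".

    Assumption: the comparison is case-insensitive.
    """
    note = parse_gff3_attributes(column9).get("Note", "").lower()
    return "indel" in note or "substitution" in note


# Hypothetical attribute columns, for illustration only.
flagged = "ID=rna1;Note=The RefSeq transcript has 1 non-frameshifting indel"
clean = "ID=rna2;gbkey=mRNA"
print(note_flags_misalignment(flagged), note_flags_misalignment(clean))
```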

NB: This will create an incompatibility between the databases built before and after Jannovar v0.29.

For each hg*/refseq* entry, a _fixindel variant is added that contains these fixed transcripts. This way, the fixed transcripts are strictly opt-in and only supplement those where the indel is not fixed. Variants for both can be reported.


holtgrewe · 2019-05-07T07:09:05Z

This approach for the resolution will work for now. A proper resolution will be done in #450.

@pnrobinson @julesjacobsen as ENSEMBL is not affected this will probably not be of super big interest to you

@roland-ewald I think this might be important for you to know

A good example is LTBP4:

UCSC
ENSEMBL

roland-ewald · 2019-05-07T20:26:38Z

@holtgrewe Thank you for the heads-up, I'm aware of the issue. BTW besides the Note attribute there is also the Gap attribute in the GFF file which gives the exact position and structure of the mismatch between reference genome and transcript (for example, Gap=M123 I1 M456 means after 123 matching nt the GFF region has one additional nucleotide that is not in the reference genome). It might be safer to use this attribute for 'indel-detection' instead of relying on the Note attribute.
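
A minimal sketch of reading the Gap attribute, assuming only the M (match), I (insertion relative to the reference) and D (deletion relative to the reference) operations occur; the GFF3 specification defines further operations that are not handled here. The function names are hypothetical.

```python
import re

# One Gap operation: a letter followed by a length, e.g. "M123".
GAP_OP = re.compile(r"^([MID])(\d+)$")


def parse_gap(gap):
    """Parse a Gap value such as "M123 I1 M456" into (op, length) pairs."""
    ops = []
    for token in gap.split():
        match = GAP_OP.match(token)
        if match is None:
            raise ValueError("unsupported Gap operation: %r" % token)
        ops.append((match.group(1), int(match.group(2))))
    return ops


def has_indel(gap):
    """True if the alignment contains at least one I or D operation."""
    return any(op in ("I", "D") for op, _ in parse_gap(gap))


def aligned_lengths(gap):
    """Return (reference_length, transcript_length) covered by the alignment.

    M consumes both sequences, I only the transcript, D only the reference.
    """
    ops = parse_gap(gap)
    ref_len = sum(n for op, n in ops if op in ("M", "D"))
    tx_len = sum(n for op, n in ops if op in ("M", "I"))
    return ref_len, tx_len


print(parse_gap("M123 I1 M456"))
print(aligned_lengths("M123 I1 M456"))
```

For the example above, the transcript side is one nucleotide longer than the reference side, which is the extra nucleotide that is not in the reference genome.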

holtgrewe · 2019-05-07T20:37:24Z

ohhhh! thanks for this info

roland-ewald · 2019-05-07T20:42:24Z

Sure, you're welcome!

holtgrewe · 2019-05-08T03:25:10Z

After looking into this some more, I think the correct way is to use the alignments from the RefSeq file that are stored separately from the exons themselves. It is important to capture both the indels within the local alignments (the Gap attribute) and the boundaries of the alignments (the Target attribute).

For example, NM_014747.2 does not have any record with a Gap attribute, but the first alignment is Target=NM_014747.2 5 263 +, that is, the first 4 bases need to be clipped from the transcript sequence.
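
A rough sketch of how the clipping could be derived from the Target attribute, assuming the value has the form "name start end [strand]" with 1-based, inclusive coordinates on the transcript. The function names are hypothetical and this is not the Jannovar implementation.

```python
def parse_target(target):
    """Parse a Target value such as "NM_014747.2 5 263 +"."""
    parts = target.split()
    if len(parts) not in (3, 4):
        raise ValueError("unexpected Target value: %r" % target)
    name, start, end = parts[0], int(parts[1]), int(parts[2])
    strand = parts[3] if len(parts) == 4 else None
    return name, start, end, strand


def leading_clip(target):
    """Number of transcript bases before the first aligned base."""
    _, start, _, _ = parse_target(target)
    return start - 1


# The alignment starts at transcript position 5, so 4 bases are unaligned.
print(leading_clip("NM_014747.2 5 263 +"))  # 4
```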

This was referenced May 3, 2019

NullPointerException when annotating CNVs #444

Closed

Resolving indel fixes. #448

Merged

Use correct alignments of RefSeq transcripts #450

Closed

holtgrewe added this to the 0.29 milestone May 7, 2019

holtgrewe closed this as completed May 7, 2019

holtgrewe reopened this May 7, 2019

holtgrewe modified the milestones: 0.29, 0.31 May 21, 2019

holtgrewe closed this as completed May 23, 2019
