Address RefSeq transcript misalignments #447

Closed
holtgrewe opened this issue May 3, 2019 · 5 comments

@holtgrewe
Member

holtgrewe commented May 3, 2019

RefSeq transcripts can align with indels and mismatches to the reference sequence. While mismatches could be argued to be non-critical (assuming the GenBank entries that the RefSeq transcript is based on are from healthy individuals), indels cannot.

For hg19, 884 transcripts in 501 genes are affected.

The following solution will be implemented:

  • The default_sources.ini file gets two new settings, "fixIndels" and "fixIndelsUcsc".
  • When parsing the RefSeq transcript database, the Note attribute is analyzed.
    If it contains the substrings "indel" or "substitution" then this is recorded into the built TranscriptModel.
  • When fixIndels=true is given, the user must also provide the path to the reference sequence.
  • The file at fixIndelsUcsc is used for providing the UCSC transcript alignments.
    This will be used for the exon and CDS information.
    The sequence will be taken from the reference.
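
The Note-based detection described in the list above could be sketched roughly as follows. This is a minimal Python illustration only, not Jannovar's actual (Java) implementation; the function names and the sample attribute string are hypothetical, and only the substring check on "indel" and "substitution" comes from the plan above.

```python
from urllib.parse import unquote


def parse_gff3_attributes(column9):
    """Split a GFF3 attribute column (key=value pairs joined by ';') into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            # GFF3 percent-encodes reserved characters inside values.
            attrs[key.strip()] = unquote(value.strip())
    return attrs


def note_flags_misalignment(attrs):
    """Return True if the Note attribute mentions an indel or a substitution."""
    note = attrs.get("Note", "").lower()
    return "indel" in note or "substitution" in note


# Hypothetical attribute column, for illustration only.
sample = "ID=rna1;Note=The RefSeq transcript has 1 non-frameshifting indel"
flagged = note_flags_misalignment(parse_gff3_attributes(sample))
```

The resulting flag is what would be recorded on the built transcript model; records without a Note attribute are simply treated as unflagged in this sketch.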

NB: This will create an incompatibility between the databases built before and after Jannovar v0.29.

For each hg*/refseq* entry, a _fixindel variant is added that contains these fixed transcripts. This way, the fixed transcripts are strictly opt-in and only supplement those where the indel is not fixed. Variants for both can be reported.

@holtgrewe
Member Author

This approach for the resolution will work for now. A proper resolution will be done in #450.

@pnrobinson @julesjacobsen as ENSEMBL is not affected this will probably not be of super big interest to you

@roland-ewald I think this might be important for you to know

A good example is LTBP4:

@holtgrewe holtgrewe reopened this May 7, 2019
@roland-ewald
Contributor

@holtgrewe Thank you for the heads-up, I'm aware of the issue. BTW, besides the Note attribute there is also the Gap attribute in the GFF file, which gives the exact position and structure of the mismatch between reference genome and transcript (for example, Gap=M123 I1 M456 means that after 123 matching nt the transcript has one additional nucleotide that is not in the reference genome, followed by 456 more matching nt). It might be safer to use this attribute for 'indel-detection' instead of relying on the Note attribute.
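
As a rough illustration of using the Gap attribute for this, the value can be split into (operation, length) pairs and checked for I or D operations. The sketch below is in Python and is not taken from Jannovar; the function names are hypothetical, and it assumes the operation letters defined by the GFF3 specification (M, I, D, F, R).

```python
import re

# One Gap operation: a letter from the GFF3 set followed by a length.
_GAP_OP = re.compile(r"([MIDFR])(\d+)")


def parse_gap(gap):
    """Parse a Gap value such as 'M123 I1 M456' into [(op, length), ...]."""
    ops = []
    for token in gap.split():
        match = _GAP_OP.fullmatch(token)
        if match is None:
            raise ValueError("unrecognized Gap operation: %r" % token)
        ops.append((match.group(1), int(match.group(2))))
    return ops


def has_indel(ops):
    """True if any operation inserts a gap on either side (I or D)."""
    return any(op in ("I", "D") for op, _ in ops)


ops = parse_gap("M123 I1 M456")
```

For the example above, `ops` holds three operations and `has_indel(ops)` is true, while an alignment consisting only of M operations would not be flagged.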

@holtgrewe
Member Author

ohhhh! thanks for this info

@roland-ewald
Contributor

Sure, you're welcome!

@holtgrewe
Member Author

After looking into this some more, I think the correct way is to use the alignments from the RefSeq file that are stored separately from the exons themselves. It is important to capture both the indels within the local alignments (the Gap attribute) and the boundaries of the alignments (the Target attribute).

For example, NM_014747.2 does not have any record with a Gap attribute, but the first alignment is Target=NM_014747.2 5 263 +, that is, the first 4 bases need to be clipped from the transcript sequence.
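
The clipping arithmetic in this example can be sketched as below. This is an illustrative Python snippet with hypothetical function names, not Jannovar code; it assumes the Target value has the form "id start end [strand]" with 1-based inclusive coordinates, as in the GFF3 specification.

```python
def parse_target(target):
    """Parse a Target value into (transcript_id, start, end, strand)."""
    parts = target.split()
    if len(parts) not in (3, 4):
        raise ValueError("unexpected Target value: %r" % target)
    strand = parts[3] if len(parts) == 4 else None
    return parts[0], int(parts[1]), int(parts[2]), strand


def leading_clip(first_target_start):
    """Number of transcript bases before the first aligned base (1-based start)."""
    return first_target_start - 1


transcript_id, start, end, strand = parse_target("NM_014747.2 5 263 +")
clipped = leading_clip(start)
```

With a start of 5, `clipped` is 4, matching the 4 bases mentioned above. A trailing clip would need the full transcript length, which the Target attribute alone does not provide.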

@holtgrewe holtgrewe modified the milestones: 0.29, 0.31 May 21, 2019
This issue was closed.