Era7 E. coli LB226692 annotation with BG7 system
- who Raquel Tobes & Marina Manrique (Oh no sequences! team at Era7)
- what automatic annotation with [BG7 system](http://www.slideshare.net/marina_manrique/bg7-a-new-system-for-bacterial-genome-annotation-designed-for-ngs-data) of the E. coli LB226692 genome (see the Life Tech & University of Muenster mapping assembly at assemblies)
- date 8-Jun-2011
The BG7 system, developed by Oh no sequences!, was used to produce this annotation.
BG7 is an annotation system specifically designed to handle NGS data. One of its most important features is that ORFs are predicted by searching for protein similarity: we first search for the reference proteins in the contigs and then define the ORFs. We preserve all the CDSs found (even if they lack canonical start or stop codons, and even if they contain frameshifts or internal stop codons), so the system is quite robust to NGS errors that may cause the loss of start/stop signals or changes in the reading frame.
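To make the categories above concrete, here is a minimal sketch of how a CDS nucleotide sequence could be checked for a canonical start codon, a canonical stop codon, and internal stop codons. It is not BG7 code: the function name `classify_cds`, the choice of start codons, and the use of "length not a multiple of 3" as a rough hint of a frameshift are all assumptions made for illustration.

```python
# Illustrative only: hypothetical helper, not part of the BG7 system.
START_CODONS = {"ATG", "GTG", "TTG"}  # assumption: common bacterial start codons
STOP_CODONS = {"TAA", "TAG", "TGA"}   # standard stop codons


def classify_cds(seq: str) -> dict:
    """Report simple start/stop/internal-stop flags for a CDS sequence."""
    seq = seq.upper()
    # Split into complete codons, ignoring any trailing 1-2 bases.
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    return {
        "canonical_start": bool(codons) and codons[0] in START_CODONS,
        "canonical_stop": bool(codons) and codons[-1] in STOP_CODONS,
        # Stop codons anywhere before the last complete codon.
        "internal_stops": sum(c in STOP_CODONS for c in codons[:-1]),
        # A length that is not a multiple of 3 is only a rough hint of a frameshift.
        "length_multiple_of_3": len(seq) % 3 == 0,
    }


if __name__ == "__main__":
    print(classify_cds("ATGAAATAA"))     # well-formed toy CDS
    print(classify_cds("ATGTAAAAATGA"))  # toy CDS with one internal stop
```

A real frameshift can only be located by comparing the CDS with the similar reference protein, which this sketch does not attempt.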
These are the datasets we used in the annotation
We used the Life Tech - University of Muenster hybrid mapping assembly. More details on the assembly method are available at assemblies. You can get the assembly data from the repo: https://github.com/ehec-outbreak-crowdsourced/BGI-data-analysis/blob/master/strains/LB226692/seqProject/LifeTech-MuensterUni/assemblies/LifeTech-MuensterUni/wgs.AFOB.1.fsa_nt.gz
As we did in the two annotations of E. coli TY-2482, we have again taken as reference proteins a set of 137,063 proteins that includes:
- The representative Uniprot proteins corresponding to all Uniref90 clusters for all Escherichia coli proteins
- All Uniprot proteins from organisms whose names include the terms “EHEC” or “EAEC”
- All Uniprot proteins from bacteria that have in any Uniprot field the term “toxin”
- All Uniprot proteins from bacteria that have in any Uniprot field “hemolysin”
- All the proteins from Salmonella typhi, Yersinia pestis and Shigella dysenteriae
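The selection criteria listed above can be read as a single membership test. The sketch below expresses them as a predicate over a simplified protein record. The `ProteinRecord` class, its fields, and the case-sensitivity choices are hypothetical and only illustrate the logic; they do not reflect how the reference set was actually built or how Uniprot exposes its data.

```python
from dataclasses import dataclass


@dataclass
class ProteinRecord:
    """Hypothetical, simplified view of a Uniprot entry (illustration only)."""
    accession: str
    organism: str
    is_bacterial: bool
    annotation_text: str  # assumption: all text fields concatenated
    is_ecoli_uniref90_representative: bool = False


STRAIN_TERMS = ("EHEC", "EAEC")
KEYWORD_TERMS = ("toxin", "hemolysin")
WHOLE_PROTEOME_ORGANISMS = (
    "Salmonella typhi",
    "Yersinia pestis",
    "Shigella dysenteriae",
)


def is_reference_protein(rec: ProteinRecord) -> bool:
    """Return True if the record meets any of the five listed criteria."""
    if rec.is_ecoli_uniref90_representative:
        return True
    # Assumption: strain terms are matched case-sensitively in the organism name.
    if any(term in rec.organism for term in STRAIN_TERMS):
        return True
    # Assumption: keywords are matched case-insensitively in any text field.
    text = rec.annotation_text.lower()
    if rec.is_bacterial and any(keyword in text for keyword in KEYWORD_TERMS):
        return True
    return any(rec.organism.startswith(name) for name in WHOLE_PROTEOME_ORGANISMS)


if __name__ == "__main__":
    example = ProteinRecord("X00001", "Yersinia pestis", True, "Putative transporter")
    print(is_reference_protein(example))
```

Because the criteria overlap (for example, a toxin from Shigella dysenteriae matches two of them), the predicate returns a single yes/no answer so that each protein is counted once.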
You can get the annotation result files from the repo: https://github.com/ehec-outbreak-crowdsourced/BGI-data-analysis/tree/master/strains/LB226692/seqProject/LifeTech-MuensterUni/annotations/era7bioinformatics
We have predicted 6,302 genes
- 6,132 protein encoding genes
- 170 RNA genes (rRNA and tRNA)
4,504 out of the 6,132 (73.45%) protein encoding genes have canonical start and stop codons and have neither frameshifts nor intragenic stop codons.
1,125 out of the 6,132 (18.35%) protein encoding genes have frameshifts or intragenic stop codons in their sequences, probably caused by errors inherent to the sequencing technology. However, our system is tolerant to the errors of massive sequencing technologies and has been able to detect this rich set of genes even from very preliminary sequencing results.
Some of the detected proteins are probably fragmented and may appear as two separate predicted genes when their coding sequence is split across the ends of different contigs.
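The gene counts reported above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers given in this page; the variable names are ours, percentages are rounded to two decimals (so the last digit can differ from a truncated figure), and the meaning attached to the remaining genes is an inference, not something stated by the annotation.

```python
# Counts taken from the annotation summary on this page.
total_genes = 6302
protein_genes = 6132
rna_genes = 170
clean_genes = 4504    # canonical start/stop, no frameshifts or intragenic stops
error_genes = 1125    # with frameshifts or intragenic stop codons


def share(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, rounded to two decimals."""
    return round(100 * part / whole, 2)


# Protein encoding genes plus RNA genes should add up to the total.
assert protein_genes + rna_genes == total_genes

clean_pct = share(clean_genes, protein_genes)
error_pct = share(error_genes, protein_genes)

# Genes in neither category; presumably those lacking a canonical start or
# stop codon but without frameshifts or intragenic stops (our inference).
other_genes = protein_genes - clean_genes - error_genes

if __name__ == "__main__":
    print(f"clean: {clean_pct}%  with errors: {error_pct}%  other: {other_genes}")
```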