Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect codon files generated by scikit-ribo-build for mm10 #7

Open
nabeel-bioinfo opened this issue Jul 26, 2018 · 1 comment
Open

Comments

@nabeel-bioinfo
Copy link

nabeel-bioinfo commented Jul 26, 2018

I have been trying to use scikit-ribo on mouse ESC Ribo-Seq data but the output files generated from scikit-ribo-build do no correctly parse the given input GFF file. I am using the mm10_knownGene GFF file as input with genome file obtained from http://hgdownload.cse.ucsc.edu/goldenPath/mm10/chromosomes/.
Specifically, there is inaccurate assignment of codons of transcripts.

For example, for gene uc008cjg.1, the start codon is at chr17: 35916361-35916363 while in the files mm10.codons.df and mm10.codons.bed, this chromosome location is at codon 2
$ cat mm10.codons.bed | grep 'uc008cjg.1' | head -12

chr17	35916390	35916393	uc008cjg.1	-8	-	CGG
chr17	35916387	35916390	uc008cjg.1	-7	-	CAG
chr17	35916384	35916387	uc008cjg.1	-6	-	GCC
chr17	35916381	35916384	uc008cjg.1	-5	-	GTA
chr17	35916378	35916381	uc008cjg.1	-4	-	CGT
chr17	35916375	35916378	uc008cjg.1	-3	-	CCG
chr17	35916372	35916375	uc008cjg.1	-2	-	GTG
chr17	35916369	35916372	uc008cjg.1	-1	-	TGC
chr17	35916366	35916369	uc008cjg.1	0	-	GTC
chr17	35916363	35916366	uc008cjg.1	1	-	AAG
chr17	35916360	35916363	uc008cjg.1	2	-	ATG
chr17	35916357	35916360	uc008cjg.1	3	-	GCT

Another example where the codon assignments are entirely different from the reference is uc007vnb.1. The start codon is at chr15:37003817-37003820 but the codon file has very different annotations.

$cat mm10.codons.bed | grep 'uc007vnb.1' | head -20

chr15	37007423	37007426	uc007vnb.1	-8	-	CCG
chr15	37007420	37007423	uc007vnb.1	-7	-	CAG
chr15	37007417	37007420	uc007vnb.1	-6	-	GAA
chr15	37007414	37007417	uc007vnb.1	-5	-	GGT
chr15	37007411	37007414	uc007vnb.1	-4	-	GAG
chr15	37007408	37007411	uc007vnb.1	-3	-	GGG
chr15	37007405	37007408	uc007vnb.1	-2	-	CGG
chr15	37007402	37007405	uc007vnb.1	-1	-	GAC
chr15	37007399	37007402	uc007vnb.1	0	-	CGC
chr15	37007396	37007399	uc007vnb.1	1	-	GCG
chr15	37007393	37007396	uc007vnb.1	2	-	GGC
chr15	37007390	37007393	uc007vnb.1	3	-	AAG

Assuming that codon 0 is the start codon in this file, I found that only 3380(7%) of 46355 transcripts have ATG as start codon

cat mm10.codons.bed | awk '{if($5==0)print $5,$7}' | sort | uniq -c

727 0 AAA
    436 0 AAC
    748 0 AAG
    376 0 AAT
    686 0 ACA
    473 0 ACC
    295 0 ACG
    854 0 ACT
   1768 0 AGA
   1212 0 AGC
   1207 0 AGG
   1724 0 AGT
    377 0 ATA
    522 0 ATC
   **3380 0 ATG**
    704 0 ATT
    228 0 CAA
    377 0 CAC
    493 0 CAG
    167 0 CAT
    400 0 CCA
    569 0 CCC
    399 0 CCG
    477 0 CCT
    188 0 CGA
    602 0 CGC
    637 0 CGG
    136 0 CGT
    202 0 CTA
    831 0 CTC
    625 0 CTG
    594 0 CTT
    990 0 GAA
    779 0 GAC
   1888 0 GAG
    585 0 GAT
    921 0 GCA
   1177 0 GCC
   1318 0 GCG
    928 0 GCT
   1856 0 GGA
   1691 0 GGC
   2329 0 GGG
   1037 0 GGT
    523 0 GTA
   1015 0 GTC
   1304 0 GTG
    916 0 GTT
     90 0 TAA
     88 0 TAC
    184 0 TAG
    115 0 TAT
    284 0 TCA
    401 0 TCC
    162 0 TCG
    386 0 TCT
    285 0 TGA
    392 0 TGC
    460 0 TGG
    270 0 TGT
    155 0 TTA
    506 0 TTC
    274 0 TTG
    632 0 TTT

I hope this bug can be fixed so that scikit-ribo can be run on mm10 data.

@mschatz
Copy link
Collaborator

mschatz commented Aug 3, 2018 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants