Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A few transcripts without strand infomation in the extended_annotations.gtf #403

Closed
creaturemoon opened this issue Nov 21, 2023 · 4 comments

Comments

@creaturemoon
Copy link

Hi team,
I run multiple samples simultaneously through Bambu

samples <- list.files("./","bam")
se <- bambu(reads = samples, annotations = annotations, genome = fa.file)
writeBambuOutput(se)

And there are a few multi-exon transcripts in novel and known genes lacking chain infomation labed as . in the output file extended_annotations.gtf like this

2 Bambu transcript 17649948 17653242 . . . gene_id "BambuGene1753"; transcript_id "BambuTx385";
2 Bambu exon 17649948 17650501 . . . gene_id "BambuGene1753"; transcript_id "BambuTx385"; exon_number "1";
2 Bambu exon 17650977 17651153 . . . gene_id "BambuGene1753"; transcript_id "BambuTx385"; exon_number "2";
2 Bambu exon 17651279 17651488 . . . gene_id "BambuGene1753"; transcript_id "BambuTx385"; exon_number "3";
2 Bambu exon 17652656 17653120 . . . gene_id "BambuGene1753"; transcript_id "BambuTx385"; exon_number "4";
2 Bambu exon 17653184 17653242 . . . gene_id "BambuGene1753"; transcript_id "BambuTx385";

Using cut -f7 extended_annotations.gtf | sort | uniq -c, I got

126846 -
82 .
123992 +

There is no . in the reference genome annotation file. Can anyone explain why they are marked as .? And how should I handle these transcripts?
Any help is much appreciated!

@andredsim
Copy link
Collaborator

Hi there,

Novel spliced transcripts should be assigned a strand. I tried doing some tests on my side and was not able to replicate the observations you made. Could you please share your sessionInfo() and a minimal testable example (for example extracting just the part of the bam file that includes this novel gene) that reproduces this issue so that I can try resolve it?

Does BambuGene1753 have any other isoforms other than BambuTx385? Or is this a single isoform gene? The other thing that is odd about this output is the final line does not have an exon_number but all exons in extended_annotations.gtf should. Is this just due to it being cut off from pasting it here, or is it missing in the gtf file?

Kind Regards,
Andre Sim

@creaturemoon
Copy link
Author

Hi Andre,

Thanks for your response. BambuGene1753 is a single isoform gene and has no other isoforms other than BambuTx385. And the last line of exon_number "5"; was not added due to my negligence.
The files required for testing are in the link ( https://www.jianguoyun.com/p/DUyf1m0Qm4yVCxjqmakFIAA ), where the bam file is over 100MB. I found that small bam files containing the same location may not generate transcripts without chain information, while larger files may produce this situation. In the test generated gtf file, the following results without chain information were obtained. Both FBgn0031284 and FBgn0051922 have other isoforms with strand information.

2L Bambu transcript 818075 821307 . . . gene_id "FBgn0031284"; transcript_id "BambuTx2";
2L Bambu exon 818075 818643 . . . gene_id "FBgn0031284"; transcript_id "BambuTx2"; exon_number "1";
2L Bambu exon 818704 819438 . . . gene_id "FBgn0031284"; transcript_id "BambuTx2"; exon_number "2";
2L Bambu exon 820030 821307 . . . gene_id "FBgn0031284"; transcript_id "BambuTx2"; exon_number "3";
2L Bambu transcript 1167147 1170438 . . . gene_id "FBgn0051922"; transcript_id "BambuTx9";
2L Bambu exon 1167147 1167954 . . . gene_id "FBgn0051922"; transcript_id "BambuTx9"; exon_number "1";
2L Bambu exon 1168012 1168637 . . . gene_id "FBgn0051922"; transcript_id "BambuTx9"; exon_number "2";
2L Bambu exon 1168690 1169576 . . . gene_id "FBgn0051922"; transcript_id "BambuTx9"; exon_number "3";
2L Bambu exon 1169643 1169824 . . . gene_id "FBgn0051922"; transcript_id "BambuTx9"; exon_number "4";
2L Bambu exon 1170228 1170438 . . . gene_id "FBgn0051922"; transcript_id "BambuTx9"; exon_number "5";

And my sessionInfo() is,

R version 4.3.2 (2023-10-31)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /nfs_data/liumy/00.software/mambaforge/envs/bambu3/lib/libopenblasp-r0.3.25.so; LAPACK version 3.11.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=zh_CN.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=zh_CN.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=zh_CN.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=zh_CN.UTF-8 LC_IDENTIFICATION=C
time zone: Asia/Shanghai
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] bambu_3.2.4 BSgenome_1.68.0
[3] rtracklayer_1.60.0 Biostrings_2.68.1
[5] XVector_0.40.0 SummarizedExperiment_1.30.2
[7] Biobase_2.60.0 GenomicRanges_1.52.0
[9] GenomeInfoDb_1.36.1 IRanges_2.34.1
[11] S4Vectors_0.38.1 BiocGenerics_0.46.0
[13] MatrixGenerics_1.12.2 matrixStats_1.1.0
loaded via a namespace (and not attached):
[1] KEGGREST_1.40.0 rjson_0.2.21 lattice_0.22-5
[4] vctrs_0.6.5 tools_4.3.2 bitops_1.0-7
[7] generics_0.1.3 curl_5.1.0 parallel_4.3.2
[10] tibble_3.2.1 fansi_1.0.5 AnnotationDbi_1.62.2
[13] RSQLite_2.3.3 blob_1.2.4 pkgconfig_2.0.3
[16] Matrix_1.6-4 data.table_1.14.8 dbplyr_2.4.0
[19] lifecycle_1.0.4 GenomeInfoDbData_1.2.10 compiler_4.3.2
[22] stringr_1.5.1 Rsamtools_2.16.0 progress_1.2.2
[25] codetools_0.2-19 RCurl_1.98-1.13 yaml_2.3.7
[28] tidyr_1.3.0 pillar_1.9.0 crayon_1.5.2
[31] BiocParallel_1.34.2 DelayedArray_0.26.6 cachem_1.0.8
[34] tidyselect_1.2.0 digest_0.6.33 stringi_1.8.2
[37] purrr_1.0.2 dplyr_1.1.4 restfulr_0.0.15
[40] biomaRt_2.56.1 fastmap_1.1.1 grid_4.3.2
[43] cli_3.6.1 magrittr_2.0.3 S4Arrays_1.0.4
[46] GenomicFeatures_1.52.1 utf8_1.2.4 XML_3.99-0.16
[49] withr_2.5.2 rappdirs_0.3.3 filelock_1.0.2
[52] prettyunits_1.2.0 xgboost_1.7.6.1 bit64_4.0.5
[55] httr_1.4.7 bit_4.0.5 png_0.1-8
[58] hms_1.1.3 memoise_2.0.1 BiocIO_1.10.0
[61] BiocFileCache_2.8.0 rlang_1.1.2 Rcpp_1.0.11
[64] glue_1.6.2 DBI_1.1.3 BiocManager_1.30.22
[67] xml2_1.3.5 jsonlite_1.8.7 R6_2.5.1
[70] GenomicAlignments_1.36.0 zlibbioc_1.46.0

Many thanks!

@andredsim
Copy link
Collaborator

Hi,

Thanks for providing me these files. I had a more in-depth look at BambuTx2 and for this particular transcript there are two splice junctions, the first has the "CT-AC" motif and the second "GC-AG". The first motif is associated with canonical negative strand transcripts and the second with positive. As the reads in unstranded protocols themselves provide no strand information we attempt to guess them using splice junctions, using the direction which has the most junctions supporting them. Earlier I wrongly said that novel spliced transcripts should be assigned a strand, as in this case there is equal support for it being from the positive and negative strand which is why it is assigned '*' as we cannot be certain of the strand.

Because of this ambiguity in stranding you may want to remove these transcripts from the reference annotations depending on what you plan to do with these annotations downstream. However there does appear to be convincing read support for these junctions, therefore if you were interested in this gene locus, experimental validation would be recommended to confirm what was seen with Nanopore sequencing.

I hope this explains why there are spliced transcripts labeled with '*' and hopefully gives you some guidance on how to proceed.
Let me know if you have further questions.

Kind Regards,
Andre Sim

@creaturemoon
Copy link
Author

Hi Andre,

Thanks very much. Your explanation is very helpful to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants