A few transcripts without strand infomation in the extended_annotations.gtf #403

creaturemoon · 2023-11-21T15:12:20Z

Hi team,
I run multiple samples simultaneously through Bambu

samples <- list.files("./","bam")
se <- bambu(reads = samples, annotations = annotations, genome = fa.file)
writeBambuOutput(se)

And there are a few multi-exon transcripts in novel and known genes lacking chain infomation labed as . in the output file extended_annotations.gtf like this

2 Bambu transcript 17649948 17653242 . . . gene_id "BambuGene1753"; transcript_id "BambuTx385";
2 Bambu exon 17649948 17650501 . . . gene_id "BambuGene1753"; transcript_id "BambuTx385"; exon_number "1";
2 Bambu exon 17650977 17651153 . . . gene_id "BambuGene1753"; transcript_id "BambuTx385"; exon_number "2";
2 Bambu exon 17651279 17651488 . . . gene_id "BambuGene1753"; transcript_id "BambuTx385"; exon_number "3";
2 Bambu exon 17652656 17653120 . . . gene_id "BambuGene1753"; transcript_id "BambuTx385"; exon_number "4";
2 Bambu exon 17653184 17653242 . . . gene_id "BambuGene1753"; transcript_id "BambuTx385";

Using cut -f7 extended_annotations.gtf | sort | uniq -c, I got

126846 -
82 .
123992 +

There is no . in the reference genome annotation file. Can anyone explain why they are marked as .? And how should I handle these transcripts?
Any help is much appreciated!

The text was updated successfully, but these errors were encountered:

andredsim · 2023-11-27T02:36:54Z

Hi there,

Novel spliced transcripts should be assigned a strand. I tried doing some tests on my side and was not able to replicate the observations you made. Could you please share your sessionInfo() and a minimal testable example (for example extracting just the part of the bam file that includes this novel gene) that reproduces this issue so that I can try resolve it?

Does BambuGene1753 have any other isoforms other than BambuTx385? Or is this a single isoform gene? The other thing that is odd about this output is the final line does not have an exon_number but all exons in extended_annotations.gtf should. Is this just due to it being cut off from pasting it here, or is it missing in the gtf file?

Kind Regards,
Andre Sim

creaturemoon · 2023-12-03T07:04:21Z

Hi Andre,

Thanks for your response. BambuGene1753 is a single isoform gene and has no other isoforms other than BambuTx385. And the last line of exon_number "5"; was not added due to my negligence.
The files required for testing are in the link ( https://www.jianguoyun.com/p/DUyf1m0Qm4yVCxjqmakFIAA ), where the bam file is over 100MB. I found that small bam files containing the same location may not generate transcripts without chain information, while larger files may produce this situation. In the test generated gtf file, the following results without chain information were obtained. Both FBgn0031284 and FBgn0051922 have other isoforms with strand information.

2L Bambu transcript 818075 821307 . . . gene_id "FBgn0031284"; transcript_id "BambuTx2";
2L Bambu exon 818075 818643 . . . gene_id "FBgn0031284"; transcript_id "BambuTx2"; exon_number "1";
2L Bambu exon 818704 819438 . . . gene_id "FBgn0031284"; transcript_id "BambuTx2"; exon_number "2";
2L Bambu exon 820030 821307 . . . gene_id "FBgn0031284"; transcript_id "BambuTx2"; exon_number "3";
2L Bambu transcript 1167147 1170438 . . . gene_id "FBgn0051922"; transcript_id "BambuTx9";
2L Bambu exon 1167147 1167954 . . . gene_id "FBgn0051922"; transcript_id "BambuTx9"; exon_number "1";
2L Bambu exon 1168012 1168637 . . . gene_id "FBgn0051922"; transcript_id "BambuTx9"; exon_number "2";
2L Bambu exon 1168690 1169576 . . . gene_id "FBgn0051922"; transcript_id "BambuTx9"; exon_number "3";
2L Bambu exon 1169643 1169824 . . . gene_id "FBgn0051922"; transcript_id "BambuTx9"; exon_number "4";
2L Bambu exon 1170228 1170438 . . . gene_id "FBgn0051922"; transcript_id "BambuTx9"; exon_number "5";

And my sessionInfo() is,

R version 4.3.2 (2023-10-31)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /nfs_data/liumy/00.software/mambaforge/envs/bambu3/lib/libopenblasp-r0.3.25.so; LAPACK version 3.11.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=zh_CN.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=zh_CN.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=zh_CN.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=zh_CN.UTF-8 LC_IDENTIFICATION=C
time zone: Asia/Shanghai
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] bambu_3.2.4 BSgenome_1.68.0
[3] rtracklayer_1.60.0 Biostrings_2.68.1
[5] XVector_0.40.0 SummarizedExperiment_1.30.2
[7] Biobase_2.60.0 GenomicRanges_1.52.0
[9] GenomeInfoDb_1.36.1 IRanges_2.34.1
[11] S4Vectors_0.38.1 BiocGenerics_0.46.0
[13] MatrixGenerics_1.12.2 matrixStats_1.1.0
loaded via a namespace (and not attached):
[1] KEGGREST_1.40.0 rjson_0.2.21 lattice_0.22-5
[4] vctrs_0.6.5 tools_4.3.2 bitops_1.0-7
[7] generics_0.1.3 curl_5.1.0 parallel_4.3.2
[10] tibble_3.2.1 fansi_1.0.5 AnnotationDbi_1.62.2
[13] RSQLite_2.3.3 blob_1.2.4 pkgconfig_2.0.3
[16] Matrix_1.6-4 data.table_1.14.8 dbplyr_2.4.0
[19] lifecycle_1.0.4 GenomeInfoDbData_1.2.10 compiler_4.3.2
[22] stringr_1.5.1 Rsamtools_2.16.0 progress_1.2.2
[25] codetools_0.2-19 RCurl_1.98-1.13 yaml_2.3.7
[28] tidyr_1.3.0 pillar_1.9.0 crayon_1.5.2
[31] BiocParallel_1.34.2 DelayedArray_0.26.6 cachem_1.0.8
[34] tidyselect_1.2.0 digest_0.6.33 stringi_1.8.2
[37] purrr_1.0.2 dplyr_1.1.4 restfulr_0.0.15
[40] biomaRt_2.56.1 fastmap_1.1.1 grid_4.3.2
[43] cli_3.6.1 magrittr_2.0.3 S4Arrays_1.0.4
[46] GenomicFeatures_1.52.1 utf8_1.2.4 XML_3.99-0.16
[49] withr_2.5.2 rappdirs_0.3.3 filelock_1.0.2
[52] prettyunits_1.2.0 xgboost_1.7.6.1 bit64_4.0.5
[55] httr_1.4.7 bit_4.0.5 png_0.1-8
[58] hms_1.1.3 memoise_2.0.1 BiocIO_1.10.0
[61] BiocFileCache_2.8.0 rlang_1.1.2 Rcpp_1.0.11
[64] glue_1.6.2 DBI_1.1.3 BiocManager_1.30.22
[67] xml2_1.3.5 jsonlite_1.8.7 R6_2.5.1
[70] GenomicAlignments_1.36.0 zlibbioc_1.46.0

Many thanks!

andredsim · 2023-12-04T04:12:41Z

Hi,

Thanks for providing me these files. I had a more in-depth look at BambuTx2 and for this particular transcript there are two splice junctions, the first has the "CT-AC" motif and the second "GC-AG". The first motif is associated with canonical negative strand transcripts and the second with positive. As the reads in unstranded protocols themselves provide no strand information we attempt to guess them using splice junctions, using the direction which has the most junctions supporting them. Earlier I wrongly said that novel spliced transcripts should be assigned a strand, as in this case there is equal support for it being from the positive and negative strand which is why it is assigned '*' as we cannot be certain of the strand.

Because of this ambiguity in stranding you may want to remove these transcripts from the reference annotations depending on what you plan to do with these annotations downstream. However there does appear to be convincing read support for these junctions, therefore if you were interested in this gene locus, experimental validation would be recommended to confirm what was seen with Nanopore sequencing.

I hope this explains why there are spliced transcripts labeled with '*' and hopefully gives you some guidance on how to proceed.
Let me know if you have further questions.

Kind Regards,
Andre Sim

creaturemoon · 2023-12-04T11:52:14Z

Hi Andre,

Thanks very much. Your explanation is very helpful to me.

creaturemoon closed this as completed Dec 4, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A few transcripts without strand infomation in the extended_annotations.gtf #403

A few transcripts without strand infomation in the extended_annotations.gtf #403

creaturemoon commented Nov 21, 2023

andredsim commented Nov 27, 2023

creaturemoon commented Dec 3, 2023

andredsim commented Dec 4, 2023

creaturemoon commented Dec 4, 2023

A few transcripts without strand infomation in the extended_annotations.gtf #403

A few transcripts without strand infomation in the extended_annotations.gtf #403

Comments

creaturemoon commented Nov 21, 2023

andredsim commented Nov 27, 2023

creaturemoon commented Dec 3, 2023

andredsim commented Dec 4, 2023

creaturemoon commented Dec 4, 2023