Bambu could not tell the gene id right from the input gtf file #434

abcdtree · 2024-06-25T23:08:42Z

Hi,

Thanks for providing such a useful tool.

I am running bambu on ferret sequence, and the gtf format as below:

NW_025422005.1	Gnomon	transcript	20261931	20265242	.	-	.	transcript_id "rna-XM_045064039.1"; gene_id "gene-IFI6"; gene_name "IFI6"
NW_025422005.1	Gnomon	exon	20261931	20262296	.	-	.	transcript_id "rna-XM_045064039.1"; gene_id "gene-IFI6"; gene_name "IFI6";
...
NW_025422005.1	Gnomon	transcript	20262114	20265254	.	-	.	transcript_id "rna-XM_004741051.3"; gene_id "gene-IFI6"; gene_name "IFI6"
NW_025422005.1	Gnomon	exon	20262114	20262296	.	-	.	transcript_id "rna-XM_004741051.3"; gene_id "gene-IFI6"; gene_name "IFI6";

After the quantification and transcript discovery, the gene id output in the count file will be like:

transcript_id rna-XM_004741051.3; gene-IFI6	9375	4878	11277	7865	83	71	17	42	14080	146	46	40	6090	6682
transcript_id rna-XM_045064039.1; gene-IFI6	3617	1990	4674	3927	40	42	23	18	77	41	58	17	18	2409	2277

This is the gene count table, and for this IFI6 gene, there are two rows (records), which did not merged together, because of the chopped gene id as transcript_id rna-XM_004741051.3; gene-IFI6 instead of gene-IFI6 only.

This issue also makes the extended_annotation.gtf looks messy. as below:

NW_025422005.1	Bambu	transcript	20262114	20265254	.	-	.	gene_id "transcript_id rna-XM_004741051.3; gene-IFI6"; transcript_id "rna-XM_004741051.3";
NW_025422005.1	Bambu	exon	20262114	20262296	.	-	.	gene_id "transcript_id rna-XM_004741051.3; gene-IFI6"; transcript_id "rna-XM_004741051.3"; exon_number "5";
NW_025422005.1	Bambu	exon	20263370	20263562

I just wonder whether I could merge the counts from the same gene together in this case. Or whether I need to reformat my input gtf to fix it?

Cheers,

Josh

The text was updated successfully, but these errors were encountered:

andredsim · 2024-06-26T02:50:08Z

Hi Josh,

This is quite strange, and it is not intended behavior. I checked the regex patterns we use for our gtf parser function on the snippet you provided and it produces the expected results (the transcript id is not incorporated into the gene name). Would you please be able to upload a truncated version of your input gtf so that I can test in Bambu and see if I can replicate the issue?

Kind Regards,
Andre Sim

abcdtree · 2024-06-26T05:31:03Z

ferret.gtf.zip
Hi Dear Andre,

Thank you so much for checking. I uploaded the gtf file I used here.
Hope you can replicate the issue.

Cheers,

Josh

andredsim · 2024-06-26T06:00:27Z

Hi Josh,

Thanks for sending that. I was able to find the cause quite quickly thanks to that, it was because we (incorrectly) expected the gene_id to be the first attribute as is the case in many gtf files, however in gtf file the transcript_id came first, which is why it got wrapped up.

I have created a branch "fix_gtf_parse_bug" which implements the small fix that should resolve this issue. Please pull that branch and see if it works for you.

Thanks,
Andre Sim

andredsim closed this as completed Jun 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bambu could not tell the gene id right from the input gtf file #434

Bambu could not tell the gene id right from the input gtf file #434

abcdtree commented Jun 25, 2024

andredsim commented Jun 26, 2024

abcdtree commented Jun 26, 2024

andredsim commented Jun 26, 2024

Bambu could not tell the gene id right from the input gtf file #434

Bambu could not tell the gene id right from the input gtf file #434

Comments

abcdtree commented Jun 25, 2024

andredsim commented Jun 26, 2024

abcdtree commented Jun 26, 2024

andredsim commented Jun 26, 2024