Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bambu could not tell the gene id right from the input gtf file #434

Closed
abcdtree opened this issue Jun 25, 2024 · 3 comments
Closed

Bambu could not tell the gene id right from the input gtf file #434

abcdtree opened this issue Jun 25, 2024 · 3 comments

Comments

@abcdtree
Copy link

Hi,

Thanks for providing such a useful tool.

I am running bambu on ferret sequence, and the gtf format as below:

NW_025422005.1	Gnomon	transcript	20261931	20265242	.	-	.	transcript_id "rna-XM_045064039.1"; gene_id "gene-IFI6"; gene_name "IFI6"
NW_025422005.1	Gnomon	exon	20261931	20262296	.	-	.	transcript_id "rna-XM_045064039.1"; gene_id "gene-IFI6"; gene_name "IFI6";
...
NW_025422005.1	Gnomon	transcript	20262114	20265254	.	-	.	transcript_id "rna-XM_004741051.3"; gene_id "gene-IFI6"; gene_name "IFI6"
NW_025422005.1	Gnomon	exon	20262114	20262296	.	-	.	transcript_id "rna-XM_004741051.3"; gene_id "gene-IFI6"; gene_name "IFI6";

After the quantification and transcript discovery, the gene id output in the count file will be like:

transcript_id rna-XM_004741051.3; gene-IFI6	9375	4878	11277	7865	83	71	17	42	14080	146	46	40	6090	6682
transcript_id rna-XM_045064039.1; gene-IFI6	3617	1990	4674	3927	40	42	23	18	77	41	58	17	18	2409	2277

This is the gene count table, and for this IFI6 gene, there are two rows (records), which did not merged together, because of the chopped gene id as transcript_id rna-XM_004741051.3; gene-IFI6 instead of gene-IFI6 only.

This issue also makes the extended_annotation.gtf looks messy. as below:

NW_025422005.1	Bambu	transcript	20262114	20265254	.	-	.	gene_id "transcript_id rna-XM_004741051.3; gene-IFI6"; transcript_id "rna-XM_004741051.3";
NW_025422005.1	Bambu	exon	20262114	20262296	.	-	.	gene_id "transcript_id rna-XM_004741051.3; gene-IFI6"; transcript_id "rna-XM_004741051.3"; exon_number "5";
NW_025422005.1	Bambu	exon	20263370	20263562

I just wonder whether I could merge the counts from the same gene together in this case. Or whether I need to reformat my input gtf to fix it?

Cheers,

Josh

@andredsim
Copy link
Collaborator

Hi Josh,

This is quite strange, and it is not intended behavior. I checked the regex patterns we use for our gtf parser function on the snippet you provided and it produces the expected results (the transcript id is not incorporated into the gene name). Would you please be able to upload a truncated version of your input gtf so that I can test in Bambu and see if I can replicate the issue?

Kind Regards,
Andre Sim

@abcdtree
Copy link
Author

ferret.gtf.zip
Hi Dear Andre,

Thank you so much for checking. I uploaded the gtf file I used here.
Hope you can replicate the issue.

Cheers,

Josh

@andredsim
Copy link
Collaborator

Hi Josh,

Thanks for sending that. I was able to find the cause quite quickly thanks to that, it was because we (incorrectly) expected the gene_id to be the first attribute as is the case in many gtf files, however in gtf file the transcript_id came first, which is why it got wrapped up.

I have created a branch "fix_gtf_parse_bug" which implements the small fix that should resolve this issue. Please pull that branch and see if it works for you.

Thanks,
Andre Sim

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants