Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merging transcript gtfs across multiple samples #444

Closed
apsteinberg opened this issue Aug 29, 2024 · 5 comments
Closed

Merging transcript gtfs across multiple samples #444

apsteinberg opened this issue Aug 29, 2024 · 5 comments

Comments

@apsteinberg
Copy link

Hello,

Thank you for all your help getting Bambu set up for my analyses. I am working with a dataset of 44 samples, and we are actively growing this cohort. I'm wondering if there is a way you would recommend merging the output from individual Bambu runs across these samples? I know Bambu has the ability to run multiple samples at once, but I think it makes more sense for us to merge post-analysis here since we are actively increasing our sample count. My original plan was to take the output gtfs and use Stringtie merge to create a unified gtf, then run Bambu in quantify mode to get transcript counts, but I thought that perhaps there is some way to do it with Bambu itself?

Thanks,
Asher

@andredsim
Copy link
Collaborator

Hi Asher,

If you kept the rcOutDir output which will be a .rds file, you can run the extend annotation step again which should only take a couple of minutes, be sure to set quant to FALSE.

mergedAnno = bambu(reads = c("output.rds", "output2.rds"), fa.file = "genome.fa", annotations = annotations, quant = FALSE)

Otherwise you will need to regenerate the rcFiles again like so.

sample1 = bambu(reads = c(sample1, sample2), fa.file = "genome.fa", annotations = annotations, discovery = FALSE, quant = FALSE, rcOutDir = "outputDir/") 
sample2 = bambu(reads = c(sample1, sample2), fa.file = "genome.fa", annotations = annotations, discovery = FALSE, quant = FALSE, rcOutDir = "outputDir/") 

mergedAnno = bambu(reads = c(sample1, sample2), fa.file = "genome.fa", annotations = annotations, quant = FALSE) 

If you only have the output gtfs, there is no way to merge those at this stage, so you could do the Stringtie merge method you mentioned, but just be aware this will increase your false positive rate.

Kind Regards,
Andre Sim

@apsteinberg
Copy link
Author

Hi Andre,

Thanks! I'm giving this a try now. It seems a bit slower than I would expect currently (it's taking an hour or so even with quant = FALSE. Would you recommend trying to use the ncores flag to try to provide more computing power?

Best,
Asher

@apsteinberg
Copy link
Author

apsteinberg commented Aug 30, 2024

Also sorry one more follow-up, it's okay to provide an NDR here with the NDR option right? I specified as follows:

mergedAnno <- bambu(reads = rds, 
                   genome = ref_genome, 
                   annotations = annotations, 
                   quant = FALSE,
                   NDR = 0.1,
                   verbose=TRUE
                   ) 

Thanks,
Asher

@andredsim
Copy link
Collaborator

Hi Asher,

Unfortunately if you are running using the read class .rds files the extendAnnotations step isn't parallelized so changing the ncore won't help. If you are regenerating the .rds files, higher cores will increase the speed in which they are generated. However I am surprised it would take longer than an hour, with only 44 samples.

One thing that happens to me is if I load in the .rds files, they should be in a list, or else it will try and iterate through 1 summarized experiment object, which takes ages and eventually crashes, so just double check you arn't doing that. The input should either be a vector of paths to the .rds files, or a list of the summarized experiment objects.

Upon more reflection, I think the long run time could be due to combining single exon read classes which are not usually used for discovery anyway. You can try this branch remov_SE_combineAnno. It recently had the master branched merged into it, so the results should be the same, but skips the single exon combination step.

Regarding the second question, yes you can use the NDR option here and you are doing it correctly.

Let me know if this helps.

Kind Regards,
Andre Sim

@apsteinberg
Copy link
Author

Hi Andre,

Sorry I didn't reply sooner. Got it this all makes sense!

Thanks for the warning about loading the reads files. I think that is being done correctly because it did end up outputting an annotation.

Makes sense about the single exon read classes, I will give this a try and see if it speeds things up.

Regarding the NDR option, that's great.

Thank you for all your help, and I think this is fully resolved now if you would like to close out this issue. I'm excited as this provides an easy way for us to extend our Bambu analysis across our growing cohort of samples. :)

Best wishes,
Asher

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants