Merging transcript gtfs across multiple samples #444

apsteinberg · 2024-08-29T14:16:10Z

Hello,

Thank you for all your help getting Bambu set up for my analyses. I am working with a dataset of 44 samples, and we are actively growing this cohort. I'm wondering if there is a way you would recommend merging the output from individual Bambu runs across these samples? I know Bambu has the ability to run multiple samples at once, but I think it makes more sense for us to merge post-analysis here since we are actively increasing our sample count. My original plan was to take the output gtfs and use Stringtie merge to create a unified gtf, then run Bambu in quantify mode to get transcript counts, but I thought that perhaps there is some way to do it with Bambu itself?

Thanks,
Asher

The text was updated successfully, but these errors were encountered:

andredsim · 2024-08-30T01:16:47Z

Hi Asher,

If you kept the rcOutDir output which will be a .rds file, you can run the extend annotation step again which should only take a couple of minutes, be sure to set quant to FALSE.

mergedAnno = bambu(reads = c("output.rds", "output2.rds"), fa.file = "genome.fa", annotations = annotations, quant = FALSE)

Otherwise you will need to regenerate the rcFiles again like so.

sample1 = bambu(reads = c(sample1, sample2), fa.file = "genome.fa", annotations = annotations, discovery = FALSE, quant = FALSE, rcOutDir = "outputDir/") 
sample2 = bambu(reads = c(sample1, sample2), fa.file = "genome.fa", annotations = annotations, discovery = FALSE, quant = FALSE, rcOutDir = "outputDir/") 

mergedAnno = bambu(reads = c(sample1, sample2), fa.file = "genome.fa", annotations = annotations, quant = FALSE)

If you only have the output gtfs, there is no way to merge those at this stage, so you could do the Stringtie merge method you mentioned, but just be aware this will increase your false positive rate.

Kind Regards,
Andre Sim

apsteinberg · 2024-08-30T21:40:48Z

Hi Andre,

Thanks! I'm giving this a try now. It seems a bit slower than I would expect currently (it's taking an hour or so even with quant = FALSE. Would you recommend trying to use the ncores flag to try to provide more computing power?

Best,
Asher

apsteinberg · 2024-08-30T21:48:38Z

Also sorry one more follow-up, it's okay to provide an NDR here with the NDR option right? I specified as follows:

mergedAnno <- bambu(reads = rds, 
                   genome = ref_genome, 
                   annotations = annotations, 
                   quant = FALSE,
                   NDR = 0.1,
                   verbose=TRUE
                   )

Thanks,
Asher

andredsim · 2024-09-02T05:11:19Z

Hi Asher,

Unfortunately if you are running using the read class .rds files the extendAnnotations step isn't parallelized so changing the ncore won't help. If you are regenerating the .rds files, higher cores will increase the speed in which they are generated. However I am surprised it would take longer than an hour, with only 44 samples.

One thing that happens to me is if I load in the .rds files, they should be in a list, or else it will try and iterate through 1 summarized experiment object, which takes ages and eventually crashes, so just double check you arn't doing that. The input should either be a vector of paths to the .rds files, or a list of the summarized experiment objects.

Upon more reflection, I think the long run time could be due to combining single exon read classes which are not usually used for discovery anyway. You can try this branch remov_SE_combineAnno. It recently had the master branched merged into it, so the results should be the same, but skips the single exon combination step.

Regarding the second question, yes you can use the NDR option here and you are doing it correctly.

Let me know if this helps.

Kind Regards,
Andre Sim

apsteinberg · 2024-09-05T21:20:43Z

Hi Andre,

Sorry I didn't reply sooner. Got it this all makes sense!

Thanks for the warning about loading the reads files. I think that is being done correctly because it did end up outputting an annotation.

Makes sense about the single exon read classes, I will give this a try and see if it speeds things up.

Regarding the NDR option, that's great.

Thank you for all your help, and I think this is fully resolved now if you would like to close out this issue. I'm excited as this provides an easy way for us to extend our Bambu analysis across our growing cohort of samples. :)

Best wishes,
Asher

andredsim closed this as completed Sep 6, 2024

apsteinberg mentioned this issue Sep 19, 2024

Merging read classes across samples results in fewer novel transcripts #448

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Merging transcript gtfs across multiple samples #444

Merging transcript gtfs across multiple samples #444

apsteinberg commented Aug 29, 2024

andredsim commented Aug 30, 2024

apsteinberg commented Aug 30, 2024

apsteinberg commented Aug 30, 2024 •

edited

Loading

andredsim commented Sep 2, 2024

apsteinberg commented Sep 5, 2024

Merging transcript gtfs across multiple samples #444

Merging transcript gtfs across multiple samples #444

Comments

apsteinberg commented Aug 29, 2024

andredsim commented Aug 30, 2024

apsteinberg commented Aug 30, 2024

apsteinberg commented Aug 30, 2024 • edited Loading

andredsim commented Sep 2, 2024

apsteinberg commented Sep 5, 2024

apsteinberg commented Aug 30, 2024 •

edited

Loading