Error in Assigning Gene IDs to Unstranded Read Clusters #368

HongYhong · 2023-04-30T08:56:33Z

Hi, thank you for contributing such a great tool. I have a large BAM sample and encountered the following problem while running it:

Error in vapply(x, NROW, integer(1)) : values must be type 'integer',
but FUN(X[[1]]) result is type 'double'
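
For readers unfamiliar with this error, here is a minimal sketch of how it can arise. It assumes (this is not confirmed in the thread) that a count exceeded R's integer limit and was therefore returned as a double; the variable `bigCount` is purely illustrative.

```r
# R integers are 32-bit, so the largest representable integer is
# .Machine$integer.max (2147483647). Adding a double 1 yields a double.
bigCount <- .Machine$integer.max + 1
typeof(bigCount)

# vapply() enforces the declared return type (here integer(1)),
# so a function that returns a double triggers the error reported above.
res <- tryCatch(
  vapply(list(bigCount), function(n) n, integer(1)),
  error = function(e) conditionMessage(e)
)
res
```

The same type check fails whenever the function passed to vapply returns a double where an integer was declared, which is consistent with a count above roughly 2*10^9.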

I tried to debug it and found that the problem occurs when assigning gene IDs to unstranded read clusters with multiple hits:

mcols(grl)$GENEID[!strandedRanges] <- assignGeneIdsByReference(grl[!strandedRanges],
    grl[!is.na(mcols(grl)$GENEID)],
    min.exonOverlap = min.exonOverlap,
    fusionMode = FALSE)

More specifically, it happens during the expandRangesList step and seems to be caused by too many combinations (>2*10^9). Do you have any suggestions? Could I instead compute the intersection directly with

rangeIntersect = intersect(ranges(grl[queryHits(ov)[multiHits]]),
    ranges(geneRanges[subjectHits(ov)[multiHits]]))

and then group by ID?
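
The idea of intersecting each overlapping pair and then grouping by ID can be sketched on toy data as follows. This is an illustration under stated assumptions, not the code merged for this issue: the toy `grl` and `geneRanges` objects, the `overlapWidth` and `geneId` variables, and the rule "keep the gene with the largest total overlap" are all hypothetical, and the sketch requires the Bioconductor package GenomicRanges.

```r
library(GenomicRanges)

# Hypothetical toy data: two unstranded read clusters and two annotated genes.
grl <- GRangesList(
  rc1 = GRanges("chr1", IRanges(c(100, 300), c(200, 400)), strand = "*"),
  rc2 = GRanges("chr1", IRanges(1000, 1100), strand = "*")
)
geneRanges <- GRangesList(
  geneA = GRanges("chr1", IRanges(c(150, 350), c(250, 450)), strand = "+"),
  geneB = GRanges("chr1", IRanges(180, 320), strand = "-")
)

# All (read cluster, gene) pairs that overlap, ignoring strand.
ov <- findOverlaps(grl, geneRanges, ignore.strand = TRUE)
qh <- queryHits(ov)
sh <- subjectHits(ov)

# Element-wise intersection of each hit pair: one result per hit,
# so nothing is expanded into all exon-by-exon combinations.
rangeIntersect <- intersect(ranges(grl[qh]), ranges(geneRanges[sh]))
overlapWidth <- as.integer(unname(sum(width(rangeIntersect))))

# Group by read cluster and keep the gene with the largest overlap
# (assumed selection rule, for illustration only).
ord <- order(qh, -overlapWidth)
keep <- ord[!duplicated(qh[ord])]
geneId <- rep(NA_character_, length(grl))
geneId[qh[keep]] <- names(geneRanges)[sh[keep]]
geneId
```

Because the intersection is computed once per overlapping pair, the size of the intermediate object grows with the number of hits rather than with the number of range combinations.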

andredsim · 2023-05-02T07:42:16Z

Hi, sorry for the delayed response.

Thank you for your thorough attempt to debug it. I had a look at this part of the code, and I believe your workaround should produce the same outcome and be more memory efficient too. I would need to do some benchmarking to see how it affects performance on large datasets, because the pairwise intersect checks could take longer to run (which is of course better than crashing due to running out of memory or hitting object size constraints!).

Another solution could be to first assign gene IDs to the unstranded ranges using the much smaller annotation set, and then assign the remaining unannotated unstranded ranges using the annotated grl (the reverse of how it is done now). This would need some evaluation on our end, as it could change the results in unpredictable ways.

So that I have a frame of reference, would you mind letting me know how large your BAM file is, both in gigabytes and in number of reads? Could you also let me know how long grl[!strandedRanges] is?

Let us know how this goes, and I will post back here once I find some time to evaluate these changes.

HongYhong · 2023-05-04T03:22:31Z

Hi @andredsim,
The BAM file has a size of 41GB and 73 million reads.

The information for the grl[!strandedRanges] object:

length(grl[!strandedRanges])
[1] 536052

table(lengths(grl[!strandedRanges]))
1 2 3 4 5 6 7 8 9
348083 169408 16956 1452 127 18 6 1 1

andredsim · 2023-05-15T09:58:31Z

Hi,

I tested your suggested change and it turned out to be both faster and more memory efficient than the fixes I had in mind, so I have incorporated it into PR #372. I also added a small change that makes it slightly faster still.
I imagine you have already been running it with your own code change, but in case you haven't, feel free to use that branch in the meantime.

Thank you for the report and the code suggestion.

Kind Regards,
Andre Sim

reduce memory and speed up assignGeneId from issue #368

andredsim closed this as completed May 15, 2023

andredsim added a commit that referenced this issue Jul 7, 2023

Merge pull request #372 from GoekeLab/improve_assignGeneId

1c1900e

reduce memory and speed up assignGeneId from issue #368
