Error in Assigning Gene IDs to Unstranded Read Clusters #368

Closed
HongYhong opened this issue Apr 30, 2023 · 3 comments

@HongYhong

Hi, thank you for contributing such a great tool. I have a large BAM sample and encountered the following problem while running it:

Error in vapply(x, NROW, integer(1)) : values must be type 'integer',
but FUN(X[[1]]) result is type 'double'
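
For readers unfamiliar with this message: it comes from the type check in vapply, whose template integer(1) rejects any result of type double. The base-R sketch below reproduces only that mechanism. The helper count_as_double is hypothetical and stands in for a size that no longer fits in an integer; base R reports lengths as doubles once they exceed .Machine$integer.max. Whether that is what happened inside the failing call is an assumption here, although it would be consistent with the combination count mentioned further down.

```r
# Hypothetical stand-in for a size function that returns a double,
# as base R does for lengths beyond .Machine$integer.max.
count_as_double <- function(x) as.double(length(x))

x <- list(1:3, 1:5)

# Works: length() returns an integer for ordinary vectors.
ok <- vapply(x, length, integer(1))

# Fails like the reported error: the result is a double,
# but the template demands an integer.
msg <- tryCatch(
    vapply(x, count_as_double, integer(1)),
    error = function(e) conditionMessage(e)
)

# Largest value an R integer can hold (2^31 - 1).
int_max <- .Machine$integer.max
```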

I tried to debug it and found that the problem occurs when assigning gene IDs to unstranded read clusters with multiple hits:

mcols(grl)$GENEID[!strandedRanges] <- assignGeneIdsByReference(
    grl[!strandedRanges],
    grl[!is.na(mcols(grl)$GENEID)],
    min.exonOverlap = min.exonOverlap,
    fusionMode = FALSE
)

More specifically, it fails during the expandRangesList step, which seems to be caused by too many combinations (more than 2 × 10^9). Do you have any suggestions? Could I instead compute the intersection directly and then group by ID, like this:

rangeIntersect <- intersect(ranges(grl[queryHits(ov)[multiHits]]),
    ranges(geneRanges[subjectHits(ov)[multiHits]]))
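
To make the "intersect per hit, then group by ID" idea concrete, here is a toy base-R sketch. It is not bambu's implementation and does not use GenomicRanges objects: the data frames clusters, genes and hits, and the helper overlap_width, are all hypothetical stand-ins. The point it illustrates is that each hit pair is intersected on its own, so memory grows with the number of hits rather than with one large expansion across all pairs.

```r
# Toy data (hypothetical): each row of `hits` pairs one read cluster with one
# candidate gene, mirroring queryHits(ov)[multiHits] / subjectHits(ov)[multiHits].
clusters <- list(
    c1 = data.frame(start = c(100, 300), end = c(200, 400))
)
genes <- list(
    geneA = data.frame(start = c(150, 350), end = c(250, 360)),
    geneB = data.frame(start = 190, end = 310)
)
hits <- data.frame(query = c("c1", "c1"), subject = c("geneA", "geneB"),
                   stringsAsFactors = FALSE)

# Total overlapping bases between two sets of closed integer intervals.
# Assumes the intervals within each set do not overlap one another.
overlap_width <- function(a, b) {
    w <- outer(seq_len(nrow(a)), seq_len(nrow(b)), function(i, j) {
        pmin(a$end[i], b$end[j]) - pmax(a$start[i], b$start[j]) + 1
    })
    sum(pmax(w, 0))
}

# One intersection per hit pair; nothing is expanded across pairs.
hits$width <- mapply(
    function(q, s) overlap_width(clusters[[q]], genes[[s]]),
    hits$query, hits$subject,
    USE.NAMES = FALSE
)

# Group by read cluster and keep the gene with the largest overlap.
best <- vapply(
    split(hits, hits$query),
    function(h) h$subject[which.max(h$width)],
    character(1)
)
```

In this toy case cluster c1 overlaps geneA by 62 bases and geneB by 22, so geneA is chosen.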

@andredsim
Collaborator

Hi, sorry for the delayed response.

Thank you for your thorough attempt to debug it. I had a look at this part of the code, and I believe your workaround should produce the same outcome and be more memory efficient too. I would need to do some benchmarking to see how it changes performance on large datasets, because the pairwise intersect checks could take longer to run (which is of course better than crashing because of memory exhaustion or object size constraints!).

Another solution could be to assign gene IDs to the unstranded ranges against the much smaller annotation set first, and then assign the remaining unannotated unstranded ranges against the annotated grl (the opposite of how it is done now). This will need some evaluation from our end, as it would change the results in an unpredictable way.

So that I have a frame of reference, would you mind letting me know how large your BAM file is, both in gigabytes and in number of reads? Also, could you let me know how long grl[!strandedRanges] is?

Let us know how this goes, and I will post back here once I find some time to evaluate these changes.

@HongYhong
Author

Hi @andredsim,
The BAM file is 41 GB and contains 73 million reads.

The information for the grl[!strandedRanges] object:

length(grl[!strandedRanges])
[1] 536052

table(lengths(grl[!strandedRanges]))
     1      2      3      4      5      6      7      8      9
348083 169408  16956   1452    127     18      6      1      1
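
As a quick sanity check on these numbers, the counts can be totalled in base R. This is only arithmetic on the values reported above; the variable names are hypothetical.

```r
# Counts copied from the table() output above; names are ranges per cluster.
n_ranges   <- 1:9
n_clusters <- c(348083, 169408, 16956, 1452, 127, 18, 6, 1, 1)

# Number of clusters; should match length(grl[!strandedRanges]).
total_clusters <- sum(n_clusters)

# Individual ranges summed over all clusters.
total_ranges <- sum(n_ranges * n_clusters)

# Fraction of clusters that consist of a single range.
share_single <- n_clusters[1] / total_clusters
```

The cluster total is 536052, matching the reported length, and the clusters contain 744377 ranges in all, about 65% of them in single-range clusters. Both figures are far below 2 × 10^9, which suggests that the size problem arises from the combinations built during expansion rather than from the input itself.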

@andredsim
Collaborator

Hi,

I tested your suggested change, and it ended up being BOTH faster and more memory efficient than the ideas I had for fixing it, so I have incorporated it into this PR: #372. I also added a small change that makes it slightly faster.
I imagine you have already been running it with your own code change, but in case you haven't, feel free to use this branch in the meantime.

Thank you for the report and the code suggestion.

Kind Regards,
Andre Sim

andredsim added a commit that referenced this issue Jul 7, 2023
reduce memory and speed up assignGeneId from issue #368