Error in Assigning Gene IDs to Unstranded Read Clusters #368
Hi, sorry for the delayed response, and thank you for your thorough attempt to debug this. I had a look at this part of the code, and I believe your workaround should produce the same outcome and be more memory efficient too. I would need to do some benchmarking to see how it changes performance on large datasets, because the pairwise intersect checks could take longer to run (which is of course better than crashing from running out of memory or hitting object size constraints!). Another solution could be to assign gene IDs for the unstranded ranges against the much smaller annotation set first, and then assign the remaining unannotated unstranded ranges against the annotated grl (the opposite of how it is done now). This will need some evaluation on our end, as it would change the results in an unpredictable way. So that I can get a frame of reference, would you mind letting me know how large your BAM file is, both in gigabytes and in number of reads? Also, could you let me know how long grl[!strandedRanges] is? Let us know how this goes, and I will post back here once I find some time to evaluate these changes.
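To illustrate the pairwise idea discussed in this comment, here is a minimal base R sketch. It is not bambu's implementation: the `hits` table, its column names, and the coordinates are all hypothetical, and plain integer vectors stand in for the ranges objects. Each row pairs one query range with one subject range, so the overlap is computed row by row and then grouped by ID, without materialising every combination.

```r
# Hypothetical paired hits: each row pairs one query range with one subject range.
hits <- data.frame(
  query_id = c(1L, 1L, 2L, 2L, 2L),
  gene_id  = c("geneA", "geneB", "geneA", "geneA", "geneC"),
  q_start  = c(100L, 100L, 500L, 700L, 500L),
  q_end    = c(200L, 200L, 600L, 800L, 600L),
  s_start  = c(150L, 190L, 550L, 650L, 900L),
  s_end    = c(300L, 400L, 580L, 750L, 950L),
  stringsAsFactors = FALSE
)

# Pairwise overlap width for closed intervals; non-overlapping pairs give 0.
hits$overlap <- pmax(0L, pmin(hits$q_end, hits$s_end) -
                         pmax(hits$q_start, hits$s_start) + 1L)

# Group by query and gene, summing the overlap within each pair of IDs.
by_gene <- aggregate(overlap ~ query_id + gene_id, data = hits, FUN = sum)

# For each query, keep the gene with the largest total overlap.
by_gene <- by_gene[order(by_gene$query_id, -by_gene$overlap), ]
best <- by_gene[!duplicated(by_gene$query_id), ]
print(best)
```

Memory use here grows with the number of paired hits rather than with the number of expanded combinations, which is the trade-off against run time mentioned above.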
Hi @andredsim, The information for the grl[!strandedRanges] object:
Hi, I tested your suggested change and it ended up being both faster and more memory efficient than the ideas I had for fixing it, so I have incorporated it into this PR. I added a small change that makes it slightly faster too. Thank you for the report and the code suggestion. Kind regards,
reduce memory and speed up assignGeneId from issue #368
Hi, thank you for contributing such a great tool. I have a large BAM sample and encountered the following problem while running it:
Error in vapply(x, NROW, integer(1)) : values must be type 'integer',
but FUN(X[[1]]) result is type 'double'
I tried to debug it and found that the problem occurs when assigning gene IDs to unstranded read clusters with multiple hits:
mcols(grl)$GENEID[!strandedRanges] <- assignGeneIdsByReference(grl[!strandedRanges],
grl[!is.na(mcols(grl)$GENEID)],
min.exonOverlap = min.exonOverlap,
fusionMode = FALSE)
More specifically, the error occurs during the expandRangesList step and seems to be caused by too many combinations (>2*10^9). Do you have any suggestions for this? Could I directly use rangeIntersect = intersect(ranges(grl[queryHits(ov)[multiHits]]),
ranges(geneRanges[subjectHits(ov)[multiHits]])) and then group by ID?
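The reported error is consistent with an object growing past R's 32-bit integer limit (`.Machine$integer.max`, about 2.1*10^9), at which point lengths are returned as doubles rather than integers. As a rough sketch, assuming the expansion creates one element per combination, the total can be estimated before expanding. The counts below are hypothetical and are not taken from the sample in this issue.

```r
# Hypothetical sizes of the two sides being combined, one entry per cluster.
n_left  <- c(50000L, 40000L)
n_right <- c(30000L, 25000L)

# Convert to double before multiplying and summing: integer arithmetic
# returns NA (with a warning) once a result exceeds .Machine$integer.max.
n_combinations <- sum(as.numeric(n_left) * as.numeric(n_right))

# TRUE means the expanded object could not be indexed with 32-bit integers.
exceeds_limit <- n_combinations > .Machine$integer.max
print(exceeds_limit)
```

A check like this could be run before the expansion step to decide whether to fall back to a pairwise approach.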