Issue: sample size of more than 100000 sequences #40

Open
kgrigaityte opened this issue Mar 4, 2019 · 1 comment

@kgrigaityte

Hello,

I'm trying to run IGoR on my T cell receptor beta chain sequences, and everything works great until my sample size is above 100,000 sequences.

I'm getting the following error when using the -evaluate command:

[IGoR] ERROR: Exception caught while reading J alignments before inference/evaluation. Make sure alignments were carried previously using "-align --J" or "-align --all" with similar path parameters (working directory, batchname, ...)

I have run -align --all, just like I did for all my other samples, and the J_alignments file was generated in the aligns folder and looks fine. I tried splitting the sample into 4 files and processing each separately, which worked perfectly, so it shouldn't be a problem with the sequences. It is only when I use the whole file that I get that error.

Do you have any advice on how to get around this, or are there limitations on file size?

Thanks,
Kristina

@qmarcou
Owner

qmarcou commented Mar 5, 2019

Hello @kgrigaityte,
For now IGoR loads all alignments into memory and stores them there; I guess this strategy becomes problematic when running on large alignment files. There should be a second line in the error message giving the error type. Could you please paste the complete error message (or just edit your post to include it)?
There is a tradeoff between browsing a large alignment file for every sequence on the fly (which uses virtually no memory but requires parsing the complete file for each sequence) and storing every alignment in memory (which uses a lot of memory but parses the alignment file only once).
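
The two extremes of that tradeoff can be sketched in Python. This is only an illustration, not IGoR's actual reader: the semicolon-separated layout with `seq_index`, `gene_name` and `score` columns is an assumption about the alignment file, and both function names are hypothetical.

```python
import csv
from collections import defaultdict


def load_all_alignments(path):
    """Parse the file once and keep every alignment in memory.

    Fast lookups afterwards, but memory grows with the file size.
    Assumes a ';'-separated file with a 'seq_index' column.
    """
    by_seq = defaultdict(list)
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter=";"):
            by_seq[int(row["seq_index"])].append(row)
    return by_seq


def scan_alignments_for(path, seq_index):
    """Re-read the whole file and keep only one sequence's alignments.

    Uses almost no memory, but costs a full pass over the file
    for every sequence that is looked up.
    """
    with open(path, newline="") as handle:
        return [
            row
            for row in csv.DictReader(handle, delimiter=";")
            if int(row["seq_index"]) == seq_index
        ]
```

Both functions return the same rows for a given sequence; they differ only in where the cost lands (memory for the first, repeated parsing for the second).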
In order to reduce memory usage there are two paths you could exploit:

  • filter alignments more drastically when aligning or when reading alignments, by playing with alignment score thresholds or relative score thresholds (although now that I think about it, I am not sure I have created a command line option for the latter yet).
  • try to shorten your gene names (if you're using the complete IMGT names, the strings will take up a lot of memory compared to shorter names). This may sound silly, but it can be a real problem for large sequence sets.

I'm a bit busy at the moment, but I'll try to spend some time finding a better tradeoff for input reading on large datasets once I get some time.
Hope this helps!
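
As a rough illustration of the two suggestions above, here is a minimal Python sketch. It is not part of IGoR: the helper names are hypothetical, the header layout (pipe-separated fields with the allele name in second position) is an assumption about IMGT-style FASTA headers, and the `score` key is an assumption about how alignment rows are represented.

```python
def shorten_imgt_name(header):
    """Return the allele field of an IMGT-style FASTA header.

    Assumes pipe-separated fields with the allele name second,
    e.g. 'ACCESSION|TRBJ1-1*01|Homo sapiens|...'. Headers without
    pipes are returned unchanged (minus any leading '>').
    """
    name = header.lstrip(">").strip()
    fields = name.split("|")
    return fields[1] if len(fields) > 1 else name


def keep_high_scoring(rows, min_score):
    """Yield only alignment rows whose score reaches min_score.

    `rows` is any iterable of dicts with a 'score' key (assumed layout),
    so it can be fed lazily from a file reader without loading everything.
    """
    for row in rows:
        if float(row["score"]) >= min_score:
            yield row
```

If gene names are shortened, that presumably has to happen in the genomic template files before aligning, so that the alignments and the model refer to the same names.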

qmarcou added the bug label Mar 5, 2019