Issue: sample size of more than 100000 sequences #40

kgrigaityte · 2019-03-04T01:48:17Z

Hello,

I'm trying to run IGoR on my T cell receptor beta chain sequences, and everything works great until my sample size exceeds 100,000 sequences.

I'm getting the following error when using the -evaluate command:

[IGoR] ERROR: Exception caught while reading J alignments before inference/evaluation. Make sure alignments were carried previously using "-align --J" or "-align --all" with similar path parameters (working directory, batchname, ...)

I have run -align --all, just as I did for all my other samples, and the J_alignments file was generated in the aligns folder and looks fine. I tried splitting the sample into 4 files and running each separately, which worked perfectly, so it shouldn't be a problem with the sequences. It is only when I use the whole file that I get that error.
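For reference, the splitting workaround described above can be sketched as follows. This is a minimal illustration that assumes one sequence per line; the helper name `chunked` and the chunk size are hypothetical and not part of IGoR.

```python
from itertools import islice


def chunked(lines, chunk_size):
    """Yield successive lists of at most chunk_size items from an iterable of lines."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


# Made-up placeholder sequences, one per line (an assumed input layout).
sequences = ["seq_a", "seq_b", "seq_c", "seq_d", "seq_e"]
parts = list(chunked(sequences, 2))
```

Each part could then be written to its own file and processed under a separate batch name, so that no single run has to hold the alignments of the whole sample.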

Do you have any advice on how to get around this, or are there limitations on file size?

Thanks,
Kristina

qmarcou · 2019-03-05T08:19:55Z

Hello @kgrigaityte,
For now, IGoR loads all alignments into memory and stores them there; I guess this strategy is problematic when running over large alignment files. You should have a second line in the error message giving you the error type. Could you please paste the complete error message (or just edit your post to include it)?
There is a tradeoff between browsing a large alignment file for every sequence on the fly (which uses virtually no memory but requires parsing the complete file for each sequence) and storing every alignment in memory (which uses a lot of memory but parses the alignment file only once).
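The two strategies can be contrasted with a small sketch. It is only an illustration of the tradeoff, not IGoR's actual reader: the semicolon-separated layout, the column names, and the function names `load_all` and `scan_for` are assumptions made for this example.

```python
import csv
import io
from collections import defaultdict

# Hypothetical alignment table: one row per (sequence, gene) alignment.
# The column layout is an assumption for illustration only.
SAMPLE = """seq_index;gene_name;score
0;TRBJ1-1*01;75
0;TRBJ1-2*01;40
1;TRBJ2-7*01;80
"""


def load_all(handle):
    """Parse the file once and keep every alignment in memory."""
    by_seq = defaultdict(list)
    for row in csv.DictReader(handle, delimiter=";"):
        by_seq[int(row["seq_index"])].append((row["gene_name"], float(row["score"])))
    return by_seq


def scan_for(handle, seq_index):
    """Re-read the whole file for one sequence, keeping only that sequence's rows."""
    handle.seek(0)
    return [
        (row["gene_name"], float(row["score"]))
        for row in csv.DictReader(handle, delimiter=";")
        if int(row["seq_index"]) == seq_index
    ]


handle = io.StringIO(SAMPLE)
in_memory = load_all(handle)      # one pass, memory grows with the file
on_the_fly = scan_for(handle, 0)  # one full pass per lookup, little memory held
```

With N sequences, the first approach reads the file once but holds every row, while the second holds almost nothing but reads the file N times.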
To reduce memory usage, there are two paths you could explore:

- Filter alignments more drastically when aligning or reading alignments, by adjusting the alignment score thresholds or relative score thresholds (although now that I think about it, I am not sure I have created a command line option for the latter yet).
- Try shortening your gene names (if you're using the complete IMGT name, the string will take up a lot of memory compared to a shorter name). This may sound silly, but it may be a real problem for large sequence sets.
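Both ideas can be sketched outside of IGoR as follows. This is a rough illustration under stated assumptions: alignments are represented as hypothetical `(seq_index, gene_name, score)` tuples, "relative score threshold" is interpreted here as a maximum gap to the best score for the same sequence, and the example name uses a made-up accession in a pipe-delimited IMGT-style layout.

```python
from collections import defaultdict


def filter_alignments(rows, min_score, max_gap_to_best):
    """Keep rows scoring at least min_score and within max_gap_to_best
    of the best score found for the same sequence."""
    best = defaultdict(lambda: float("-inf"))
    for seq_index, _, score in rows:
        best[seq_index] = max(best[seq_index], score)
    return [
        (seq_index, gene, score)
        for seq_index, gene, score in rows
        if score >= min_score and best[seq_index] - score <= max_gap_to_best
    ]


def shorten_imgt_name(full_name):
    """Return the second pipe-delimited field (the allele name) of an
    IMGT-style name, or the name unchanged if it has no pipe."""
    fields = full_name.split("|")
    return fields[1] if len(fields) > 1 else full_name


# Made-up rows and a made-up accession, for illustration only.
rows = [(0, "A", 75.0), (0, "B", 40.0), (1, "C", 30.0)]
kept = filter_alignments(rows, min_score=35, max_gap_to_best=20)
short = shorten_imgt_name("ACC0001|TRBJ1-1*01|Homo sapiens|F|J-REGION")
```

If names are shortened, the same short names would presumably need to be used consistently everywhere the genes are referenced, so the renaming is best applied before aligning.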

I'm a bit busy at the moment, but I'll try to spend some time finding a better tradeoff in terms of input reading for large datasets once I get some time.
Hope this helps!

qmarcou added the bug label Mar 5, 2019
