The current protocol uses pandas, which is memory intensive and probably won't scale well unless we swap to some larger-than-memory pandas alternative. I think it would be best to just use seqkit. The expanded nonredundant file currently has the columns `,pephash,sample,contig,start,stop,strand,allStandardAA,seq`. This can be handled using `seqkit fx2tab` with `--seq-hash`, then feeding that output into `seqkit tab2fx` with the hash as the header. What we have is fine for now, but this will definitely be needed when scaling. At that point we should probably also use seqkit to split the nr data into max_threads files for parallelization (though that is an entirely different issue).
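To make the idea concrete, here is a rough pure-Python stand-in for the hash-as-header step, streaming one record at a time instead of loading a DataFrame. It is only a sketch: the function names `read_fasta` and `hash_headers` are made up for illustration, and using MD5 of the raw sequence string is an assumption, since how `pephash` is actually computed isn't stated here. Only the set of seen hashes is kept in memory, not the sequences.

```python
import hashlib
from typing import Iterable, Iterator, TextIO, Tuple


def read_fasta(handle: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs one record at a time (hypothetical helper)."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def hash_headers(records: Iterable[Tuple[str, str]], out: TextIO) -> int:
    """Write each distinct sequence once, with its hash as the FASTA header.

    Assumption: MD5 of the sequence string stands in for pephash.
    Returns the number of records written.
    """
    seen = set()
    for _header, seq in records:
        digest = hashlib.md5(seq.encode("ascii")).hexdigest()
        if digest in seen:
            continue  # redundant sequence, already written
        seen.add(digest)
        out.write(f">{digest}\n{seq}\n")
    return len(seen)
```

Called as `hash_headers(read_fasta(infile), outfile)` on open file handles, this never holds more than one sequence at a time. The per-record metadata (sample, contig, start, stop, strand) is dropped in this sketch and would still need to be written to a side table keyed by the hash.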