-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change fastx parser to klib/kseq #223
Conversation
Otherwise it won't compile
Klib simplified
I am currently running the tip of this branch on my TB data. Some samples have finished so I guess the changes haven't caused any major errors. Might be nice if you want to test it out on a sample you have some "expected" output for? Don't worry if not, I can run the current tip of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for this, nice PR! I made some comments, feel totally free to accept or reject the proposed changes.
I now prefer this solution to fasta/q parsing than including Seqan. It does the job very well, and it is a lot more lightweight, and the tooling overhead is almost 0, due to kseq
being a single-file header-only library.
It is fine to push unstable versions to dev branch, and the version we are using in the paper is properly tagged. Your test suite is also comprehensive. And if there is any outstanding issue with fasta/q parsing, your real data test should catch it... If you get the same results as in the tip of |
I have implemented the great suggestions @leoisl and updated the PR "changelog" accordingly. Let me know if you see any issues with my changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just minor suggestions...
Did your real data test (i.e. getting the output of the code in this branch vs dev
branch) yielded the same results? If so, I think implementation is fine!
In the process of checking this now. I'll post when I have. |
Ok. The output it exactly the same as the tip of |
We do not handle fastx files very robustly and allow invalid files. Additionally, as mentioned in #198, we don't correctly parse the sequence identifier. This PR aims to switch fastq handling to (predominantly)
klib
. As much as possible, I have tried to keep API changes toFastaqHandler
as minimal as possible.Note: These changes may be incompatible with indices built with previous versions. The reason for this is that if the PRG that was indexed was
then the name of the kmer PRG for that will be
kmer_prgs/01/prg1\ min_match=7\ max_nest=5.k15.w14.gfa
.With the changes in this PRG, the file will now be
kmer_prgs/01/prg1.k15.w14.gfa
.Added
is_closed
method onFastaqHandler
kseq.h
fromklib
Changed
get_id
toget_nth_read
as the original name confuses terminology (I normally consider id to be sequence identifier - which this function takes no note of).get_next
now throwsstd::out_of_range
if it is called after all reads have been read. The convention now is to use this function call in atry-catch
block. Code has been refactored so all current usage of this function is in one of these blocks.get_nth_read
receives an index greater than the number of reads in the file, it will throw anstd::out_of_range
exception now. This is to keep consistent with the new functionality ofget_next
Removed
skip_next
method onFastaqHandler
. This method was only being used byget_id
and all it is doing is callingget_next
. It seemed a bit useless and removing it didn't cause any tests to fail...print
methods infastaq_handler.cpp
that were not being used anywherecloses #198