Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change fastx parser to klib/kseq #223

Merged
merged 24 commits into from
Aug 28, 2020
Merged

Change fastx parser to klib/kseq #223

merged 24 commits into from
Aug 28, 2020

Conversation

mbhall88
Copy link
Member

@mbhall88 mbhall88 commented Aug 24, 2020

We do not handle fastx files very robustly and allow invalid files. Additionally, as mentioned in #198, we don't correctly parse the sequence identifier. This PR aims to switch fastq handling to (predominantly) klib. As much as possible, I have tried to keep API changes to FastaqHandler as minimal as possible.

Note: These changes may be incompatible with indices built with previous versions. The reason for this is that if the PRG that was indexed was

>prg1 min_match=7 max_nest=5
ACGT

then the name of the kmer PRG for that will be kmer_prgs/01/prg1\ min_match=7\ max_nest=5.k15.w14.gfa.
With the changes in this PRG, the file will now be kmer_prgs/01/prg1.k15.w14.gfa.

Added

  • is_closed method on FastaqHandler
  • kseq.h from klib

Changed

  • Some of our test fastq files are invalid (i.e. truncated quality strings etc.) as we don't pay attention to quality strings when parsing. As klib does, I have changed it so that these files have quality strings the same length as their corresponding sequence.
  • renamed get_id to get_nth_read as the original name confuses terminology (I normally consider id to be sequence identifier - which this function takes no note of).
  • get_next now throws std::out_of_range if it is called after all reads have been read. The convention now is to use this function call in a try-catch block. Code has been refactored so all current usage of this function is in one of these blocks.
  • When get_nth_read receives an index greater than the number of reads in the file, it will throw an std::out_of_range exception now. This is to keep consistent with the new functionality of get_next

Removed

  • skip_next method on FastaqHandler. This method was only being used by get_id and all it is doing is calling get_next. It seemed a bit useless and removing it didn't cause any tests to fail...
  • two print methods in fastaq_handler.cpp that were not being used anywhere
  • some log messages that don't seem useful

closes #198

@mbhall88 mbhall88 changed the title Change fastq parser to klib (kseq) Change fastx parser to klib/kseq Aug 25, 2020
@mbhall88 mbhall88 marked this pull request as ready for review August 25, 2020 06:59
@mbhall88 mbhall88 requested a review from leoisl August 25, 2020 06:59
@mbhall88
Copy link
Member Author

mbhall88 commented Aug 25, 2020

I am currently running the tip of this branch on my TB data. Some samples have finished so I guess the changes haven't caused any major errors. Might be nice if you want to test it out on a sample you have some "expected" output for? Don't worry if not, I can run the current tip of dev on one of the samples tomorrow and see if the output is "the same" as that from this new klib version. I have a docker container built for the tip of this PR at docker://mbhall88/pandora:04beed5

Copy link
Collaborator

@leoisl leoisl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for this, nice PR! I made some comments, feel totally free to accept or reject the proposed changes.

I now prefer this solution to fasta/q parsing than including Seqan. It does the job very well, and it is a lot more lightweight, and the tooling overhead is almost 0, due to kseq being a single-file header-only library.

src/fastaq_handler.cpp Outdated Show resolved Hide resolved
include/fastaq_handler.h Outdated Show resolved Hide resolved
include/fastaq_handler.h Outdated Show resolved Hide resolved
src/fastaq_handler.cpp Outdated Show resolved Hide resolved
src/fastaq_handler.cpp Outdated Show resolved Hide resolved
@leoisl
Copy link
Collaborator

leoisl commented Aug 25, 2020

I am currently running the tip of this branch on my TB data. Some samples have finished so I guess the changes haven't caused any major errors. Might be nice if you want to test it out on a sample you have some "expected" output for? Don't worry if not, I can run the current tip of dev on one of the samples tomorrow and see if the output is "the same" as that from this new klib version. I have a docker container built for the tip of this PR at docker://mbhall88/pandora:04beed5

It is fine to push unstable versions to dev branch, and the version we are using in the paper is properly tagged. Your test suite is also comprehensive. And if there is any outstanding issue with fasta/q parsing, your real data test should catch it... If you get the same results as in the tip of dev, looking at the code and the test suite, I personally feel like the implementation is correct, and should be merged

@mbhall88 mbhall88 requested a review from leoisl August 26, 2020 04:50
@mbhall88
Copy link
Member Author

mbhall88 commented Aug 26, 2020

I have implemented the great suggestions @leoisl and updated the PR "changelog" accordingly. Let me know if you see any issues with my changes.

Copy link
Collaborator

@leoisl leoisl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just minor suggestions...
Did your real data test (i.e. getting the output of the code in this branch vs dev branch) yielded the same results? If so, I think implementation is fine!

src/fastaq_handler.cpp Outdated Show resolved Hide resolved
src/fastaq_handler.cpp Outdated Show resolved Hide resolved
src/fastaq_handler.cpp Outdated Show resolved Hide resolved
src/fastaq_handler.cpp Outdated Show resolved Hide resolved
@mbhall88 mbhall88 requested a review from leoisl August 27, 2020 03:22
@mbhall88
Copy link
Member Author

Did your real data test (i.e. getting the output of the code in this branch vs dev branch) yielded the same results? If so, I think implementation is fine!

In the process of checking this now. I'll post when I have.

@mbhall88
Copy link
Member Author

Ok. The output it exactly the same as the tip of dev 🎉

@mbhall88 mbhall88 merged commit 25d0328 into iqbal-lab-org:dev Aug 28, 2020
@mbhall88 mbhall88 mentioned this pull request Sep 21, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants