Change fastx parser to klib/kseq #223

mbhall88 · 2020-08-24T05:40:54Z

We do not handle fastx files very robustly and allow invalid files. Additionally, as mentioned in #198, we don't correctly parse the sequence identifier. This PR switches fastx handling to (predominantly) klib. I have tried to keep API changes to FastaqHandler to a minimum.

Note: These changes may be incompatible with indices built with previous versions, because the kmer PRG file names change. For example, if the PRG that was indexed was

>prg1 min_match=7 max_nest=5
ACGT

then the kmer PRG was previously named kmer_prgs/01/prg1\ min_match=7\ max_nest=5.k15.w14.gfa.
With the changes in this PR, the file will now be kmer_prgs/01/prg1.k15.w14.gfa.
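
As a rough illustration of the naming change (not code from this PR), the sketch below assumes the sequence identifier is the header text up to the first whitespace, with the rest treated as a comment. `sequence_id` and `kmer_prg_name` are hypothetical helpers invented for this example, and the directory part of the path is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: return the header text up to the first space or tab,
// dropping a leading '>' (fasta) or '@' (fastq) if present.
std::string sequence_id(const std::string& header)
{
    const bool has_marker
        = !header.empty() && (header[0] == '>' || header[0] == '@');
    const std::size_t start = has_marker ? 1 : 0;
    const std::size_t end = header.find_first_of(" \t", start);
    if (end == std::string::npos) {
        return header.substr(start);
    }
    return header.substr(start, end - start);
}

// Hypothetical helper: build a file name in the style of the example above.
std::string kmer_prg_name(const std::string& id, int k, int w)
{
    return id + ".k" + std::to_string(k) + ".w" + std::to_string(w) + ".gfa";
}
```

With these helpers, `kmer_prg_name(sequence_id(">prg1 min_match=7 max_nest=5"), 15, 14)` gives `prg1.k15.w14.gfa`, whereas passing the whole header line through would keep the `min_match` and `max_nest` text in the file name.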

Added

is_closed method on FastaqHandler
kseq.h from klib
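
A minimal sketch of what an is_closed check can look like, assuming the handler tracks its own closed state and treats a second close as a no-op. `SketchHandler` is a hypothetical stand-in that wraps a plain `std::ifstream`; the real FastaqHandler reads through kseq, which is omitted here.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Hypothetical stand-in for a file handler that knows whether it is closed.
class SketchHandler {
public:
    explicit SketchHandler(const std::string& filepath)
        : filepath_(filepath) // store a copy rather than a reference
        , stream_(filepath)
        , closed_(!stream_.is_open())
    {
    }

    ~SketchHandler() { close(); }

    bool is_closed() const { return closed_; }

    // Closing an already-closed handler does nothing.
    void close()
    {
        if (closed_) {
            return;
        }
        stream_.close();
        closed_ = true;
    }

private:
    const std::string filepath_;
    std::ifstream stream_;
    bool closed_;
};
```

Keeping the flag inside the class means callers can ask about the state instead of tracking it themselves, and the destructor can call close() unconditionally.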

Changed

Some of our test fastq files were invalid (e.g. truncated quality strings) because we did not pay attention to quality strings when parsing. Since klib does check them, I have changed these files so that each quality string is the same length as its corresponding sequence.
Renamed get_id to get_nth_read, as the original name confuses terminology (I normally consider id to mean the sequence identifier, which this function takes no note of).
get_next now throws std::out_of_range if it is called after all reads have been read. The convention now is to call this function inside a try-catch block. The code has been refactored so that all current usages of this function are in such a block.
When get_nth_read receives an index greater than the number of reads in the file, it now throws a std::out_of_range exception. This keeps it consistent with the new behaviour of get_next.
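
The try-catch convention described above can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: `SketchReads` is a hypothetical in-memory stand-in for a read file, `count_reads` is a hypothetical caller, and the sketch assumes a 0-based index for get_nth_read.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory stand-in for a file of reads.
class SketchReads {
public:
    explicit SketchReads(std::vector<std::string> reads)
        : reads_(std::move(reads))
    {
    }

    // Return the next read; throw once every read has been consumed.
    const std::string& get_next()
    {
        if (next_ >= reads_.size()) {
            throw std::out_of_range("get_next called after the last read");
        }
        return reads_[next_++];
    }

    // Return the read at a 0-based index (an assumption of this sketch);
    // throw if the index is past the end.
    const std::string& get_nth_read(std::size_t n)
    {
        if (n >= reads_.size()) {
            throw std::out_of_range("read index is past the end of the file");
        }
        next_ = n + 1;
        return reads_[n];
    }

private:
    std::vector<std::string> reads_;
    std::size_t next_ = 0;
};

// Hypothetical caller: consume all remaining reads, treating the exception
// as the end-of-input signal.
std::size_t count_reads(SketchReads& source)
{
    std::size_t count = 0;
    try {
        while (true) {
            source.get_next();
            ++count;
        }
    } catch (const std::out_of_range&) {
        // No more reads.
    }
    return count;
}
```

Using the same exception type for both functions lets a caller handle "ran off the end" in one catch clause, whichever accessor was used.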

Removed

skip_next method on FastaqHandler. This method was only used by get_id, and all it did was call get_next. It seemed redundant, and removing it didn't cause any tests to fail.
two print methods in fastaq_handler.cpp that were not being used anywhere
some log messages that don't seem useful

closes #198

mbhall88 · 2020-08-25T07:02:03Z

I am currently running the tip of this branch on my TB data. Some samples have finished so I guess the changes haven't caused any major errors. Might be nice if you want to test it out on a sample you have some "expected" output for? Don't worry if not, I can run the current tip of dev on one of the samples tomorrow and see if the output is "the same" as that from this new klib version. I have a docker container built for the tip of this PR at docker://mbhall88/pandora:04beed5

leoisl

Thanks a lot for this, nice PR! I made some comments, feel totally free to accept or reject the proposed changes.

I now prefer this solution for fasta/q parsing over including SeqAn. It does the job very well, it is a lot more lightweight, and the tooling overhead is almost zero because kseq is a single-file, header-only library.

src/fastaq_handler.cpp

include/fastaq_handler.h

src/fastaq_handler.cpp

leoisl · 2020-08-25T16:13:52Z

> I am currently running the tip of this branch on my TB data. Some samples have finished so I guess the changes haven't caused any major errors. Might be nice if you want to test it out on a sample you have some "expected" output for? Don't worry if not, I can run the current tip of dev on one of the samples tomorrow and see if the output is "the same" as that from this new klib version. I have a docker container built for the tip of this PR at docker://mbhall88/pandora:04beed5

It is fine to push unstable versions to the dev branch, and the version we are using in the paper is properly tagged. Your test suite is also comprehensive, and if there is any outstanding issue with fasta/q parsing, your real data test should catch it. If you get the same results as the tip of dev then, judging by the code and the test suite, I personally feel the implementation is correct and should be merged.

mbhall88 · 2020-08-26T04:51:10Z

I have implemented the great suggestions from @leoisl and updated the PR "changelog" accordingly. Let me know if you see any issues with my changes.

leoisl

Just minor suggestions...
Did your real data test (i.e. comparing the output of the code in this branch with that of the dev branch) yield the same results? If so, I think the implementation is fine!

src/fastaq_handler.cpp

mbhall88 · 2020-08-27T03:22:39Z

> Did your real data test (i.e. comparing the output of the code in this branch with that of the dev branch) yield the same results? If so, I think the implementation is fine!

In the process of checking this now. I'll post when I have.

mbhall88 · 2020-08-27T04:10:23Z

Ok. The output is exactly the same as the tip of dev 🎉

mbhall88 and others added 18 commits August 21, 2020 09:56

update index usage in readme

b66aa27

update map usage in readme

74d9a19

correct filepath for estimate parameters test cases

b67fb84

fix incorrect relative test case directory path

67d0128

WIP: move fastq handling to kseq

da7f8aa

throw exception if id greater than num reads is given to get_id

303bf37

add is_closed method and fix test fastq

21cd913

Adding closed_status and is_closed() to FastaqHandler

dcf6dc1

Otherwise it won't compile

test_cases/ -> ../../test/test_cases/

b5a97f4

Adding an EOF test

569c849

Bugfix on FastaqHandler::eof()

cd757b8

Bugfix on FastaqHandler::get_next()

4b9274f

Merge pull request #2 from leoisl/klib_simplified

39f6e6b

Klib simplified

update ignores

1083033

remove gzipped member variable

78e3451

ensure file cannot be closed twice

86e92f0

remove skip_next function

610cc1a

remove unused includes and functions

04beed5

mbhall88 changed the title ~~Change fastq parser to klib (kseq)~~ Change fastx parser to klib/kseq Aug 25, 2020

mbhall88 marked this pull request as ready for review August 25, 2020 06:59

mbhall88 requested a review from leoisl August 25, 2020 06:59

leoisl requested changes Aug 25, 2020

mbhall88 added 4 commits August 26, 2020 11:45

copy filepath instead of using reference

19d25e0

throw exception when get_next after eof

8090a08

rename and refactor get_id

1198b0e

more robust fastx file closing

47b8e8c

mbhall88 requested a review from leoisl August 26, 2020 04:50

leoisl requested changes Aug 26, 2020

mbhall88 added 2 commits August 27, 2020 13:17

throw exceptions not other types

40b6d22

remove redundant try-catch

e276ac1

mbhall88 requested a review from leoisl August 27, 2020 03:22

leoisl approved these changes Aug 27, 2020

mbhall88 merged commit 25d0328 into iqbal-lab-org:dev Aug 28, 2020

mbhall88 mentioned this pull request Sep 21, 2020

Integrate SeqAn #222

Closed
