Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change fastx parser to klib/kseq #223

Merged
merged 24 commits into from
Aug 28, 2020
Merged
Show file tree
Hide file tree
Changes from 22 commits
Commits
Show all changes
24 commits
Select commit Hold shift + click to select a range
b66aa27
update index usage in readme
mbhall88 Aug 20, 2020
74d9a19
update map usage in readme
mbhall88 Aug 21, 2020
b67fb84
correct filepath for estimate parameters test cases
mbhall88 Aug 21, 2020
67d0128
fix incorrect relative test case directory path
mbhall88 Aug 21, 2020
da7f8aa
WIP: move fastq handling to kseq
mbhall88 Aug 21, 2020
303bf37
throw exception if id greater than num reads is given to get_id
mbhall88 Aug 24, 2020
21cd913
add is_closed method and fix test fastq
mbhall88 Aug 24, 2020
dcf6dc1
Adding closed_status and is_closed() to FastaqHandler
Aug 24, 2020
b5a97f4
test_cases/ -> ../../test/test_cases/
Aug 24, 2020
569c849
Adding an EOF test
Aug 24, 2020
cd757b8
Bugfix on FastaqHandler::eof()
Aug 24, 2020
4b9274f
Bugfix on FastaqHandler::get_next()
Aug 24, 2020
39f6e6b
Merge pull request #2 from leoisl/klib_simplified
mbhall88 Aug 25, 2020
1083033
update ignores
mbhall88 Aug 25, 2020
78e3451
remove gzipped member variable
mbhall88 Aug 25, 2020
86e92f0
ensure file cannot be closed twice
mbhall88 Aug 25, 2020
610cc1a
remove skip_next function
mbhall88 Aug 25, 2020
04beed5
remove unused includes and functions
mbhall88 Aug 25, 2020
19d25e0
copy filepath instead of using reference
mbhall88 Aug 26, 2020
8090a08
throw exception when get_next after eof
mbhall88 Aug 26, 2020
1198b0e
rename and refactor get_id
mbhall88 Aug 26, 2020
47b8e8c
more robust fastx file closing
mbhall88 Aug 26, 2020
40b6d22
throw exceptions not other types
mbhall88 Aug 27, 2020
e276ac1
remove redundant try-catch
mbhall88 Aug 27, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 16 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -79,3 +79,19 @@ docker/
mounted_folder/

docker_dev/
CMakeCache.txt
CMakeFiles
CTestTestfile.cmake
ext/gatb.tar.gz
gatb-prefix
ext/gatb-core-1.4.1
gtest
include/gatb
include/hdf5
include/json
include/BooPHF
pandora.cbp
test/cmake_install.cmake
test/estimate_parameters_test
test/pandora_test
coverage
65 changes: 40 additions & 25 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -99,40 +99,55 @@ If you haven't, you will need a multiple sequence alignment for each graph. Prec
### Build index
Takes a fasta-like file of PanRG sequences and constructs an index, and a directory of gfa files to be used by `pandora map` or `pandora compare`. These are output in the same directory as the PanRG file.

Usage: pandora index [options] <PanRG>
Options:
```
$ pandora index --help
Usage: pandora index [options] <prgs.fa>
Options:
-h,--help Show this help message
-w W Window size for (w,k)-minimizers, default 14
-k K K-mer size for (w,k)-minimizers, default 15
-t T Number of concurrent threads, default 1
-t T Number of threads, default 1
--offset Offset for PRG ids, default 0
--outfile Filename for index
--log_level debug,[info],warning,error
```

The index stores (w,k)-minimizers for each PanRG path found. These parameters can be specified, but default to w=14, k=15.

### Map reads to index
This takes a fasta/q of Nanopore or Illumina reads and compares to the index. It infers which of the PanRG genes/elements is present, and for those that are present it outputs the inferred sequence and a genotyped VCF.

Usage: pandora map -p PanRG_FILE -r READ_FILE -o OUTDIR <option(s)>
Options:
-h,--help Show this help message
-p,--prg_file PanRG_FILE Specify a fasta-style PanRG file
-r,--read_file READ_FILE Specify a file of reads in fasta/q format
-o,--outdir OUTDIR Specify directory of output
-w W Window size for (w,k)-minimizers, must be <=k, default 14
-k K K-mer size for (w,k)-minimizers, default 15
-m,--max_diff INT Maximum distance between consecutive hits within a cluster, default 250 bps
-e,--error_rate FLOAT Estimated error rate for reads, default 0.11/0.001 for Nanopore/Illumina
-c,--min_cluster_size INT Minimum number of hits in a cluster to consider a locus present, default 10
--genome_size NUM_BP Estimated length of genome, used for coverage estimation, default 5000000
--vcf_refs REF_FASTA A fasta file with an entry for each loci in the PanRG in order, giving
reference sequence to be used as VCF ref. Must have a perfect match to a
path in the graph and the same name as the locus in the graph.
--illumina Data is from Illumina, not Nanopore, so is shorter with low error rate
--bin Use binomial model for kmer coverages, default is negative binomial
--max_covg INT Maximum average coverage from reads to accept, default first 300
--genotype Output a genotyped VCF
--discover Add denovo discovery
--denovo_kmer_size INT Kmer size to use for denovo discovery, default 11
--log_level LEVEL Verbosity for logging, use "debug" for more output
```
$ pandora map --help
Usage: pandora map -p PRG_FILE -r READ_FILE <option(s)>
Options:
-h,--help Show this help message
-p,--prg_file PRG_FILE Specify a fasta-style prg file
-r,--read_file READ_FILE Specify a file of reads in fasta format
-o,--outdir OUTDIR Specify directory of output
-w W Window size for (w,k)-minimizers, must be <=k, default 14
-k K K-mer size for (w,k)-minimizers, default 15
-m,--max_diff INT Maximum distance between consecutive hits within a cluster, default 500 (bps)
-e,--error_rate FLOAT Estimated error rate for reads, default 0.11
-t T Number of threads, default 1
--genome_size NUM_BP Estimated length of genome, used for coverage estimation
--output_kg Save kmer graphs with fwd and rev coverage annotations for found localPRGs
--output_vcf Save a vcf file for each found localPRG
--vcf_refs REF_FASTA A fasta file with an entry for each LocalPRG giving reference sequence for
VCF. Must have a perfect match in the graph and the same name as the graph
--output_comparison_paths Save a fasta file for a random selection of paths through localPRG
--output_covgs Save a file of covgs for each localPRG present, one number per base of fasta file
--output_mapped_read_fa Save a file for each gene containing read parts which overlapped it
--illumina Data is from illumina rather than nanopore, so is shorter with low error rate
--clean Add a step to clean and detangle the pangraph
--bin Use binomial model for kmer coverages, default is negative binomial
--max_covg Maximum average coverage from reads to accept
--genotype MODE Add extra step to carefully genotype sites. Has two modes: global (ML path oriented) or local (coverage oriented)
--snps_only When genotyping, include only snp sites
-d,--discover Add denovo discovery
--denovo_kmer_size Kmer size to use for denovo discovery
--log_level debug,[info],warning,error
```

### Compare reads from several samples
This takes Nanopore or Illumina read fasta/q for a number of samples, mapping each to the index. It infers which of the PanRG genes/elements is present in each sample, and outputs a presence/absence pangenome matrix, the inferred sequences for each sample and a genotyped multisample pangenome VCF.
Expand Down
28 changes: 18 additions & 10 deletions include/fastaq_handler.h
Original file line number Diff line number Diff line change
Expand Up @@ -4,36 +4,44 @@
#include <string>
#include <cstdint>
#include <fstream>
#include <sstream>
#include <exception>
#include <boost/iostreams/filtering_streambuf.hpp>
#include <boost/log/core.hpp>
#include <boost/log/trivial.hpp>
#include <boost/log/expressions.hpp>
#include <cstdio>
#include <zlib.h>
#include "kseq.h"
KSEQ_INIT(gzFile, gzread)

namespace logging = boost::log;

struct FastaqHandler {
bool gzipped;
std::ifstream fastaq_file;
boost::iostreams::filtering_istreambuf inbuf;
std::istream instream;
std::string line;
private:
bool closed;
kseq_t* inbuf;

public:
const std::string filepath;
gzFile fastaq_file;
std::string name;
std::string read;
uint32_t num_reads_parsed;

FastaqHandler(const std::string&);
FastaqHandler(const std::string);

~FastaqHandler();

bool eof();
bool eof() const;

void get_next();

void skip_next();

void get_id(const uint32_t&);
void get_nth_read(const uint32_t& idx);

void close();

bool is_closed() const;
};

#endif
242 changes: 242 additions & 0 deletions include/kseq.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,242 @@
/* The MIT License

Copyright (c) 2008, 2009, 2011 Attractive Chaos <attractor@live.co.uk>

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
*/

/* Last Modified: 05MAR2012 */

#ifndef AC_KSEQ_H
#define AC_KSEQ_H

#include <ctype.h>
#include <string.h>
#include <stdlib.h>

#define KS_SEP_SPACE 0 // isspace(): \t, \n, \v, \f, \r
#define KS_SEP_TAB 1 // isspace() && !' '
#define KS_SEP_LINE 2 // line separator: "\n" (Unix) or "\r\n" (Windows)
#define KS_SEP_MAX 2

#define __KS_TYPE(type_t) \
typedef struct __kstream_t { \
unsigned char *buf; \
int begin, end, is_eof; \
type_t f; \
} kstream_t;

#define ks_err(ks) ((ks)->end == -1)
#define ks_eof(ks) ((ks)->is_eof && (ks)->begin >= (ks)->end)
#define ks_rewind(ks) ((ks)->is_eof = (ks)->begin = (ks)->end = 0)

#define __KS_BASIC(type_t, __bufsize) \
static inline kstream_t *ks_init(type_t f) \
{ \
kstream_t *ks = (kstream_t*)calloc(1, sizeof(kstream_t)); \
ks->f = f; \
ks->buf = (unsigned char*)malloc(__bufsize); \
return ks; \
} \
static inline void ks_destroy(kstream_t *ks) \
{ \
if (ks) { \
free(ks->buf); \
free(ks); \
} \
}

#define __KS_GETC(__read, __bufsize) \
static inline int ks_getc(kstream_t *ks) \
{ \
if (ks_err(ks)) return -3; \
if (ks->is_eof && ks->begin >= ks->end) return -1; \
if (ks->begin >= ks->end) { \
ks->begin = 0; \
ks->end = __read(ks->f, ks->buf, __bufsize); \
if (ks->end == 0) { ks->is_eof = 1; return -1;} \
if (ks->end == -1) { ks->is_eof = 1; return -3;}\
} \
return (int)ks->buf[ks->begin++]; \
}

#ifndef KSTRING_T
#define KSTRING_T kstring_t
typedef struct __kstring_t {
size_t l, m;
char *s;
} kstring_t;
#endif

#ifndef kroundup32
#define kroundup32(x) (--(x), (x)|=(x)>>1, (x)|=(x)>>2, (x)|=(x)>>4, (x)|=(x)>>8, (x)|=(x)>>16, ++(x))
#endif

#define __KS_GETUNTIL(__read, __bufsize) \
static int ks_getuntil2(kstream_t *ks, int delimiter, kstring_t *str, int *dret, int append) \
{ \
int gotany = 0; \
if (dret) *dret = 0; \
str->l = append? str->l : 0; \
for (;;) { \
int i; \
if (ks_err(ks)) return -3; \
if (ks->begin >= ks->end) { \
if (!ks->is_eof) { \
ks->begin = 0; \
ks->end = __read(ks->f, ks->buf, __bufsize); \
if (ks->end == 0) { ks->is_eof = 1; break; } \
if (ks->end == -1) { ks->is_eof = 1; return -3; } \
} else break; \
} \
if (delimiter == KS_SEP_LINE) { \
for (i = ks->begin; i < ks->end; ++i) \
if (ks->buf[i] == '\n') break; \
} else if (delimiter > KS_SEP_MAX) { \
for (i = ks->begin; i < ks->end; ++i) \
if (ks->buf[i] == delimiter) break; \
} else if (delimiter == KS_SEP_SPACE) { \
for (i = ks->begin; i < ks->end; ++i) \
if (isspace(ks->buf[i])) break; \
} else if (delimiter == KS_SEP_TAB) { \
for (i = ks->begin; i < ks->end; ++i) \
if (isspace(ks->buf[i]) && ks->buf[i] != ' ') break; \
} else i = 0; /* never come to here! */ \
if (str->m - str->l < (size_t)(i - ks->begin + 1)) { \
str->m = str->l + (i - ks->begin) + 1; \
kroundup32(str->m); \
str->s = (char*)realloc(str->s, str->m); \
} \
gotany = 1; \
memcpy(str->s + str->l, ks->buf + ks->begin, i - ks->begin); \
str->l = str->l + (i - ks->begin); \
ks->begin = i + 1; \
if (i < ks->end) { \
if (dret) *dret = ks->buf[i]; \
break; \
} \
} \
if (!gotany && ks_eof(ks)) return -1; \
if (str->s == 0) { \
str->m = 1; \
str->s = (char*)calloc(1, 1); \
} else if (delimiter == KS_SEP_LINE && str->l > 1 && str->s[str->l-1] == '\r') --str->l; \
str->s[str->l] = '\0'; \
return str->l; \
} \
static inline int ks_getuntil(kstream_t *ks, int delimiter, kstring_t *str, int *dret) \
{ return ks_getuntil2(ks, delimiter, str, dret, 0); }

#define KSTREAM_INIT(type_t, __read, __bufsize) \
__KS_TYPE(type_t) \
__KS_BASIC(type_t, __bufsize) \
__KS_GETC(__read, __bufsize) \
__KS_GETUNTIL(__read, __bufsize)

#define kseq_rewind(ks) ((ks)->last_char = (ks)->f->is_eof = (ks)->f->begin = (ks)->f->end = 0)

#define __KSEQ_BASIC(SCOPE, type_t) \
SCOPE kseq_t *kseq_init(type_t fd) \
{ \
kseq_t *s = (kseq_t*)calloc(1, sizeof(kseq_t)); \
s->f = ks_init(fd); \
return s; \
} \
SCOPE void kseq_destroy(kseq_t *ks) \
{ \
if (!ks) return; \
free(ks->name.s); free(ks->comment.s); free(ks->seq.s); free(ks->qual.s); \
ks_destroy(ks->f); \
free(ks); \
}

/* Return value:
>=0 length of the sequence (normal)
-1 end-of-file
-2 truncated quality string
-3 error reading stream
*/
#define __KSEQ_READ(SCOPE) \
SCOPE int kseq_read(kseq_t *seq) \
{ \
int c,r; \
kstream_t *ks = seq->f; \
if (seq->last_char == 0) { /* then jump to the next header line */ \
while ((c = ks_getc(ks)) >= 0 && c != '>' && c != '@'); \
if (c < 0) return c; /* end of file or error*/ \
seq->last_char = c; \
} /* else: the first header char has been read in the previous call */ \
seq->comment.l = seq->seq.l = seq->qual.l = 0; /* reset all members */ \
if ((r=ks_getuntil(ks, 0, &seq->name, &c)) < 0) return r; /* normal exit: EOF or error */ \
if (c != '\n') ks_getuntil(ks, KS_SEP_LINE, &seq->comment, 0); /* read FASTA/Q comment */ \
if (seq->seq.s == 0) { /* we can do this in the loop below, but that is slower */ \
seq->seq.m = 256; \
seq->seq.s = (char*)malloc(seq->seq.m); \
} \
while ((c = ks_getc(ks)) >= 0 && c != '>' && c != '+' && c != '@') { \
if (c == '\n') continue; /* skip empty lines */ \
seq->seq.s[seq->seq.l++] = c; /* this is safe: we always have enough space for 1 char */ \
ks_getuntil2(ks, KS_SEP_LINE, &seq->seq, 0, 1); /* read the rest of the line */ \
} \
if (c == '>' || c == '@') seq->last_char = c; /* the first header char has been read */ \
if (seq->seq.l + 1 >= seq->seq.m) { /* seq->seq.s[seq->seq.l] below may be out of boundary */ \
seq->seq.m = seq->seq.l + 2; \
kroundup32(seq->seq.m); /* rounded to the next closest 2^k */ \
seq->seq.s = (char*)realloc(seq->seq.s, seq->seq.m); \
} \
seq->seq.s[seq->seq.l] = 0; /* null terminated string */ \
if (c != '+') return seq->seq.l; /* FASTA */ \
if (seq->qual.m < seq->seq.m) { /* allocate memory for qual in case insufficient */ \
seq->qual.m = seq->seq.m; \
seq->qual.s = (char*)realloc(seq->qual.s, seq->qual.m); \
} \
while ((c = ks_getc(ks)) >= 0 && c != '\n'); /* skip the rest of '+' line */ \
if (c == -1) return -2; /* error: no quality string */ \
while ((c = ks_getuntil2(ks, KS_SEP_LINE, &seq->qual, 0, 1) >= 0 && seq->qual.l < seq->seq.l)); \
if (c == -3) return -3; /* stream error */ \
seq->last_char = 0; /* we have not come to the next header line */ \
if (seq->seq.l != seq->qual.l) return -2; /* error: qual string is of a different length */ \
return seq->seq.l; \
}

#define __KSEQ_TYPE(type_t) \
typedef struct { \
kstring_t name, comment, seq, qual; \
int last_char; \
kstream_t *f; \
} kseq_t;

#define KSEQ_INIT2(SCOPE, type_t, __read) \
KSTREAM_INIT(type_t, __read, 16384) \
__KSEQ_TYPE(type_t) \
__KSEQ_BASIC(SCOPE, type_t) \
__KSEQ_READ(SCOPE)

#define KSEQ_INIT(type_t, __read) KSEQ_INIT2(static, type_t, __read)

#define KSEQ_DECLARE(type_t) \
__KS_TYPE(type_t) \
__KSEQ_TYPE(type_t) \
extern kseq_t *kseq_init(type_t fd); \
void kseq_destroy(kseq_t *ks); \
int kseq_read(kseq_t *seq);

#endif
Loading