-
Use sqlite to load/store data rather than good ol' stdio (sketch below)
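- a minimal sketch of the switch, assuming the sqlite-simple package; the docs table and its columns are placeholders:

    {-# LANGUAGE OverloadedStrings #-}
    module Store where

    import Database.SQLite.Simple

    -- Open (or create) the store and make sure the placeholder table exists.
    openStore :: FilePath -> IO Connection
    openStore path = do
      conn <- open path
      execute_ conn "CREATE TABLE IF NOT EXISTS docs (id INTEGER PRIMARY KEY, body TEXT)"
      pure conn

    -- The old stdio dump becomes an insert per document.
    storeDoc :: Connection -> String -> IO ()
    storeDoc conn body = execute conn "INSERT INTO docs (body) VALUES (?)" (Only body)

    -- The old stdio read becomes a query.
    loadDocs :: Connection -> IO [(Int, String)]
    loadDocs conn = query_ conn "SELECT id, body FROM docs"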
-
Refactor
- json -> sqlite
- tokenizer -> sqlite
- frame/ngram picker (use term keys in sqlite)
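- a sketch of the term-key interning the picker would sit on, again assuming sqlite-simple; this schema is a guess, not the final design:

    {-# LANGUAGE OverloadedStrings #-}
    module TermKeys where

    import Database.SQLite.Simple

    setupTerms :: Connection -> IO ()
    setupTerms conn = execute_ conn
      "CREATE TABLE IF NOT EXISTS terms (id INTEGER PRIMARY KEY, term TEXT UNIQUE)"

    -- Intern a term: insert if unseen, then return its integer key, so
    -- frames/ngrams can be stored and compared as key tuples.
    termKey :: Connection -> String -> IO Int
    termKey conn term = do
      execute conn "INSERT OR IGNORE INTO terms (term) VALUES (?)" (Only term)
      [Only k] <- query conn "SELECT id FROM terms WHERE term = ?" (Only term)
      pure k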
-
Embedding
-
Clustering
This is now on hold until the above is re-worked
-
Input docs as JSON lines
- lazy stream, or one line/doc at a time? The latter is done.
- check the performance impact of strictness annotations in the document record etc. (sketch below)
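- a sketch of the one-doc-per-line reader with strict fields, assuming aeson and bytestring; the Document shape is a stand-in:

    {-# LANGUAGE DeriveGeneric #-}
    module Input where

    import Data.Aeson (FromJSON, eitherDecodeStrict)
    import qualified Data.ByteString.Char8 as B
    import GHC.Generics (Generic)

    -- The bangs are the strictness annotations whose cost is to be measured.
    data Document = Document
      { docId   :: !Int
      , docText :: !String
      } deriving (Show, Generic)

    instance FromJSON Document

    -- One line = one JSON document, decoded eagerly rather than lazily streamed.
    readDocs :: FilePath -> IO [Either String Document]
    readDocs path = map eitherDecodeStrict . B.lines <$> B.readFile path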
-
tokenize
- capitalization (char-rnn annotation trick?), punctuation, numbers
- light touch done; going with word-RNN for now (sketch below)
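- a sketch of the light-touch pass plus the annotation trick under consideration; the <cap> marker token is a placeholder:

    module Tokenize where

    import Data.Char (isAlphaNum, isSpace, isUpper, toLower)

    -- Runs of word characters stay together, each punctuation character
    -- becomes its own token, and whitespace is dropped.
    tokenize :: String -> [String]
    tokenize [] = []
    tokenize s@(c:cs)
      | isSpace c    = tokenize cs
      | isAlphaNum c = let (w, rest) = span isAlphaNum s in w : tokenize rest
      | otherwise    = [c] : tokenize cs

    -- The char-rnn-style annotation trick: fold case but emit a marker so
    -- the word-RNN can still learn capitalization.
    foldCase :: String -> [String]
    foldCase w@(c:_) | isUpper c = ["<cap>", map toLower w]
    foldCase w = [w]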
-
Frames TODO
- output directly to frametrain if using SDRs pro tem; re-engineer later.
-
EMBEDDINGS text2vec vs. SDR
- why is text2vec so hand-wavy about semantic locality?
- what does text2vec do?
- SDRs are clear => use SDRs, downsampled or as a sparse rep, since they have a very big input space.
- TODO: compare metrics with downsampled vectors, and space requirements w.r.t. a typical one-hot encoding (sketch below).
- Can we form a relational category with a partial order over the distance/overlap metric?
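- a sketch of the overlap metric, naive downsampling, and one candidate partial order, assuming an SDR is represented as the set of its active bit indices:

    module SDR where

    import qualified Data.IntSet as IS

    type SDR = IS.IntSet

    -- Overlap = number of shared active bits.
    overlap :: SDR -> SDR -> Int
    overlap a b = IS.size (IS.intersection a b)

    -- Naive downsample into a smaller index space; collisions fold bits
    -- together, which is exactly what the metric comparison should measure.
    downsample :: Int -> SDR -> SDR
    downsample n = IS.map (`mod` n)

    -- One candidate partial order: containment of active-bit sets.
    subsumes :: SDR -> SDR -> Bool
    subsumes = IS.isSubsetOf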
-
TERMS rare vs. common words: filtering/folding? (sketch below)
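- a sketch of count-based filtering/folding; the thresholds and the <unk> token are placeholders:

    module Terms where

    import qualified Data.Map.Strict as M

    counts :: [String] -> M.Map String Int
    counts ts = M.fromListWith (+) [(t, 1) | t <- ts]

    -- Fold rare terms to <unk>, drop very common (stopword-ish) ones.
    filterFold :: Int -> Int -> M.Map String Int -> [String] -> [String]
    filterFold lo hi cs = concatMap f
      where
        f t = case M.findWithDefault 0 t cs of
                n | n < lo    -> ["<unk>"]
                  | n > hi    -> []
                  | otherwise -> [t]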
-
LSTM word sequence model
- char-RNN vs. word-RNN: word on the street is that word-level wins for performance and profit (prep sketch below).
- training time is a wee bit pedestrian on e.g. the Shakespeare corpus, but we only do that once.
- could just chuck some other narrative at it and see what we get; it's a good way of evidencing the language model.
- will need to mess with hyper-parameters and network design of course.
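- a sketch of the data prep the word-level choice implies (vocab ids and next-word pairs); the model itself is out of scope here:

    module WordSeq where

    import qualified Data.Map.Strict as M
    import qualified Data.Set as S

    -- Assign each distinct word an integer id for the word-level model.
    vocab :: [String] -> M.Map String Int
    vocab ws = M.fromList (zip (S.toList (S.fromList ws)) [0 ..])

    -- Next-word training pairs over the id stream.
    pairs :: [Int] -> [(Int, Int)]
    pairs ids = zip ids (drop 1 ids)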