
sebeaumont/fruit


Project for general text/ngram learning tools

TODO

  • Use sqlite to load/store data rather than good ol' stdio

  • Refactor

    • json -> sqlite
    • tokenizer -> sqlite
    • frame/ngram picker (use term keys in sqlite)
  • Embedding

  • Clustering

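A possible sqlite schema for the refactor above, with term keys usable by the frame/ngram picker. Table and column names here are illustrative placeholders, not settled design:

```sql
-- Hypothetical schema sketch: terms get integer keys so frames/ngrams
-- can reference them compactly (all names are placeholders).
CREATE TABLE term (
  id   INTEGER PRIMARY KEY,
  text TEXT NOT NULL UNIQUE,
  freq INTEGER NOT NULL DEFAULT 0
);

CREATE TABLE document (
  id   INTEGER PRIMARY KEY,
  json TEXT NOT NULL              -- raw JSON line, pending full migration
);

CREATE TABLE frame (
  doc_id  INTEGER NOT NULL REFERENCES document(id),
  pos     INTEGER NOT NULL,       -- token position within the document
  term_id INTEGER NOT NULL REFERENCES term(id),
  PRIMARY KEY (doc_id, pos)
);
```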
Deep Learning using Grenade etc.

This is now on hold until the above is re-worked

  • Input docs as JSON lines

    • lazy stream, or one line/doc at a time? The latter is done.
    • check the performance of strictness annotations in the document record etc.
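A minimal sketch of the one-line-per-document loop. `parseDoc` and the `Document` record are hypothetical stand-ins for the real JSON decoding (e.g. via aeson); the strict fields are the annotations the performance question is about:

```haskell
module Main where

import Data.Maybe (mapMaybe)

-- Hypothetical strict document record; the bangs are the strictness
-- annotations whose cost/benefit the TODO asks about.
data Document = Document { docId :: !Int, docText :: !String }
  deriving (Show, Eq)

-- Placeholder parser: real code would decode one JSON object per line.
parseDoc :: String -> Maybe Document
parseDoc ln = case words ln of
  (i:ws) | [(n, "")] <- reads i -> Just (Document n (unwords ws))
  _                             -> Nothing

-- 'lines' over lazily-read input yields one document at a time
-- without holding the whole file in memory.
main :: IO ()
main = interact (unlines . map show . mapMaybe parseDoc . lines)
```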
  • tokenize

    • capitalization (char-RNN annotation trick?), punctuation, numbers
    • light touch done; going with a word-RNN for now.
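The light-touch pass might look like this (a sketch, not the project's actual tokenizer): fold case, split on whitespace, and trim surrounding punctuation while keeping word-internal marks:

```haskell
import Data.Char (isAlphaNum, toLower)
import Data.List (dropWhileEnd)

-- Light-touch tokenizer: lowercase, split on whitespace, strip leading
-- and trailing punctuation but keep internal marks ("don't", "co-op").
-- Numbers survive as tokens; a later pass could fold them to a class.
tokenize :: String -> [String]
tokenize = filter (not . null) . map trim . words . map toLower
  where
    trim = dropWhile (not . isAlphaNum) . dropWhileEnd (not . isAlphaNum)
```

For example, `tokenize "Hello, World! It's 1599."` yields `["hello","world","it's","1599"]`.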
  • Frames TODO

    • output directly to frametrain if using SDRs pro tem; re-engineer later.
    • What does text2vec do?
  • EMBEDDINGS text2vec vs. SDR

    • why is text2vec hand-wavy about semantic locality?
    • SDR semantics are clear => use SDRs, downsampled or kept as a sparse representation, since the full input space is very large.
      • TODO: compare metrics with downsampled vectors, and space requirements w.r.t. a typical one-hot encoding.
      • Can we form a relational category with a partial order over the distance/overlap metric?
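A sketch of the overlap metric, representing an SDR as the set of its active bit indices (an assumption for illustration; the project's real representation may differ):

```haskell
import qualified Data.Set as Set

-- An SDR stored as the set of active bit indices: compact even when
-- the full binary vector is huge, which is the point of sparse reps.
type SDR = Set.Set Int

-- Raw overlap: number of shared active bits (higher = more similar).
overlap :: SDR -> SDR -> Int
overlap a b = Set.size (Set.intersection a b)

-- Normalised (Jaccard-style) overlap in [0,1]; one candidate for the
-- distance/overlap metric behind the partial-order question above.
similarity :: SDR -> SDR -> Double
similarity a b
  | Set.null a && Set.null b = 1
  | otherwise =
      fromIntegral (overlap a b) / fromIntegral (Set.size (Set.union a b))
```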
  • TERMS rare vs. common words: filtering/folding?
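One way to filter out both rare and overly common terms is a simple corpus-frequency cutoff; a sketch with made-up thresholds:

```haskell
import qualified Data.Map.Strict as Map

-- Keep only terms whose corpus frequency falls in [lo, hi]: rare terms
-- are mostly noise, and very common ones carry little signal.
filterTerms :: Int -> Int -> [String] -> [String]
filterTerms lo hi ts = filter keep ts
  where
    freq   = Map.fromListWith (+) [(t, 1 :: Int) | t <- ts]
    keep t = maybe False (\n -> n >= lo && n <= hi) (Map.lookup t freq)
```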

  • LSTM word sequence model

    • char-RNN vs. word-RNN: word on the street is word-RNN for performance and profit.
    • training time is a wee bit pedestrian on e.g. the Shakespeare corpus, but we only do that once.
    • could just chuck some other narrative at it and see what we get; it's a good way of evidencing the language model.
    • will need to mess with hyper-parameters and network design, of course.
