Xav edited this page Jul 22, 2015 · 25 revisions

v0.0.23

  • Improved interrogative type detection (fixed some test cases, added new ones)
  • Numeric tokens now provide a value attribute holding the actual value of the number, typed as a JavaScript number
  • Fix singular attributes for noun token not being set
  • Detectors can now be executed before dependency parsing
    • compendium.detectors.add is deprecated in favor of compendium.detectors.after, will be removed in v1.0.0
    • compendium.detectors.after registers detectors that will be executed after dependency parsing
    • compendium.detectors.before registers detectors that will be executed before dependency parsing
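
The before/after split can be pictured with a small registry. This is only a sketch and not compendium's source: `createDetectorRegistry`, `runPipeline`, the `run` method and the detector callback signature are hypothetical names chosen for illustration. It assumes detectors are plain functions, shows the two phases around a stand-in parsing step, and treats `add` as a deprecated alias for `after`.

```javascript
// Hypothetical sketch: NOT compendium's actual source or API signatures.
function createDetectorRegistry() {
  const phases = { before: [], after: [] };
  return {
    // Register a detector that runs before dependency parsing.
    before(fn) { phases.before.push(fn); },
    // Register a detector that runs after dependency parsing.
    after(fn) { phases.after.push(fn); },
    // Deprecated alias kept for backward compatibility; forwards to after().
    add(fn) { this.after(fn); },
    // Run every detector of the given phase against the sentence.
    run(phase, sentence) { phases[phase].forEach((fn) => fn(sentence)); },
  };
}

// Stand-in pipeline: "before" detectors, then parsing, then "after" detectors.
function runPipeline(registry, sentence, parseDependencies) {
  registry.run("before", sentence);
  parseDependencies(sentence);
  registry.run("after", sentence);
  return sentence;
}

const registry = createDetectorRegistry();
const log = [];
registry.before(() => log.push("before"));
registry.add(() => log.push("after (via deprecated add)"));
runPipeline(registry, { tokens: [] }, () => log.push("parse"));
console.log(log.join(" -> "));
// prints: before -> parse -> after (via deprecated add)
```

Keeping `add` as a thin forwarder means existing callers keep working until the alias is removed.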

v0.0.22

  • Better like token handling: transform into preposition when possible (I like that vs It's like that)
  • Better have token handling: rarely a noun
  • Roman numerals handling (Chapter IV, Henri III)
  • Improved Named Entity Recognition (more patterns such as IO2009, CamelCased Inc., Henry III...)
  • Bug fixes
    • Avoid duplicate items in lexicon (leading to wrong PoS tagging and sentiment analysis)
    • Avoid raw tokens being normalized
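
The changelog does not say how Roman numerals are recognised or valued. As one possible approach, the sketch below validates a candidate token against a strict pattern and then sums the symbol values. `romanToInt`, `ROMAN_VALUES` and `ROMAN_PATTERN` are hypothetical names, not part of compendium.

```javascript
// Hypothetical helper: returns the numeric value of a Roman numeral,
// or null when the text is not a well-formed numeral (1 to 3999).
const ROMAN_VALUES = { I: 1, V: 5, X: 10, L: 50, C: 100, D: 500, M: 1000 };
const ROMAN_PATTERN = /^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$/;

function romanToInt(text) {
  // The pattern alone would accept the empty string, so reject it explicitly.
  if (text.length === 0 || !ROMAN_PATTERN.test(text)) return null;
  let total = 0;
  for (let i = 0; i < text.length; i++) {
    const current = ROMAN_VALUES[text[i]];
    const next = ROMAN_VALUES[text[i + 1]] || 0;
    // A smaller symbol placed before a larger one is subtracted (IV = 4).
    total += current < next ? -current : current;
  }
  return total;
}

console.log(romanToInt("IV"), romanToInt("III")); // prints: 4 3
```

A strict pattern keeps ordinary capitalised words from being read as numerals, although single letters such as "I" remain ambiguous and would need context to resolve.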

v0.0.21

  • Fix missing infinitives for some verb tokens
  • Add tense attribute to verb tokens

v0.0.20

  • Fix #3: raw field is a reconstruction of the sentence, not the actual raw string. Fixed by providing the real raw string.
  • Scaffold some code for multilingual use of compendium (for now, one build per language)
    • Add post processors to lexer for language-specific tokens handling
    • Reorganize sources to have a clean multilingual directory structure (to be continued)
    • Create gulp build tasks for French language
    • Add initial tests for French language

v0.0.19

  • Minor improvements of English dependency parsing with new tests
  • Minor improvements of profiling
  • NER fixes

v0.0.18

  • Fix infinite loop in dependency parsing

v0.0.17

  • New dependency parsing rules and tests
  • New PoS rules
  • Fix some token sentiment scores being skipped when building lexicon
  • Add experimental dependency-based sentiment score propagation
  • Allow lexicon sentiment scores to be floats

v0.0.16

  • Sentiment analysis: better "mixed" tagging by comparing amplitude to score in the case of low score + medium amplitude
  • Better handling of quotes (lexer, PoS)
  • Slight cleanup of some lexicon symbols

v0.0.15

  • Sentence types: add refusal type
  • Slight refactoring of negation detection (negation is extended to the negation mark's master verb)

v0.0.14

  • Remove cleaner step (replaced by synonyms handler)
  • Sentence types: add approval type
  • Dependency parsing: add new governors ranks
  • Token attributes: add is_punc attribute
  • Add new Brill rules (+0.1% on Penn Treebank)
  • Statistics: add words stat (number of actual words in a sentence: tokens length - punc, emots...)
  • New tests + some tests refactoring
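
The words stat described above (token count minus punctuation, emoticons and the like) can be sketched as a simple filter. The token shape here is an assumption: only `is_punc` is named in this changelog, while `is_emoticon` and `countWords` are hypothetical names used for illustration.

```javascript
// Assumed token shape: { raw, is_punc, is_emoticon }.
// is_emoticon is a hypothetical flag; compendium may mark emoticons differently.
function countWords(tokens) {
  return tokens.filter((t) => !t.is_punc && !t.is_emoticon).length;
}

const tokens = [
  { raw: "Nice", is_punc: false, is_emoticon: false },
  { raw: "work", is_punc: false, is_emoticon: false },
  { raw: "!", is_punc: true, is_emoticon: false },
  { raw: ":)", is_punc: false, is_emoticon: true },
];

console.log(countWords(tokens)); // prints: 2
```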

v0.0.13

  • Fix issues
    • Missing 're contraction
    • Lexer a bit too greedy with emoticons (was catching -s in inter-sport)
  • Improved dependency parsing
    • Third rank of governors
    • More governor tag candidates
  • Imperative sentence type, detected by looking for VB governors
  • New Brill rules

v0.0.12

  • Remove annoying console.log
  • Few new Brill rules
  • Better looking example page + readme screenshot
  • Fix bug that skipped a lot of emoticons when building lexicons

v0.0.11

  • Verbs
    • Irregular verbs conjugation + integration in lexicon
    • Regular verbs in Lexicon
    • Basic tense detection (for simple sentences, based on dependency parsing)
  • Numerous new Brill rules for PoS tagging (92.519% on Penn Treebank)
  • Improved dependency parsing
  • Trie class interface
  • Bit of code documentation
  • Sentence detectors are now applied directly in the analysis sentence loop (no longer in a dedicated second loop)
  • New attributes for tokens (is_verb, infinitive, is_noun, plural, singular)
  • *in > *ing inference (if a word ends with in, is not in lexicon, and the same word plus g exists in lexicon, then infer it as VBG)
  • New tests
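
The *in > *ing inference rule above translates almost directly into code. In this minimal sketch the lexicon is modelled as a `Set` of lowercase words, and `inferGerund` and its return shape are hypothetical; compendium's real lexicon structure is not described here.

```javascript
// Hypothetical helper; the lexicon is modelled as a Set of known lowercase words.
// Returns the inferred form and tag, or null when the rule does not apply.
function inferGerund(word, lexicon) {
  const lower = word.toLowerCase();
  // The rule only concerns words ending with "in"...
  if (!lower.endsWith("in")) return null;
  // ...that are not themselves in the lexicon...
  if (lexicon.has(lower)) return null;
  // ...and whose "+g" form is a known word.
  const candidate = lower + "g";
  return lexicon.has(candidate) ? { norm: candidate, pos: "VBG" } : null;
}

const lexicon = new Set(["walking", "cabin", "in"]);
console.log(inferGerund("walkin", lexicon)); // { norm: 'walking', pos: 'VBG' }
console.log(inferGerund("cabin", lexicon));  // null (already a known word)
```

Checking the lexicon first keeps real words such as "cabin" from being rewritten.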

v0.0.10

v0.0.9

  • Improved token PoS tagging (+0.8% on Penn Treebank!):
    • Order of detectors changed
    • Better management of composed words
  • First step of scaffolding for dependency parsing feature

v0.0.8

  • All regular verbs now conjugated (and/or conjugable)
  • PoS tagging for verbs greatly improved
  • Better packing of verbs and nationalities (-2 KB)
  • Better filtering of lexicon (-1 KB)
  • Reorganised the project a bit
    • Lexicon data files moved to src/lexicon
    • Compendium data files moved to src/dictionaries
  • Lots of new tests (isSingular, verbs, lexicon...)
  • Refactored detectors API so it's a bit less verbose

v0.0.7

  • Better sentiment profiling for mixed sentiment, in particular when using multiple adverbs
  • Politeness, dirtiness scores
  • Synonyms feature for tokens normalization
    • Used by PoS tagger in case no other method returned a tag

v0.0.6

  • Add interrogative and exclamatory sentence types
  • Fix low confidence for obvious PoS tagging (CD, SYM...)
  • [Gulpfile] Add test run on live rebuild

v0.0.5

  • Statistics now skip punctuation tokens
  • Improve verb inflector
  • Better sentiment profiling
  • Better breakpoint detection