minor fixes to fasttext tutorial
jayantj committed Jan 11, 2017
1 parent 7b0874a commit a7bceb6
Showing 1 changed file with 2 additions and 2 deletions.
docs/notebooks/FastText_Tutorial.ipynb
@@ -22,8 +22,8 @@
"The main principle behind FastText is that the morphological structure of a word carries important information about the meaning of the word, which is not taken into account by traditional word embeddings, which train a unique word embedding for every individual word. This is especially significant for morphologically rich languages (German, Turkish) in which a single word can have a large number of morphological forms, each of which might occur rarely, thus making it hard to train good word embeddings. \n",
"FastText attempts to solve this by treating each word as the aggregation of its subwords. For the sake of simplicity and language-independence, subwords are taken to the character ngrams of the word. The vector for a word is simply taken to be the sum of all vectors of its component char-ngrams. \n",
"According to a detailed comparison of Word2Vec and FastText in [this notebook](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Word2Vec_FastText_Comparison.ipynb), FastText does significantly better on syntactic tasks as compared to the original Word2Vec, especially when the size of the training corpus is small. Word2Vec slightly outperforms FastText on semantic tasks though. The differences grow smaller as the size of training corpus increases. \n",
"Training time for FastText is significantly higher than the Gensim version of FastText (`15min 42s` vs `6min 42s` on text8, 17 mil tokens, 5 epochs, and a vector size of 100). \n",
"FastText can be used to obtain vectors for out-of-vocabulary (oov) words, by summing up vectors for its component char-ngrams, provided atleast one of the char-ngrams was present in the training data."
"Training time for FastText is significantly higher than the Gensim version of Word2Vec (`15min 42s` vs `6min 42s` on text8, 17 mil tokens, 5 epochs, and a vector size of 100). \n",
"FastText can be used to obtain vectors for out-of-vocabulary (oov) words, by summing up vectors for its component char-ngrams, provided at least one of the char-ngrams was present in the training data."
]
},
{
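The cell edited above describes two mechanisms concretely enough to demo: a word's vector is the sum of its char-ngram vectors, and the same summation yields vectors for out-of-vocabulary words. Below is a minimal sketch of both, assuming the gensim-native `FastText` class and its current parameter names (`vector_size`, `min_n`, `max_n`); the toy corpus and probe words are invented for illustration, and note that at the time of this commit the tutorial used gensim's wrapper around Facebook's fastText binary rather than this class.

```python
# A minimal sketch, assuming gensim's native FastText implementation
# (class and parameter names follow recent gensim releases and are not
# taken from the commit itself).
from gensim.models import FastText

# Tiny illustrative corpus (invented for this sketch; the tutorial
# itself trains on text8).
sentences = [
    ["human", "machine", "interface", "for", "lab", "computer", "applications"],
    ["a", "survey", "of", "user", "opinion", "of", "computer", "system", "response", "time"],
]

# min_n/max_n bound the lengths of the character ngrams ("subwords").
# Each word is bracketed before ngrams are extracted, so the 3-grams
# of "where" are: <wh, whe, her, ere, re>.
model = FastText(sentences, vector_size=100, min_count=1, min_n=3, max_n=6)

# In-vocabulary lookup works exactly as in Word2Vec.
vec_in_vocab = model.wv["computer"]

# "computation" never occurs in the corpus, but it shares char-ngrams
# with "computer" ("<co", "com", "omp", "put", ...), so a vector can
# still be synthesized by summing its component ngram vectors.
vec_oov = model.wv["computation"]
print(vec_in_vocab.shape, vec_oov.shape)  # (100,) (100,)
```

Because lookup only needs the subword table, the queried word itself never has to appear in the training vocabulary, which is exactly the OOV behaviour the changed line documents.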
