
[MRG] Wrapper for FastText #847

Merged: 62 commits into piskvorky:develop on Jan 24, 2017

Conversation

@jayantj (Contributor) commented Sep 1, 2016

I've branched this off @droudy's KeyedVectors PR (#833). The changes I've made are all in gensim/models/wrappers/fasttext.py

Just wondering if this is the correct approach to go with. Once we can decide that, I'll properly document the code, write tests, and refactor it; right now it's mostly a POC. The main things I'm concerned about are:

  1. Portability: the wrapper uses structs to read the binary file generated by FastText, and I'm not completely sure how portable this is across different compiler versions and architectures (see the sketch after this list).
  2. Changes to the binary format: if FastText makes any changes to its binary file, we'll have to mirror them, and since the binary file seems to be mostly for internal use, it could very well see changes.
  3. Memory usage: FastText takes up a huge amount of memory. With a tiny vocabulary of around 4k, my memory usage goes to ~750 MB, while regular word2vec takes ~70 MB for the same data. This seems to be due to a lot of unused vectors allocated for the weight matrix, so it should be possible to reduce it significantly.
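For a rough idea of what the struct-based reading looks like, here's a minimal sketch; the field names and the assumption that the header is written as 32-bit little-endian integers are illustrative, not the actual binary layout (that layout question is exactly the portability concern above):

import struct

def read_int32_fields(f, count):
    # assuming the header fields are consecutive 32-bit integers; an explicit
    # little-endian format string ('<i') keeps the read independent of the
    # host architecture
    return struct.unpack('<' + 'i' * count, f.read(4 * count))

with open('model.bin', 'rb') as f:  # placeholder path
    dim, ws, epoch = read_int32_fields(f, 3)  # illustrative field names only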

Steps to reproduce (you'll need to set up FastText first):

from gensim.models.wrappers import FastText
ft_model = FastText.train(<fast_text_path>, <corpus_file>)
ft_model[<in_vocab_word>]
ft_model[<oov_word>]

I've manually verified on a small set that it produces the same results as FastText.

@gojomo @tmylk

@piskvorky (Owner):

Awesome! We want to finalize and merge the KeyedVector functionality ASAP, so things don't diverge too much.

@jayantj jayantj mentioned this pull request Sep 9, 2016
@jayantj (Contributor, Author) commented Sep 9, 2016

What do you think would be the best way to handle OOV words with KeyedVectors?
Operations like most_similar, similarity, etc. all need to be modified to allow for out-of-vocab words. Subclassing KeyedVectors and using that subclass inside the FastText class?
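Something along these lines is what I have in mind; compute_ngrams and ngram_vectors are hypothetical names here, just to illustrate the idea, not existing APIs:

import numpy as np
from gensim.models.keyedvectors import KeyedVectors

class FastTextKeyedVectors(KeyedVectors):
    def word_vec(self, word, use_norm=False):
        if word in self.vocab:
            # in-vocab words behave exactly like the base class
            return super(FastTextKeyedVectors, self).word_vec(word, use_norm)
        # OOV words: average the vectors of the word's character ngrams
        # (compute_ngrams and ngram_vectors are placeholder names)
        ngrams = [ng for ng in self.compute_ngrams(word) if ng in self.ngram_vectors]
        if not ngrams:
            raise KeyError("all ngrams for word '%s' absent from model" % word)
        return np.mean([self.ngram_vectors[ng] for ng in ngrams], axis=0)

If most_similar, similarity, etc. all go through word_vec, overriding it in the subclass would let them work for OOV words without touching those methods.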

@jayantj (Contributor, Author) commented Jan 6, 2017

The build is passing now (finally).
Added a tutorial for the FastText wrapper.
Let me know if there are any changes you think should be made.

Regarding KeyedVecs - the WordRank wrapper seems to return a Word2Vec instance right now.
I agree it makes sense to have FastText return a KeyedVecs instance, but it requires Word2Vec.load_word2vec_format to return a KeyedVecs instance too. That seems like a fairly major change, as load_word2vec_format is (probably) a widely used method. It'd also break quite a few tests in the current test_word2vec.py.
I think it'd make more sense to have a separate PR for that. Along with that PR, I'll also change the FastText wrapper to return a KeyedVecs instance.

One last thing - the FastText tests which involve actually running the FastText binary do not run on the build system; I've run them locally. Is this fine, or is there a way to set up FastText on the build system? I didn't see anything similar for the Mallet LDA wrapper.

Thoughts?

@jayantj (Contributor, Author) commented Jan 6, 2017

On second thought, I'm unsure whether the FastText wrapper should return a KeyedVecs instance. The loaded FastText model isn't simply a set of vectors; it contains complete data about the model (hyperparameters and weight matrices). Theoretically the model could be trained further; we just can't do that at the moment because of limitations in the FastText CLI.
Maybe a FastText instance makes the most sense. The load_word2vec_format method in Word2Vec definitely should be changed to return a KeyedVecs instance.
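To make the distinction concrete, roughly this is the difference in usage being discussed; the KeyedVectors form is only the proposed direction, not something that exists at this point:

from gensim.models import Word2Vec

# current behaviour: the loader hands back a full Word2Vec model
model = Word2Vec.load_word2vec_format('vectors.bin', binary=True)  # placeholder file
model.most_similar('office')

# proposed direction: the same data comes back as plain KeyedVectors,
# i.e. just the vectors, with no training state attached
# kv = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)
# kv.most_similar('office')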

@jayantj jayantj changed the title [WIP] Wrapper for FastText [RG] Wrapper for FastText Jan 6, 2017
@jayantj jayantj changed the title [RG] Wrapper for FastText [MRG] Wrapper for FastText Jan 6, 2017
@jayantj (Contributor, Author) commented Jan 6, 2017

Pushed a new branch and created a PR #1078 which contains the update to load_word2vec_format, tests and the FastText wrapper.

@tmylk (Contributor) commented Jan 6, 2017

We don't test wrappers for DTM, Mallet, etc. in Travis, as it takes a long time to download/compile the binaries for them. We do test them prior to a release, though. I suggest having the same process for FastText.

@jayantj (Contributor, Author) commented Jan 9, 2017

@tmylk thanks, that sounds good.

@piskvorky I'm unable to request a review via GitHub; the option doesn't seem to show up under "Reviewers".
I'd appreciate it if you could look over this once.
Link to the tutorial for the FastText wrapper.


Example::

>>> trained_model['office']
@piskvorky (Owner):
Misleading example (doesn't use this method at all).

else:
raise KeyError("word '%s' not in vocabulary" % word)
mean.append(weight * self.word_vec(word, use_norm=True))
if word in self.vocab:
@piskvorky (Owner), Jan 11, 2017:
Dead code test, can never reach here (above line would throw a KeyError).

@jayantj (Contributor, Author):
The KeyError has been removed.

@piskvorky (Owner):
No, it's still there, on line 66.

@jayantj (Contributor, Author):
That line raises a KeyError in case word in self.vocab is False, so in case it's True, line 115 would be executed.
Also, word_vec has been overridden in the KeyedVectors subclass for FastText.

@piskvorky (Owner), Jan 12, 2017:
Yes, my point is -- isn't it always True? How could it be False, when that would raise an exception at the line above? The test seems superfluous.

But if subclasses can make word_vec() behave differently (not raise for missing words), then it makes sense. Not sure what the general contract for word_vec() behaviour is.

if not positive:
raise ValueError("cannot compute similarity with no input")

all_words = set([self.vocab[word].index for word in positive+negative if word in self.vocab])
@piskvorky (Owner):
What is the all_words created above for?

@jayantj (Contributor, Author):
To remove the input words from the returned most_similar words.

@piskvorky (Owner), Jan 11, 2017:
Eh, never mind, the review snippet showed me the code for all_words from most_similar above, and I thought it was the same function. Disregard my comment.

Square brackets [ ] not needed inside the set().
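For illustration, a standalone comparison of the forms (the data here is made up):

words = ['office', 'office', 'desk']
indices = {'office': 0, 'desk': 1}

as_list = set([indices[w] for w in words if w in indices])  # builds a throwaway list first
as_gen = set(indices[w] for w in words if w in indices)     # generator expression, no intermediate list
as_comp = {indices[w] for w in words if w in indices}       # set comprehension, same result
assert as_list == as_gen == as_comp == {0, 1}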

if not words:
raise ValueError("cannot select a word from an empty list")
vectors = vstack(self.syn0norm[self.vocab[word].index] for word in words).astype(REAL)
logger.debug("using words %s" % words)
@piskvorky (Owner):
Use lazy log formatting (pass the params via comma, instead of formatting the string directly, which is mostly wasteful because DEBUG output is usually not emitted).
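For example, with a throwaway logger (names here are only for illustration):

import logging
logger = logging.getLogger(__name__)

words = ['office', 'desk']
logger.debug("using words %s" % words)  # eager: builds the formatted string even if DEBUG is off
logger.debug("using words %s", words)   # lazy: logging only formats it if the record is emitted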

@jayantj (Contributor, Author):
Done

vectors = vstack(self.syn0norm[self.vocab[word].index] for word in words).astype(REAL)
logger.debug("using words %s" % words)
vectors = []
for word in words:
@piskvorky (Owner):
The line that initialized words was removed above, so how does this work?

The previous version seemed shorter and cleaner, what is the purpose of this explicit loop?

@jayantj (Contributor, Author), Jan 11, 2017:
words was also the name of one of the method's parameters, and the removed line was filtering out OOV words.
I've renamed/reworked things to make this a little clearer in the latest changes.

except KeyError:
logger.debug("vector for word %s not present, ignoring the word", word)
if not vectors:
raise ValueError("vector for all given words absent")
@piskvorky (Owner):
Confusing message, please rephrase. How about "cannot compute similarity with no input", like above, for consistency?

Also wondering whether we should throw an exception in case of missing words, rather than silently ignoring them (a DEBUG-level message is almost the same as ignoring them).

@jayantj (Contributor, Author):
Changed to a warning for now; that should be noticeable.
Not sure an exception would be ideal, since it's a change in behaviour, and if the intention is only to make things more transparent, a warning probably serves the purpose best.

@piskvorky (Owner):
OK, thanks.

@@ -469,6 +469,9 @@ def __init__(
self.build_vocab(sentences, trim_rule=trim_rule)
self.train(sentences)

def initialize_word_vectors(self):
self.wv = KeyedVectors() # wv --> word vectors
@piskvorky (Owner):
Remove comment, adds nothing.

@jayantj (Contributor, Author):
Done


def load_binary_data(self, model_binary_file):
"""Loads data from the output binary file created by FastText training"""
with open(model_binary_file, 'rb') as f:
@piskvorky (Owner):
Use smart_open (everywhere).
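That is, something along these lines (the path is a placeholder; the smart_open helper is assumed to be importable under this name, as in the smart_open package of this era):

from smart_open import smart_open

with smart_open('model.bin', 'rb') as f:  # placeholder path
    header = f.read(16)  # smart_open transparently handles local, compressed and remote files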

@jayantj (Contributor, Author):
Done

@staticmethod
def compute_ngrams(word, min_n, max_n):
ngram_indices = []
BOW, EOW = ('<','>') # Used by FastText to attach to all words as prefix and suffix
@piskvorky (Owner):
PEP8: space after comma.

@jayantj (Contributor, Author):
Done
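For reference, a minimal sketch of the character n-gram extraction compute_ngrams performs, assuming FastText's convention of wrapping each word in '<' and '>' markers (illustrative only, not the wrapper's exact code):

def compute_ngrams(word, min_n, max_n):
    BOW, EOW = ('<', '>')  # FastText attaches these to every word as prefix and suffix
    extended_word = BOW + word + EOW
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(extended_word) - n + 1):
            ngrams.append(extended_word[i:i + n])
    return ngrams

# e.g. compute_ngrams('office', 3, 4) starts with '<of', 'off', 'ffi', ...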

@piskvorky (Owner) commented Jan 11, 2017

@jayantj left a few comments in the review.

Looking at the notebook:

  • What does Training time for FastText is significantly higher than the Gensim version of FastText mean? How can a wrapper be faster than the original?
  • atleast => at least (in multiple places)

Otherwise looks good 👍

@jayantj (Contributor, Author) commented Jan 11, 2017

Thanks for the review, much appreciated.
The line "Training time for FastText is significantly higher than the Gensim version of FastText" was meant to be "Training time for FastText is significantly higher than the Gensim version of Word2Vec" - fixed now.

@tmylk tmylk merged commit 2a70e3a into piskvorky:develop Jan 24, 2017
Labels: difficulty hard (Hard issue: requires deep gensim understanding & high python/cython skills), feature (Issue describes a new feature)