[MRG] Wrapper for FastText #847

Merged
62 commits merged on Jan 24, 2017
Changes from 39 commits

Commits
55a4fc9
updated refactor
Aug 18, 2016
e916f7e
commit missed file
Aug 18, 2016
e5416ed
docstring added
Aug 18, 2016
e64766b
more refactoring
Aug 19, 2016
c34cf37
add missing docstring
Aug 19, 2016
c9b31f9
fix docstring format
Aug 19, 2016
a0329af
clearer docstring
droudy Aug 19, 2016
0c0e2fa
minor typo in word2vec wmdistance
jayantj Sep 2, 2016
cdefeb0
pyemd error in keyedvecs
jayantj Sep 8, 2016
1aec5a2
relative import of keyedvecs from word2vec fails
jayantj Sep 8, 2016
e7368a3
bug in init_sims in word2vec
jayantj Sep 8, 2016
fe283c2
property descriptors for syn0, syn0norm, index2word, vocab - fixes bu…
jayantj Sep 8, 2016
9b36bc4
tests for loading older word2vec models
jayantj Sep 9, 2016
dfe1893
backwards compatibility for loading older models
jayantj Sep 9, 2016
4a03f20
test for syn0norm not saved to file
jayantj Sep 9, 2016
09b6ebe
syn0norm not saved to file for KeyedVectors
jayantj Sep 9, 2016
7df4138
tests and fix for accuracy
jayantj Sep 9, 2016
4c54d9b
minor bug in finalized vocab check
jayantj Sep 9, 2016
a28f9f1
warnings for direct syn0/syn0norm access
jayantj Sep 9, 2016
bf1182e
fixes use of most_similar in accuracy
jayantj Sep 10, 2016
5a6b97b
changes logging level to ERROR in word2vec tests
jayantj Sep 10, 2016
cfb2e1c
renames kv to wv in word2vec
jayantj Sep 12, 2016
b002765
minor bugs with checking existence of syn0
jayantj Sep 12, 2016
27c0a14
replaces syn0 and syn0norm with wv.syn0 and wv.syn0norm in tests and …
jayantj Sep 12, 2016
81f8cbb
adds changelog
jayantj Sep 12, 2016
aa7e632
initial fastText wrapper class
jayantj Aug 29, 2016
c780b9b
fasttext load binary data + oov vectors
jayantj Aug 29, 2016
ccf5a47
tests for fasttext wrapper
jayantj Sep 9, 2016
708113b
reduced memory requirements for fasttext model
jayantj Sep 9, 2016
b7de266
annoy indexer tests for fasttext
jayantj Sep 12, 2016
4d3d251
adds changelog and documentation
jayantj Sep 12, 2016
f2d13ce
renames kv to wv in fasttext wrapper
jayantj Sep 12, 2016
3777423
refactors syn0 word vector lookup into method
jayantj Sep 12, 2016
6e20834
updates keyedvector load tests to use actual values
jayantj Dec 16, 2016
564ea0d
Merge branch 'develop' into fasttext
jayantj Dec 18, 2016
caeb275
updates word2vec load old models tests + test models
jayantj Dec 19, 2016
784ffbf
more fasttext wrapper tests
jayantj Dec 22, 2016
20fe6f2
refactoring of some fasttext and word2vec methods
jayantj Dec 22, 2016
3b9483b
refactors FastText to use subclass of KeyedVectors, updates tests
jayantj Dec 22, 2016
f5cdfb6
Merge branch 'develop' into fasttext
jayantj Dec 26, 2016
700dd26
changes setUp for fast text unittests to setUpClass to reduce time taken
jayantj Dec 26, 2016
d30ea56
adds normalized ngram vectors for fasttext model, tests
jayantj Dec 27, 2016
bb6e538
deletes training files after loading model, tests
jayantj Dec 27, 2016
c7a5d07
doesnt match with oov words, tests
jayantj Dec 27, 2016
734057b
more asserts while loading from fasttext model file, renames some var…
jayantj Dec 27, 2016
56d89e9
updates FastText __contains__ to return True for all words for which …
jayantj Dec 27, 2016
dc51096
updates docstrings, adds comments for fasttext wrapper and tests
jayantj Dec 27, 2016
bb48663
adds fasttext test models
jayantj Dec 27, 2016
b58dd53
changes setUpClass to setUp to allow python2.6 compatibility
jayantj Jan 3, 2017
461a6b4
updates word2vec test model files
jayantj Jan 4, 2017
9137090
python2.6 compatibility for fasttext tests
jayantj Jan 4, 2017
e5ae899
Revert "updates keyedvector load tests to use actual values"
jayantj Jan 4, 2017
b98b40f
Merge branch 'develop' into fasttext
jayantj Jan 4, 2017
5eb8f75
replaces all instances of vocab and syn0 being accessed directly thro…
jayantj Jan 4, 2017
27bec7b
adds fasttext tutorial notebook
jayantj Jan 6, 2017
ef0e1e2
minor doc updates
jayantj Jan 6, 2017
ab07ef9
removes direct vocab access in FastText
jayantj Jan 6, 2017
2f37b04
suppresses numpy overflow warning while computing fasttext hash
jayantj Jan 6, 2017
b2ff794
minor doc + pep8 updates
jayantj Jan 11, 2017
7b0874a
adds warning to doesnt_match if word vector is missing
jayantj Jan 11, 2017
a7bceb6
minor fixes to fasttext tutorial
jayantj Jan 11, 2017
dee9f97
Merge branch 'develop' into fasttext
tmylk Jan 24, 2017
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -1,6 +1,10 @@
Changes
=======


* Vectors for word2vec and doc2vec extracted out into `KeyedVectors`; save/load and similarity calculations can be run independently of the model
- Maintains backwards compatibility: `w2v_model.syn0` and `w2v_model.syn0norm` raise a warning
* FastText wrapper added; it can be used for training FastText word representations and performing word2vec operations over them
* Fix automatic learning of eta (prior over words) in LDA (@olavurmortensen, [#1024](https://github.com/RaRe-Technologies/gensim/pull/1024#)).
* eta should have dimensionality V (size of vocab) not K (number of topics). eta with shape K x V is still allowed, as the user may want to impose specific prior information to each topic.
* eta is no longer allowed the "asymmetric" option. Asymmetric priors over words in general are fine (learned or user defined).
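
As a short illustration of the `KeyedVectors` refactor described in the first two changelog entries, here is a minimal sketch; the toy corpus and file path are placeholders, not part of the PR:

```python
from gensim.models import Word2Vec
from gensim.models.keyedvectors import KeyedVectors

sentences = [["human", "interface", "computer"], ["graph", "trees", "human"]]
model = Word2Vec(sentences, min_count=1, size=50)

vec = model.wv["human"]                 # word vectors now live on model.wv
# per the changelog, model.syn0 / model.syn0norm still work but raise a warning

model.wv.save("/tmp/human_vectors.kv")  # save the vectors without the training state
wv = KeyedVectors.load("/tmp/human_vectors.kv")
print(wv.most_similar("human"))         # similarity queries run independently of the model
```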
38 changes: 19 additions & 19 deletions gensim/models/keyedvectors.py
@@ -45,6 +45,15 @@ def save(self, *args, **kwargs):
kwargs['ignore'] = kwargs.get('ignore', ['syn0norm'])
super(KeyedVectors, self).save(*args, **kwargs)

def word_vec(self, word, use_norm=False):
if word in self.vocab:
if use_norm:
return self.syn0norm[self.vocab[word].index]
else:
return self.syn0[self.vocab[word].index]
else:
raise KeyError("word '%s' not in vocabulary" % word)

def most_similar(self, positive=[], negative=[], topn=10, restrict_vocab=None, indexer=None):
"""
Find the top-N most similar words. Positive words contribute positively towards the
@@ -89,11 +98,10 @@ def most_similar(positive=[], negative=[], topn=10, restrict_vocab=None, i
for word, weight in positive + negative:
if isinstance(word, ndarray):
mean.append(weight * word)
elif word in self.vocab:
mean.append(weight * self.syn0norm[self.vocab[word].index])
all_words.add(self.vocab[word].index)
else:
raise KeyError("word '%s' not in vocabulary" % word)
mean.append(weight * self.word_vec(word))
if word in self.vocab:
piskvorky (Owner), Jan 11, 2017:
    Dead code test, can never reach here (above line would throw a KeyError).

jayantj (Contributor, Author):
    The KeyError has been removed.

piskvorky (Owner):
    No, it's still there, on line 66.

jayantj (Contributor, Author):
    That line raises a KeyError in case `word in self.vocab` is False. So in case it's True, line 115 would be executed.
    Also, `word_vec` has been overridden in the KeyedVectors subclass for FastText.

piskvorky (Owner), Jan 12, 2017:
    Yes, my point is -- isn't it always True? How could it be False, when that would raise an exception at the line above? The test seems superfluous.
    But if subclasses can make word_vec() behave differently (not raise for missing words), then it makes sense. Not sure what the general contract for word_vec() behaviour is.

all_words.add(self.vocab[word].index)
if not mean:
raise ValueError("cannot compute similarity with no input")
mean = matutils.unitvec(array(mean).mean(axis=0)).astype(REAL)
@@ -229,22 +237,14 @@ def most_similar_cosmul(self, positive=[], negative=[], topn=10):
# allow calls like most_similar_cosmul('dog'), as a shorthand for most_similar_cosmul(['dog'])
positive = [positive]

all_words = set()

def word_vec(word):
if isinstance(word, ndarray):
return word
elif word in self.vocab:
all_words.add(self.vocab[word].index)
return self.syn0norm[self.vocab[word].index]
else:
raise KeyError("word '%s' not in vocabulary" % word)

positive = [word_vec(word) for word in positive]
negative = [word_vec(word) for word in negative]
positive = [self.word_vec(word, use_norm=True) for word in positive]
negative = [self.word_vec(word, use_norm=True) for word in negative]
if not positive:
raise ValueError("cannot compute similarity with no input")

all_words = set([self.vocab[word].index for word in positive+negative if word in self.vocab])
piskvorky (Owner):
    What is the all_words created above for?

jayantj (Contributor, Author):
    To remove the input words from the returned most_similar words.

piskvorky (Owner), Jan 11, 2017:
    Eh, never mind, the review snippet showed me the code for all_words from most_similar above, I thought it's the same function. Disregard my comment.
    Square brackets [ ] not needed inside the set().


# equation (4) of Levy & Goldberg "Linguistic Regularities...",
# with distances shifted to [0,1] per footnote (7)
pos_dists = [((1 + dot(self.syn0norm, term)) / 2) for term in positive]
@@ -314,7 +314,7 @@ def doesnt_match(self, words):
logger.debug("using words %s" % words)
if not words:
raise ValueError("cannot select a word from an empty list")
vectors = vstack(self.syn0norm[self.vocab[word].index] for word in words).astype(REAL)
vectors = vstack(self.word_vec(word) for word in words).astype(REAL)
mean = matutils.unitvec(vectors.mean(axis=0)).astype(REAL)
dists = dot(vectors, mean)
return sorted(zip(dists, words))[0][1]
@@ -344,9 +344,9 @@ def __getitem__(self, words):
"""
if isinstance(words, string_types):
# allow calls like trained_model['office'], as a shorthand for trained_model[['office']]
return self.syn0[self.vocab[words].index]
return self.word_vec(words)

return vstack([self.syn0[self.vocab[word].index] for word in words])
return vstack([self.word_vec(word) for word in words])

def __contains__(self, word):
return word in self.vocab
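The review thread above about the seemingly dead `if word in self.vocab` test turns on the contract of `word_vec()`: the base `KeyedVectors` raises `KeyError` for out-of-vocabulary words, while the `FastTextKeyedVectors` subclass introduced later in this PR composes an OOV vector from character ngrams instead, so the membership check before `all_words.add(...)` is not dead code there. A minimal toy sketch of the two behaviours (these classes are illustrative only, not the PR code):

```python
import numpy as np

class ToyKeyedVectors(object):
    """Minimal stand-in for KeyedVectors: OOV lookups raise KeyError."""
    def __init__(self, vectors):
        self.index2word = list(vectors)
        self.vocab = {w: i for i, w in enumerate(self.index2word)}
        self.syn0 = np.array([vectors[w] for w in self.index2word])

    def word_vec(self, word):
        if word in self.vocab:
            return self.syn0[self.vocab[word]]
        raise KeyError("word '%s' not in vocabulary" % word)

class ToyFastTextKeyedVectors(ToyKeyedVectors):
    """Stand-in for FastTextKeyedVectors: OOV words fall back to char-ngram vectors."""
    def __init__(self, vectors, ngram_vectors, min_n=3):
        super(ToyFastTextKeyedVectors, self).__init__(vectors)
        self.ngram_vectors = ngram_vectors
        self.min_n = min_n

    def word_vec(self, word):
        if word in self.vocab:
            return super(ToyFastTextKeyedVectors, self).word_vec(word)
        # out-of-vocabulary: average the vectors of whatever char ngrams we do know
        padded = '<' + word + '>'
        ngrams = [padded[i:i + self.min_n] for i in range(len(padded) - self.min_n + 1)]
        known = [self.ngram_vectors[ng] for ng in ngrams if ng in self.ngram_vectors]
        if not known:
            raise KeyError("all ngrams for word '%s' absent from model" % word)
        return np.mean(known, axis=0)

kv = ToyFastTextKeyedVectors(
    vectors={'night': np.ones(3)},
    ngram_vectors={'<ni': np.ones(3), 'igh': np.ones(3) * 2},
)
print(kv.word_vec('night'))    # in-vocabulary lookup, same as the base class
print(kv.word_vec('nights'))   # OOV, composed from known char ngrams
```

With such a subclass, callers like `most_similar` can get a vector back for a word that is not in `self.vocab`, which is why the membership test is kept.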
9 changes: 6 additions & 3 deletions gensim/models/word2vec.py
@@ -420,13 +420,13 @@ def __init__(
texts are longer than 10000 words, but the standard cython code truncates to that maximum.)

"""

if FAST_VERSION == -1:
logger.warning('Slow version of {0} is being used'.format(__name__))
else:
logger.debug('Fast version of {0} is being used'.format(__name__))

self.wv = KeyedVectors() # wv --> KeyedVectors
self.initialize_word_vectors()
self.sg = int(sg)
self.cum_table = None # for negative sampling
self.vector_size = int(size)
@@ -460,6 +460,9 @@ def __init__(
self.build_vocab(sentences, trim_rule=trim_rule)
self.train(sentences)

def initialize_word_vectors(self):
self.wv = KeyedVectors() # wv --> word vectors
piskvorky (Owner):
    Remove comment, adds nothing.

jayantj (Contributor, Author):
    Done


def make_cum_table(self, power=0.75, domain=2**31 - 1):
"""
Create a cumulative-distribution table using stored vocabulary word counts for
@@ -1617,4 +1620,4 @@ def __iter__(self):
model.accuracy(args.accuracy)

logger.info("finished running %s", program)
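
The `initialize_word_vectors()` call added to `__init__` above is a small hook: a subclass can override it to install a different `KeyedVectors` implementation, which is exactly what the FastText wrapper below does. A schematic sketch of the pattern (simplified, not the actual gensim classes):

```python
class KeyedVectors(object):
    pass

class FastTextKeyedVectors(KeyedVectors):
    pass  # in the real wrapper, adds char-ngram based OOV lookup

class Word2Vec(object):
    def __init__(self):
        self.initialize_word_vectors()  # hook called before any vocab/vector setup

    def initialize_word_vectors(self):
        self.wv = KeyedVectors()

class FastText(Word2Vec):
    def initialize_word_vectors(self):
        self.wv = FastTextKeyedVectors()

assert isinstance(FastText().wv, FastTextKeyedVectors)
```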

1 change: 1 addition & 0 deletions gensim/models/wrappers/__init__.py
@@ -5,3 +5,4 @@
from .ldamallet import LdaMallet
from .dtmmodel import DtmModel
from .ldavowpalwabbit import LdaVowpalWabbit
from .fasttext import FastText
231 changes: 231 additions & 0 deletions gensim/models/wrappers/fasttext.py
@@ -0,0 +1,231 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-


"""
Python wrapper around word representation learning from FastText, a library for efficient learning
of word representations and sentence classification [1].

This module allows training a word embedding from a training corpus with the additional ability
to obtain word vectors for out-of-vocabulary words, using the fastText C implementation.

The wrapped model can NOT be updated with new documents for online training -- use gensim's
`Word2Vec` for that.

Example:

>>> model = gensim.models.wrappers.LdaMallet('/Users/kofola/fastText/fasttext', corpus_file='text8')
(Contributor):
    Shouldn't this be gensim.models.wrappers.FastText(..)

jayantj (Contributor, Author):
    Yep, fixed. Thanks

>>> print model[word] # prints vector for given words

.. [1] https://github.com/facebookresearch/fastText#enriching-word-vectors-with-subword-information

"""


import logging
import tempfile
import os
import struct

import numpy as np

from gensim import utils
from gensim.models.keyedvectors import KeyedVectors
from gensim.models.word2vec import Word2Vec

from six import string_types

logger = logging.getLogger(__name__)


class FastTextKeyedVectors(KeyedVectors):
def word_vec(self, word, use_norm=False):
if word in self.vocab:
return super(FastTextKeyedVectors, self).word_vec(word, use_norm)
else:
word_vec = np.zeros(self.syn0_all.shape[1])
ngrams = FastText.compute_ngrams(word, self.min_n, self.max_n)
for ngram in ngrams:
if ngram in self.ngrams:
word_vec += self.syn0_all[self.ngrams[ngram]]
if word_vec.any():
return word_vec/len(ngrams)
else: # No ngrams of the word are present in self.ngrams
raise KeyError('all ngrams for word %s absent from model' % word)


class FastText(Word2Vec):
"""
Class for word vector training using FastText. Communication between FastText and Python
takes place by working with data files on disk and calling the FastText binary with
subprocess.call().
Implements functionality similar to [fasttext.py](https://github.com/salestock/fastText.py),
improving speed and scope of functionality like `most_similar`, `accuracy` by extracting vectors
into numpy matrix.

"""

def initialize_word_vectors(self):
self.wv = FastTextKeyedVectors() # wv --> word vectors

@classmethod
def train(cls, ft_path, corpus_file, output_file=None, model='cbow', size=100, alpha=0.025, window=5, min_count=5,
loss='ns', sample=1e-3, negative=5, iter=5, min_n=3, max_n=6, sorted_vocab=1, threads=12):
"""
`ft_path` is the path to the FastText executable, e.g. `/home/kofola/fastText/fasttext`.

`corpus_file` is the filename of the text file to be used for training the FastText model.
Expects file to contain space-separated tokens in a single line

`model` defines the training algorithm. By default, cbow is used. Accepted values are
cbow, skipgram.

`size` is the dimensionality of the feature vectors.

`window` is the maximum distance between the current and predicted word within a sentence.

`alpha` is the initial learning rate (will linearly drop to `min_alpha` as training progresses).

`min_count` = ignore all words with total frequency lower than this.

`loss` = defines training objective. Allowed values are `hs` (hierarchical softmax),
`ns` (negative sampling) and `softmax`. Defaults to `ns`

`sample` = threshold for configuring which higher-frequency words are randomly downsampled;
default is 1e-3, useful range is (0, 1e-5).

`negative` = the value for negative specifies how many "noise words" should be drawn
(usually between 5-20). Default is 5. If set to 0, no negative sampling is used.
Only relevant when `loss` is set to `ns`

`iter` = number of iterations (epochs) over the corpus. Default is 5.

`min_n` = min length of char ngrams to be used for training word representations. Default is 3.

`max_n` = max length of char ngrams to be used for training word representations. Set `max_n` to be
lesser than `min_n` to avoid char ngrams being used. Default is 6.

`sorted_vocab` = if 1 (default), sort the vocabulary by descending frequency before
assigning word indexes.

"""
ft_path = ft_path
output_file = output_file or os.path.join(tempfile.gettempdir(), 'ft_model')
ft_args = {
'input': corpus_file,
'output': output_file,
'lr': alpha,
'dim': size,
'ws': window,
'epoch': iter,
'minCount': min_count,
'neg': negative,
'loss': loss,
'minn': min_n,
'maxn': max_n,
'thread': threads,
't': sample
}
cmd = [ft_path, model]
for option, value in ft_args.items():
cmd.append("-%s" % option)
cmd.append(str(value))

output = utils.check_output(args=cmd)
model = cls.load_fasttext_format(output_file)
return model

@classmethod
def load_fasttext_format(cls, model_file):
model = cls.load_word2vec_format('%s.vec' % model_file)
model.load_binary_data('%s.bin' % model_file)
return model

def load_binary_data(self, model_file):
with open(model_file, 'rb') as f:
self.load_model_params(f)
self.load_dict(f)
self.load_vectors(f)

def load_model_params(self, f):
(dim, ws, epoch, minCount, neg, _, loss, model, bucket, minn, maxn, _, t) = self.struct_unpack(f, '@12i1d')
self.size = dim
self.window = ws
self.iter = epoch
self.min_count = minCount
self.negative = neg
self.loss = loss
self.sg = model == 'skipgram'
self.bucket = bucket
self.wv.min_n = minn
self.wv.max_n = maxn
self.sample = t

def load_dict(self, f):
(dim, nwords, _) = self.struct_unpack(f, '@3i')
assert len(self.wv.vocab) == nwords, 'mismatch between vocab sizes'
ntokens, = self.struct_unpack(f, '@q')
for i in range(nwords):
word = ''
char, = self.struct_unpack(f, '@c')
char = char.decode()
while char != '\x00':
word += char
char, = self.struct_unpack(f, '@c')
char = char.decode()
count, _ = self.struct_unpack(f, '@ib')
_ = self.struct_unpack(f, '@i')
assert self.wv.vocab[word].index == i, 'mismatch between gensim word index and fastText word index'
self.wv.vocab[word].count = count

def load_vectors(self, f):
num_vectors, dim = self.struct_unpack(f, '@2q')
float_size = struct.calcsize('@f')
if float_size == 4:
dtype = np.dtype(np.float32)
elif float_size == 8:
dtype = np.dtype(np.float64)

self.num_original_vectors = num_vectors
self.wv.syn0_all = np.fromstring(f.read(num_vectors * dim * float_size), dtype=dtype)
self.wv.syn0_all = self.wv.syn0_all.reshape((num_vectors, dim))
self.init_ngrams()

def struct_unpack(self, f, fmt):
num_bytes = struct.calcsize(fmt)
return struct.unpack(fmt, f.read(num_bytes))

def init_ngrams(self):
self.wv.ngrams = {}
all_ngrams = []
for w, v in self.vocab.items():
all_ngrams += self.compute_ngrams(w, self.wv.min_n, self.wv.max_n)
all_ngrams = set(all_ngrams)
self.num_ngram_vectors = len(all_ngrams)
ngram_indices = []
for i, ngram in enumerate(all_ngrams):
ngram_hash = self.ft_hash(ngram)
ngram_indices.append((len(self.wv.vocab) + ngram_hash) % self.bucket)
self.wv.ngrams[ngram] = i
self.wv.syn0_all = self.wv.syn0_all.take(ngram_indices, axis=0)

@staticmethod
def compute_ngrams(word, min_n, max_n):
ngram_indices = []
BOW, EOW = ('<','>')
extended_word = BOW + word + EOW
ngrams = set()
for i in range(len(extended_word) - min_n + 1):
for j in range(min_n, max(len(extended_word) - max_n, max_n + 1)):
ngrams.add(extended_word[i:i+j])
return ngrams

@staticmethod
def ft_hash(string):
# Reproduces hash method used in fastText
h = np.uint32(2166136261)
for c in string:
h = h ^ np.uint32(ord(c))
h = h * np.uint32(16777619)
return h
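
Finally, a hypothetical end-to-end use of the wrapper defined in this file; the fastText binary path, corpus file and model path below are assumptions, and training requires the compiled fastText executable on disk:

```python
from gensim.models.wrappers import FastText

# train by shelling out to the fastText binary (paths are illustrative)
model = FastText.train('/path/to/fastText/fasttext', corpus_file='text8',
                       model='skipgram', size=100, min_n=3, max_n=6)

print(model['night'])    # vector for an in-vocabulary word
print(model['nights'])   # OOV word: vector composed from its char-ngram vectors

# alternatively, load a model already trained with the fastText command line tool;
# load_fasttext_format expects both ft_model.vec and ft_model.bin to exist
model = FastText.load_fasttext_format('/path/to/ft_model')
```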

Binary file added gensim/test/test_data/word2vec_pre_kv
Binary file not shown.
Binary file modified gensim/test/test_data/word2vec_pre_kv_py3_4
Binary file not shown.
Binary file added gensim/test/test_data/word2vec_pre_kv_sep
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file modified gensim/test/test_data/word2vec_pre_kv_sep_py3_4
Binary file not shown.
Binary file modified gensim/test/test_data/word2vec_pre_kv_sep_py3_4.syn0_lockf.npy
Binary file not shown.
Binary file modified gensim/test/test_data/word2vec_pre_kv_sep_py3_4.syn1neg.npy
Binary file not shown.
Binary file modified gensim/test/test_data/word2vec_pre_kv_sep_py3_4.wv.syn0.npy
Binary file not shown.