[WIP] [DNM] Keyedvector load word2vec format #1078

jayantj · 2017-01-06T08:08:38Z

Includes commits from #847

The purpose of this PR is to change Word2Vec.load_word2vec_format to return a KeyedVectors instance instead of a Word2Vec instance.

The behaviour doesn't change significantly, since all the similarity operations and vector lookups that could be done with a Word2Vec instance created from a load_word2vec_format call can still be done via the KeyedVectors instance. It does change the class of the returned instance (obviously) and updates a bunch of tests to be compatible with the new method.

The FastText wrapper from #847 also required some changes to work with the updated method.

To be merged after #847.

gojomo · 2017-01-06T22:02:44Z

I have some reservations about such a change.

There is a strong expectation that the result of a load... method on a type returns something of the same type – both from general practices, and in the history of this class.

Of course, prior Word2Vec models loaded via load_word2vec_format() were in fact kind-of crippled – only usable for some operations, unless doing extra custom futzing with the internals. They were more-or-less sets-of-KeyedVectors. But the transition to returning a different, more focused type seems to me something that shouldn't be silently glossed-over.

Other options could include: (1) eliminating the method entirely, or leaving it as a stub which simply fails with a warning to use KeyedVectors instead; (2) having it pass through to KeyedVectors, but log a deprecation warning and (for now) instantiate a Word2Vec model composed from that KeyedVectors – but eventually drop the direct load_word2vec_format() entirely.

Whatever is done, related methods like intersect_word2vec_format() should be handled in a sensibly analogous fashion.
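A minimal sketch of what option (2) could look like, assuming toy stand-in classes (`KeyedVectorsSketch` and `Word2VecSketch` are hypothetical names, and the one-vector-per-line input is a simplification, not gensim's actual loader). The model-level method warns, delegates the loading to the vectors class, and wraps the result so existing callers still get a model instance:

```python
import warnings


class KeyedVectorsSketch:
    """Toy stand-in for a vectors-only container (not gensim's real class)."""

    def __init__(self, vectors):
        self.vectors = dict(vectors)

    @classmethod
    def load_word2vec_format(cls, lines):
        # Hypothetical simplified input: each line is "word v1 v2 ...".
        vectors = {}
        for line in lines:
            word, *values = line.split()
            vectors[word] = [float(v) for v in values]
        return cls(vectors)


class Word2VecSketch:
    """Toy stand-in for the full trainable model."""

    def __init__(self, wv=None):
        self.wv = wv

    @classmethod
    def load_word2vec_format(cls, lines):
        # Option (2): warn, delegate the loading, then wrap the result
        # so that callers still receive an instance of this class.
        warnings.warn(
            "load_word2vec_format on the model class is deprecated; "
            "call it on the vectors class instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return cls(wv=KeyedVectorsSketch.load_word2vec_format(lines))
```

Under this arrangement the direct method can later be dropped without changing the vectors class at all, since the loading logic already lives there.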

tmylk · 2017-01-06T22:14:09Z

Option 2 is in line with existing ways of handling direct access to model.vocab
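For illustration, the existing pattern for direct attribute access might look roughly like the sketch below. It assumes a property on the model that forwards to the `wv` attribute and emits a deprecation warning; `ModelSketch` and `VectorsSketch` are hypothetical names and the warning text is invented for the example.

```python
import warnings


class VectorsSketch:
    """Toy vectors container holding the vocabulary."""

    def __init__(self):
        self.vocab = {}


class ModelSketch:
    """Toy model that keeps its vectors in `wv` and forwards old attribute access."""

    def __init__(self):
        self.wv = VectorsSketch()

    @property
    def vocab(self):
        # Old-style read access still works, but warns and forwards to wv.
        warnings.warn(
            "direct access to vocab is deprecated; use wv.vocab",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.wv.vocab

    @vocab.setter
    def vocab(self, value):
        # Old-style assignment is forwarded the same way.
        warnings.warn(
            "direct access to vocab is deprecated; use wv.vocab",
            DeprecationWarning,
            stacklevel=2,
        )
        self.wv.vocab = value
```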

piskvorky · 2017-01-08T04:07:28Z

This whole KeyedVectors stuff deserves an explanatory blog post / notebook. I feel a lot of people will be confused by the changes and warnings that suddenly jump at them with new gensim versions (and for no immediately apparent benefit -- are the benefits and motivation succinctly summarized somewhere? linked from README?).

In general, the same is true for any deprecations. Whenever we're removing something, there should be a link to a clear text of "why" (no good why = don't even accept the change), with examples of what users ought to do to become "compliant" again.

jayantj · 2017-01-10T07:45:01Z

From what I understand, the main reason for the KeyedVecs refactoring was to separate out the set-of-vectors-with-labels from the full Word2Vec/Doc2Vec training model. These were the reasons/advantages:

- No more broken training models (e.g. from loading word2vec C format models, or from calling init_sims(replace=True)) wrapped inside Word2Vec instances.
- Cleaner organization and logic, which makes code reuse easier in the future.
- Some wishlist features for word2vec/doc2vec only need the final vector sets and don't care about the other vectors used in training or the training hyperparameters; these should make use of KeyedVectors rather than Word2Vec.
- If only the set-of-vectors is needed, it can be loaded/saved independently of the trainable models, saving the extra space that is otherwise required for keeping around the extra vectors/parameters.
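To make the last point concrete, here is a small standard-library sketch of saving and loading only the labelled vectors, with no training state involved. It assumes the plain-text word2vec layout (a header line with the vector count and dimensionality, then one `word v1 v2 ...` line per entry); `save_vectors_text` and `load_vectors_text` are hypothetical helpers, not gensim functions, and words containing whitespace are not handled.

```python
import io


def save_vectors_text(vectors, fobj):
    """Write a {word: [floats]} mapping in the plain-text word2vec layout."""
    dims = len(next(iter(vectors.values()))) if vectors else 0
    fobj.write(f"{len(vectors)} {dims}\n")
    for word, values in vectors.items():
        fobj.write(word + " " + " ".join(repr(float(v)) for v in values) + "\n")


def load_vectors_text(fobj):
    """Read the same layout back into a {word: [floats]} mapping."""
    count, dims = (int(x) for x in fobj.readline().split())
    vectors = {}
    for _ in range(count):
        word, *values = fobj.readline().split()
        if len(values) != dims:
            raise ValueError(f"expected {dims} values for {word!r}, got {len(values)}")
        vectors[word] = [float(v) for v in values]
    return vectors


# Round trip through an in-memory buffer; no model object is needed.
buffer = io.StringIO()
save_vectors_text({"cat": [0.5, -1.0], "dog": [0.25, 2.0]}, buffer)
buffer.seek(0)
restored = load_vectors_text(buffer)
```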

@gojomo and @tmylk can probably elaborate on this/explain things better.

I don't think any of these benefits are particularly relevant to users, so I'm not sure what an explanatory blog post would contain.

piskvorky · 2017-01-10T10:19:35Z

I think the post could at least explain how to adjust their old code so it doesn't produce warnings in the new releases :) With some concrete examples too, of course.

And then we could link to this notebook/post from the warning/deprecation message, because people are too busy/lazy to find it out otherwise. We want to be explicit and straightforward.

tmylk · 2017-01-11T14:43:52Z

About the code changes in this PR: option (2) suggested by @gojomo above is preferable.

- KeyedVectors.load_word2vec_format returns a KeyedVectors instance. This is needed in the FastText and Wordrank PRs (#847 and #1066).
- Word2Vec.load_word2vec_format returns a Word2Vec instance and shows a deprecation warning.

I was hoping to merge the wordrank wrapper this week, so I might take just the load_word2vec_format changes from this PR.

tmylk · 2017-01-11T22:13:42Z

@piskvorky There is too little content for a blog post, but the release notes have been updated with instructions and benefits. The mailing list has been notified and a tweet will go out in the GMT morning.

piskvorky · 2017-01-12T00:04:20Z

Cool, thanks.

tmylk · 2017-02-25T01:52:07Z

Merged in #1107

droudy and others added 30 commits September 12, 2016 19:06

updated refactor

55a4fc9

commit missed file

e916f7e

docstring added

e5416ed

more refactoring

e64766b

add missing docstring

c34cf37

fix docstring format

c9b31f9

clearer docstring

a0329af

minor typo in word2vec wmdistance

0c0e2fa

pyemd error in keyedvecs

cdefeb0

relative import of keyedvecs from word2vec fails

1aec5a2

bug in init_sims in word2vec

e7368a3

property descriptors for syn0, syn0norm, index2word, vocab - fixes bug in saving

fe283c2

tests for loading older word2vec models

9b36bc4

backwards compatibility for loading older models

dfe1893

test for syn0norm not saved to file

4a03f20

syn0norm not saved to file for KeyedVectors

09b6ebe

tests and fix for accuracy

7df4138

minor bug in finalized vocab check

4c54d9b

warnings for direct syn0/syn0norm access

a28f9f1

fixes use of most_similar in accuracy

bf1182e

changes logging level to ERROR in word2vec tests

5a6b97b

renames kv to wv in word2vec

cfb2e1c

minor bugs with checking existence of syn0

b002765

replaces syn0 and syn0norm with wv.syn0 and wv.syn0norm in tests and cython files

27c0a14

adds changelog

81f8cbb

initial fastText wrapper class

aa7e632

fasttext load binary data + oov vectors

c780b9b

tests for fasttext wrapper

ccf5a47

reduced memory requirements for fasttext model

708113b

annoy indexer tests for fasttext

b7de266

jayantj added 17 commits December 27, 2016 16:52

deletes training files after loading model, tests

bb6e538

doesnt match with oov words, tests

c7a5d07

more asserts while loading from fasttext model file, renames some variables

734057b

updates FastText __contains__ to return True for all words for which vectors exist

56d89e9

updates docstrings, adds comments for fasttext wrapper and tests

dc51096

adds fasttext test models

bb48663

changes setUpClass to setUp to allow python2.6 compatibility

b58dd53

updates word2vec test model files

461a6b4

python2.6 compatibility for fasttext tests

9137090

Revert "updates keyedvector load tests to use actual values"

e5ae899

This reverts commit 6e20834. Conflicts: gensim/test/test_word2vec.py

Merge branch 'develop' into fasttext

b98b40f

Conflicts: gensim/models/word2vec.py

replaces all instances of vocab and syn0 being accessed directly through word2vec instance

5eb8f75

adds fasttext tutorial notebook

27bec7b

minor doc updates

ef0e1e2

removes direct vocab access in FastText

ab07ef9

suppresses numpy overflow warning while computing fasttext hash

2f37b04

load_word2vec_format returns KeyedVector, minor refactoring

5653632

jayantj mentioned this pull request Jan 6, 2017

[MRG] Wrapper for FastText #847

Merged

tmylk changed the title ~~[WIP] Keyedvector load word2vec format~~ [WIP] [DNM] Keyedvector load word2vec format Jan 6, 2017

tmylk mentioned this pull request Jan 24, 2017

Move load and save word2vec_format out of word2vec class to KeyedVectors #1107

Merged

tmylk closed this Feb 25, 2017
