Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] [DNM] Keyedvector load word2vec format #1078

Closed

Conversation

jayantj
Copy link
Contributor

@jayantj jayantj commented Jan 6, 2017

Includes commits from #847

The purpose of this PR is to change Word2Vec.load_word2vec_format to return a KeyedVectors instance instead of a Word2Vec instance.

The behaviour doesn't change significantly, since all the similarity operations and vector lookups that could be done with a Word2Vec instance created from a load_word2vec_format call can still be done via the KeyedVectors instance. It does change the class of the returned instance (obviously) and updates a bunch of tests to be compatible with the new method.

The FastText wrapper from #847 also required some changes to work with the updated method.

To be merged after #847 .

@tmylk tmylk changed the title [WIP] Keyedvector load word2vec format [WIP] [DNM] Keyedvector load word2vec format Jan 6, 2017
@gojomo
Copy link
Collaborator

gojomo commented Jan 6, 2017

I have some reservations about such a change.

There is a strong expectation that the result of a load... method on a type returns something of the same type – both from general practices, and in the history of this class.

Of course, prior Word2Vec models loaded via load_word2vec_format() were in fact kind-of crippled – only usable for some operations, unless doing extra custom futzing with the internals. They were more-or-less sets-of-KeyedVectors. But the transition to returning a different, more focused type seems to me something that shouldn't be silently glossed-over.

Other options could include: (1) eliminating the method entirely, or leaving it as a stub which simply fails with a warning to use KeyedVectors instead; (2) having it pass through to KeyedVectors, but log a deprecation warning and (for now) instantiate a Word2Vec model composed from that KeyedVectors – but eventually drop the direct load_word2vec_format() entirely.

Whatever is done, related methods like intersect_word2vec_format() should be handled in a sensibly analogous fashion.

@tmylk
Copy link
Contributor

tmylk commented Jan 6, 2017

Option 2 is in line with existing ways of handling direct access to model.vocab

@piskvorky
Copy link
Owner

piskvorky commented Jan 8, 2017

This whole KeyedVectors stuff deserves an explanatory blog post / notebook. I feel a lot of people will be confused by the changes and warnings that suddenly jump at them with new gensim versions (and for no immediately apparent benefit -- are the benefits and motivation succinctly summarized somewhere? linked from README?).

In general, the same is true for any deprecations. Whenever we're removing something, there should be a link to a clear text of "why" (no good why = don't even accept the change), with examples what users ought to do to become "compliant" again.

@jayantj
Copy link
Contributor Author

jayantj commented Jan 10, 2017

From what I understand, the main reason for the KeyedVecs refactoring was to separate out the set-of-vectors-with-labels from the full Word2Vec/Doc2Vec training model. These were the reasons/advantages -

  1. No more broken training models (e.g. from loading word2vec c format models, or from calling init_sims(replace=True)) wrapped inside Word2Vec instances
  2. Cleaner organization and logic, makes code reuse in the future easier
  3. Some wishlist features for word2vec/doc2vec only need the final vector sets, and don't care about the other vectors used in training and training hyperparameters - should make use of KeyedVectors rather than Word2Vec
  4. In case only the set-of-vectors are needed, they can be loaded/saved independently of the trainable models, saving the extra space that is otherwise required for keeping around the extra vectors/parameters.

@gojomo and @tmylk can probably elaborate on this/explain things better.

I don't think any of these benefits are particularly relevant to users, so I'm not sure what an explanatory blog post would contain.

@piskvorky
Copy link
Owner

piskvorky commented Jan 10, 2017

I think the post could at least explain how to adjust their old code so it doesn't produce warnings in the new releases :) With some concrete examples too, of course.

And then we could link to this notebook/post from the warning/deprecation message, because people are too busy/lazy to find it out otherwise. We want to be explicit and straightforward.

@tmylk
Copy link
Contributor

tmylk commented Jan 11, 2017

About the code changes in this PR. Option 2) suggested by @gojomo above is most preferable.

KeyedVectors.load_word2vec_format returns a KeyedVectors instance. This is needed in Fasttext and Wordrank PRs ( #847 and #1066)
Word2vec.load_word2vec_format returns a word2vec instance and shows a deprecation warning.

Was hoping to merge in wordrank wrapper this week so might take just load_word2vec_format changes from this PR

@tmylk
Copy link
Contributor

tmylk commented Jan 11, 2017

@piskvorky There is too little content for a blog post but Release notes updated with instructions and benefits. Mailing list is notified and a tweet will come out in the GMT morning.

@piskvorky
Copy link
Owner

Cool, thanks.

@tmylk
Copy link
Contributor

tmylk commented Feb 25, 2017

Merged in #1107

@tmylk tmylk closed this Feb 25, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants