-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] [DNM] Keyedvector load word2vec format #1078
[WIP] [DNM] Keyedvector load word2vec format #1078
Conversation
This reverts commit 6e20834. Conflicts: gensim/test/test_word2vec.py
Conflicts: gensim/models/word2vec.py
…ugh word2vec instance
I have some reservations about such a change. There is a strong expectation that the result of a Of course, prior Word2Vec models loaded via Other options could include: (1) eliminating the method entirely, or leaving it as a stub which simply fails with a warning to use KeyedVectors instead; (2) having it pass through to KeyedVectors, but log a deprecation warning and (for now) instantiate a Word2Vec model composed from that KeyedVectors – but eventually drop the direct Whatever is done, related methods like |
Option 2 is in line with existing ways of handling direct access to model.vocab |
This whole KeyedVectors stuff deserves an explanatory blog post / notebook. I feel a lot of people will be confused by the changes and warnings that suddenly jump at them with new gensim versions (and for no immediately apparent benefit -- are the benefits and motivation succinctly summarized somewhere? linked from README?). In general, the same is true for any deprecations. Whenever we're removing something, there should be a link to a clear text of "why" (no good why = don't even accept the change), with examples what users ought to do to become "compliant" again. |
From what I understand, the main reason for the KeyedVecs refactoring was to separate out the set-of-vectors-with-labels from the full Word2Vec/Doc2Vec training model. These were the reasons/advantages -
@gojomo and @tmylk can probably elaborate on this/explain things better. I don't think any of these benefits are particularly relevant to users, so I'm not sure what an explanatory blog post would contain. |
I think the post could at least explain how to adjust their old code so it doesn't produce warnings in the new releases :) With some concrete examples too, of course. And then we could link to this notebook/post from the warning/deprecation message, because people are too busy/lazy to find it out otherwise. We want to be explicit and straightforward. |
About the code changes in this PR. Option 2) suggested by @gojomo above is most preferable.
Was hoping to merge in wordrank wrapper this week so might take just |
@piskvorky There is too little content for a blog post but Release notes updated with instructions and benefits. Mailing list is notified and a tweet will come out in the GMT morning. |
Cool, thanks. |
Merged in #1107 |
Includes commits from #847
The purpose of this PR is to change
Word2Vec.load_word2vec_format
to return aKeyedVectors
instance instead of a Word2Vec instance.The behaviour doesn't change significantly, since all the similarity operations and vector lookups that could be done with a Word2Vec instance created from a
load_word2vec_format
call can still be done via theKeyedVectors
instance. It does change the class of the returned instance (obviously) and updates a bunch of tests to be compatible with the new method.The FastText wrapper from #847 also required some changes to work with the updated method.
To be merged after #847 .