
new options for load/intersect_word2vec_format #817

Closed
wants to merge 6 commits into from

Conversation


@gojomo gojomo commented Aug 9, 2016

load_word2vec_format:

  • limit: read only the first N vectors from the file
  • datatype: force a smaller datatype (risky/slow, but can save memory)
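The datatype option trades precision for memory. A minimal numpy sketch of the idea (the variable names here are illustrative, not gensim's internals):

```python
import numpy as np

# A toy "vector matrix" like the one load_word2vec_format builds: one row per word.
vocab_size, dims = 1000, 300
vectors = np.random.rand(vocab_size, dims).astype(np.float32)

# Forcing a smaller datatype (as the new `datatype` argument allows) halves
# memory, at the cost of precision and slower downstream math on some platforms.
smaller = vectors.astype(np.float16)

print(vectors.nbytes)   # 1000 * 300 * 4 bytes = 1200000
print(smaller.nbytes)   # 1000 * 300 * 2 bytes = 600000
```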

intersect_word2vec_format:

  • lockf: set whether imported vectors are locked against changes (0.0) or left trainable (1.0)
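The lockf value feeds the per-word lock array that scales training updates. A rough numpy sketch of the effect (names and the update rule here are a simplified illustration, not gensim's training code):

```python
import numpy as np

# One lock factor per word: 0.0 freezes the vector, 1.0 leaves it trainable.
lockf = np.array([0.0, 1.0])
vectors = np.ones((2, 3), dtype=np.float32)

# A (fake) gradient step: each word's update is scaled by its lock factor,
# which is how lockf=0.0 keeps imported vectors unchanged during training.
gradient = np.full((2, 3), 0.5, dtype=np.float32)
vectors += gradient * lockf[:, np.newaxis]

print(vectors[0])  # unchanged: [1. 1. 1.]
print(vectors[1])  # updated:   [1.5 1.5 1.5]
```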

Also: a tiny fix in Text8Corpus, ensuring the final chunk fragment gets the same unicode treatment as all other created lines.

@gojomo gojomo changed the title new load_word2vec_format, intersect_word2vec_format options new options for load/intersect_word2vec_format Aug 9, 2016

gojomo commented Aug 10, 2016

@piskvorky @tmylk Some small improvements for loading word2vec.c-format vectors, and a tiny fix for an issue I hit when comparing with fastText. Please ack for merge before this drifts into conflict?

@@ -1120,7 +1135,8 @@ def add_word(word, weights):
                 weights = fromstring(fin.read(binary_len), dtype=REAL)
                 add_word(word, weights)
             else:
-                for line_no, line in enumerate(fin):
+                for line_no in xrange(vocab_size):
piskvorky (Owner)

Why this change?

gojomo (Collaborator, Author)

Without it, any user-specified limit cap on vocab_size won't be respected for binary=False (text-format) reads.

(Without a limit, the only way this change would prevent a line from being read is if the count declared at the top of the file is smaller than the number of lines that follow... and in that case, reading exactly the declared number of vectors is still arguably reasonable, and also matches the behavior of the binary=True branch.)
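A minimal sketch of why the xrange-style loop respects the cap while a plain `for line in fin` would not (a standalone toy parser, not gensim's code):

```python
from io import StringIO

def read_vectors(fin, limit=None):
    # The first line of the word2vec text format declares "<vocab_size> <dims>".
    vocab_size, dims = map(int, fin.readline().split())
    if limit:
        vocab_size = min(vocab_size, limit)
    vectors = {}
    # Looping over range(vocab_size) stops at the (possibly capped) declared
    # count; iterating the file directly would keep reading every remaining line.
    for _ in range(vocab_size):
        parts = fin.readline().rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

data = StringIO("3 2\na 0.1 0.2\nb 0.3 0.4\nc 0.5 0.6\n")
print(sorted(read_vectors(data, limit=2)))  # ['a', 'b']
```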

@piskvorky piskvorky (Owner) commented Aug 10, 2016
Yes; I meant why the line split.

islice & enumerate seem more idiomatic and also safer (in case vocab_size > len(fin)).
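For comparison, the islice idiom suggested here caps the read without erroring on a short file (an illustrative sketch of the trade-off, not the PR's code):

```python
from io import StringIO
from itertools import islice

fin = StringIO("a 1\nb 2\n")  # only two lines available
vocab_size = 5                # declared count larger than the file

# islice stops silently at EOF, so a short file yields fewer vectors rather
# than raising: safer, but it hides a mismatch with the declared count.
lines = [line.rstrip() for line_no, line in enumerate(islice(fin, vocab_size))]
print(len(lines))  # 2
```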

gojomo (Collaborator, Author)

At the time, the xrange approach mirrored the way the binary branch worked, where changing vocab_size to limit already had the desired effect. Weighing it against islice: it's probably better for a mismatch to raise an error than to silently accept fewer than the declared count of vectors.

But thinking about this made me realize the code wasn't yet handling a limit greater than the available count (now fixed), and that the binary path could busy-hang on reaching EOF while reading its token char-by-char. Both the binary and text paths now recognize when a read returns nothing, meaning the file ended before the declared count was reached, and raise an EOFError.
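The binary-path fix described above can be sketched as an EOF-aware byte-by-byte token read (a toy standalone version; gensim's real loop differs in detail, e.g. newline handling):

```python
from io import BytesIO

def read_token(fin):
    # Read the word one byte at a time, up to the space separator.
    word = []
    while True:
        ch = fin.read(1)
        if ch == b'':
            # An empty read means EOF: without this check, a loop waiting for
            # a separator that never arrives could spin forever.
            raise EOFError("file ended before the declared vector count was read")
        if ch == b' ':
            return b''.join(word)
        word.append(ch)

print(read_token(BytesIO(b'hello rest')))  # b'hello'
try:
    read_token(BytesIO(b'truncated'))
except EOFError:
    print("EOFError raised")
```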


piskvorky commented Aug 10, 2016

The change is desirable and code looks fine to me, thanks.

Final decision and merge up to @tmylk (I know he's planning a new release, don't know whether there's a "merge freeze").

tmylk pushed a commit that referenced this pull request Aug 14, 2016

tmylk commented Aug 14, 2016

Resolved conflicts, squashed trivial commits and merged in 15ee57f

@tmylk tmylk closed this Aug 14, 2016