Can't get word embedding #37

Closed
happypanda5 opened this issue Jul 7, 2019 · 4 comments

@happypanda5

Hi, I am trying to get a word embedding vector from BioBERT and compare it with the word embedding vector I get from BERT.

However, I haven't been successful in running BioBERT.

I downloaded the weights from release v1.1-pubmed and, after unzipping them into a folder, ran the following code:

```python
out = open('prepoutput.json', 'w')

import os

os.system('python3 "/content/biobert/extract_features.py" \
  --input_file= "/content/biobert/sample_text.txt" \
  --vocab_file= "/content/biobert_v1.1_pubmed/vocab.txt" \
  --bert_config_file= "/content/biobert_v1.1_pubmed/bert_config.json" \
  --init_checkpoint= "/content/biobert_v1.1_pubmed/model.ckpt.index" \
  --output_file= "/content/prepoutput.json"')
```

The output is "256" and the file "prepoutput.json" is empty.

Please guide me.

Unfortunately, my attempts at converting the weights for use with PyTorch weren't successful either.
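
For anyone hitting the same symptom: os.system returns the raw wait status, so "256" corresponds to an exit code of 1, i.e. the script failed. A plausible corrected invocation is sketched below; the key assumptions (not confirmed in this thread) are that --init_checkpoint should point to the checkpoint prefix model.ckpt rather than the model.ckpt.index file, and that there should be no space between each flag's = and its value:

```python
import subprocess

# Hypothetical corrected invocation of extract_features.py: pass the
# checkpoint prefix (model.ckpt), not the .index file, and attach each
# flag's value directly after '='. check=True raises on failure instead
# of silently returning a nonzero status.
subprocess.run([
    "python3", "/content/biobert/extract_features.py",
    "--input_file=/content/biobert/sample_text.txt",
    "--vocab_file=/content/biobert_v1.1_pubmed/vocab.txt",
    "--bert_config_file=/content/biobert_v1.1_pubmed/bert_config.json",
    "--init_checkpoint=/content/biobert_v1.1_pubmed/model.ckpt",
    "--output_file=/content/prepoutput.json",
], check=True)
```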

@jhyuklee
Member

Hi @happypanda5,
Sorry for the late response. Maybe this comment in #23 can help.
Thanks.

@futong

futong commented Jul 26, 2019

Hi @jhyuklee,
I would also like to get word embeddings. I followed your advice in #23 (comment) and obtained all the word embeddings of a sentence.
But the same word at different positions has a different contextual embedding.
If I want a single embedding per word, what should I do? Should the input contain only one word per line? Or something else?
Looking forward to your reply.

@izuna385

izuna385 commented Jul 26, 2019

I think you can try out, for example,
https://github.com/huggingface/pytorch-transformers
Give it vocab.txt, the PyTorch-converted BERT weights, and your sentences.
You can use BERT's last layer, or the average vector of the 12 + 1 layers, or something else to get contextualized word embeddings.
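
For example, with a BioBERT checkpoint already converted to PyTorch, a minimal sketch along these lines might work (the model directory is a hypothetical path, and the output tuple layout should be checked against the installed pytorch-transformers version):

```python
import torch
from pytorch_transformers import BertModel, BertTokenizer

# Hypothetical path to a BioBERT checkpoint converted to PyTorch format.
model_dir = "/content/biobert_v1.1_pubmed_pt"

tokenizer = BertTokenizer.from_pretrained(model_dir)
# output_hidden_states=True makes the model return all hidden states.
model = BertModel.from_pretrained(model_dir, output_hidden_states=True)
model.eval()

tokens = tokenizer.tokenize("[CLS] The treatment reduced tumor size . [SEP]")
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

with torch.no_grad():
    outputs = model(input_ids)
    hidden_states = outputs[-1]  # tuple: embedding layer + 12 encoder layers

last_layer = hidden_states[-1]                       # (1, seq_len, 768)
avg_layers = torch.stack(hidden_states).mean(dim=0)  # average of 12 + 1 layers
```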

@jhyuklee
Member

jhyuklee commented Aug 8, 2019

Hi @futong,
the extract_features.py script gives you the embeddings of the last k layers, as defined by the "layers" input argument (see

flags.DEFINE_string("layers", "-1,-2,-3,-4", "")

), and the position/segment/wordpiece embeddings are already included in the first layer.
Thanks.
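
For illustration, extract_features.py writes one JSON object per line, with per-token vectors for each requested layer. A minimal sketch for reading it back (field names follow the upstream BERT script and should be verified against your own output):

```python
import json

# Read extract_features.py output: one JSON object per input example.
with open('prepoutput.json') as f:
    for line in f:
        example = json.loads(line)
        for feature in example['features']:
            token = feature['token']
            # feature['layers'] has one entry per requested layer,
            # e.g. indices -1, -2, -3, -4.
            last_layer = feature['layers'][0]['values']
            print(token, len(last_layer))  # 768 dimensions for BERT-Base
```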

jhyuklee closed this as completed on Aug 8, 2019.