
How to get the word embedding after pre-training? #60

Closed
mfxss opened this issue Nov 6, 2018 · 9 comments

Comments

@mfxss

mfxss commented Nov 6, 2018

Hi,
I am excited about this great model, and I want to get the word embeddings. Where should I find them in the output, or should I change the code to do this?
Thanks,
Yuguang

@jacobdevlin-google
Contributor

If you want to get the contextual embeddings (like ELMo), see the section of the README on extracting fixed feature vectors (extract_features.py).

If you want the actual word embeddings, the word->id mapping is just the index of the row in vocab.txt, and the embedding matrix is in bert_model.ckpt with the variable name bert/embeddings/word_embeddings.
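For example, here is a minimal sketch of reading that matrix straight out of the checkpoint (the paths and the 'hello' lookup are hypothetical, and it assumes a TensorFlow version where tf.train.load_checkpoint is available; adjust to your setup):

import tensorflow as tf

# Paths into an unpacked pre-trained model directory (hypothetical locations)
VOCAB_FILE = 'uncased_L-12_H-768_A-12/vocab.txt'
CKPT_FILE = 'uncased_L-12_H-768_A-12/bert_model.ckpt'

# The word->id mapping is just the row index in vocab.txt
with open(VOCAB_FILE, encoding='utf-8') as f:
    vocab = {token.strip(): idx for idx, token in enumerate(f)}

# Read the [vocab_size, hidden_size] embedding matrix from the checkpoint
reader = tf.train.load_checkpoint(CKPT_FILE)
word_embeddings = reader.get_tensor('bert/embeddings/word_embeddings')

# Look up the (non-contextual) vector for one vocabulary token
vector = word_embeddings[vocab['hello']]
print(vector.shape)  # (768,) for a base-size model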

@mfxss
Author

mfxss commented Nov 6, 2018

I downloaded your released model chinese_L-12_H-768_A-12. In vocab.txt, I found some tokens such as
[unused1], [CLS], [SEP], [MASK], <S>, <T>.
What do these tokens mean?

@jacobdevlin-google
Contributor

The [CLS], [SEP] and [MASK] tokens are used as described in the paper and README. The [unused] tokens were not used in our model and are randomly initialized.

@mfxss
Author

mfxss commented Nov 6, 2018

What is the training data for chinese_L-12_H-768_A-12, and what is its size?

@jacobdevlin-google
Contributor

It's Chinese Wikipedia with both Traditional and Simplified characters.

@imgarylai

Hello @mfxss,
Not sure if you still have a problem getting the word embeddings from BERT. I implemented a BERT embedding library that lets you get word embeddings in a programmatic way.

https://github.com/imgarylai/bert-embedding

Because I'm working closely with the mxnet & gluonnlp team, my implementation uses mxnet and gluonnlp. However, I am trying to implement it in other frameworks as well.

Hope my work can help you.
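For reference, a sketch of how the library is typically used, based on a reading of its README; the exact API may have changed, so check the repository:

from bert_embedding import BertEmbedding

# Each input string is one sentence; the call returns, per sentence,
# the WordPiece tokens and one vector per token.
sentences = ['BERT gives contextual embeddings for each token.']

bert_embedding = BertEmbedding()        # loads a pre-trained model
results = bert_embedding(sentences)

tokens, vectors = results[0]
print(tokens[0], vectors[0].shape)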

@rainorangelemon

rainorangelemon commented Jan 4, 2020

Hey guys, if you don't want to install an extra module, here is an example:

import tensorflow as tf

# Path to a BERT SavedModel directory (TF 2.x SavedModel format)
BERT_PATH = 'HOME_DIR/bert_en_uncased_L-12_H-768_A-12'

# Load the SavedModel and find the word-embedding variable by name
imported = tf.saved_model.load(BERT_PATH)

for i in imported.trainable_variables:
    if i.name == 'bert_model/word_embeddings/embeddings:0':
        embeddings = i

And embeddings is the word-embedding tensor that you want!
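As with the checkpoint variable above, each row of this variable corresponds to one line of vocab.txt, so a lookup would look roughly like this (token_id here is a hypothetical row index):

# token_id is the row index of the token in vocab.txt
vector = embeddings.numpy()[token_id]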

@arjunrajanna

arjunrajanna commented Aug 19, 2020

Hi @jacobdevlin-google, thanks for the pointers. I see that the output of extract_features.py gives subword representations. I'm sure I'm missing something, but how can we get a word (not subword) representation instead? Thanks in advance for your help!

@mathshangw

Hi @jacobdevlin-google, thanks for the pointers. I see that the output of extract_features.py gives subword representations. I'm sure I'm missing something, but how can we get a word (not subword) representation instead? Thanks in advance for your help!

Excuse me, did you find a solution for word (not subword) representations, please?
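One common way to get word-level vectors is to pool the WordPiece pieces back into words, e.g. by averaging. A hedged sketch, assuming you already have the per-token vectors (for instance from the extract_features.py output); the merge_subwords helper below is hypothetical:

import numpy as np

def merge_subwords(tokens, vectors):
    # Average vectors of WordPiece continuation pieces ('##...') into word-level vectors.
    words, word_vecs, current = [], [], []
    for tok, vec in zip(tokens, vectors):
        if tok.startswith('##') and words:
            # Continuation piece: attach it to the current word
            words[-1] += tok[2:]
            current.append(vec)
        else:
            # A new word starts: flush the previous one
            if current:
                word_vecs.append(np.mean(current, axis=0))
            words.append(tok)
            current = [vec]
    if current:
        word_vecs.append(np.mean(current, axis=0))
    return words, word_vecs

# Example: ['play', '##ing', 'pia', '##no'] -> ['playing', 'piano'] with averaged vectors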
