Retrieving Embedding vector #23

Closed · pyturn opened this issue Apr 1, 2019 · 16 comments

Comments

pyturn commented Apr 1, 2019

Can anyone suggest how to get an embedding vector using BioBERT? What I am looking for is: if I give it text input, I want back an embedding vector for the sentence or embedding vectors for the words. Either of these will work for me.

jhyuklee (Member) commented Apr 1, 2019

Hi @pyturn,

For getting embedding vectors from BioBERT, see issue google-research/bert#60 in the BERT repository.

Thanks.

pyturn (Author) commented Apr 1, 2019

> Hi @pyturn,
>
> For getting embedding vectors from BioBERT, see issue google-research/bert#60 in the BERT repository.
>
> Thanks.

Thanks for responding. It's still not clear to me how to get the embeddings of words/sentences.

jhyuklee (Member) commented Apr 2, 2019

See this Python script (https://github.com/google-research/bert/blob/master/extract_features.py); you can easily adapt it to BioBERT, too.
For sentence embeddings, you can use the [CLS] token embedding of a BERT model trained on sentence classification.

Thank you.
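
For reference, a minimal sketch of reading those [CLS] vectors back out of the JSON Lines file that extract_features.py writes; the output file name is a placeholder and it assumes the script was run with the BioBERT vocab/config/checkpoint:

import json
import numpy as np

# one JSON object per input sentence, written by extract_features.py
cls_vectors = []
with open('output.jsonl') as f:
    for line in f:
        record = json.loads(line)
        # features[0] is the [CLS] token; layers[0] is the top requested layer
        cls_vectors.append(np.array(record['features'][0]['layers'][0]['values']))

print(len(cls_vectors), cls_vectors[0].shape)  # e.g. N sentences, (768,)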

pyturn (Author) commented Apr 4, 2019

> See this Python script (https://github.com/google-research/bert/blob/master/extract_features.py); you can easily adapt it to BioBERT, too.
> For sentence embeddings, you can use the [CLS] token embedding of a BERT model trained on sentence classification.
>
> Thank you.

Thanks @jhyuklee, I was finally able to get embedding vectors using BioBERT as well.

jhyuklee closed this as completed Apr 4, 2019

Santosh-Gupta commented Apr 13, 2019

Hello, I am trying to figure out a way to retrieve the sentence embedding in a programmatic way.

I am trying to do this in a notebook. So far I have cloned the repository and loaded the weights, but I don't know how to get the sentence/paragraph vector.

This is my code so far

import sys
import tensorflow as tf

!test -d bioBert_repo || git clone https://github.com/dmis-lab/biobert bioBert_repo
if 'bioBert_repo' not in sys.path:
    sys.path += ['bioBert_repo']

import extract_features

with tf.Session(graph=graph) as session:
    saver.restore(session, 'BioBert.ckpt')

I know this is based on the original BERT code. For regular BERT they have you use tf.hub, but I'm guessing the setup is pretty similar. This is my code for regular BERT

pip install bert-tensorflow

import tensorflow as tf
import tensorflow_hub as hub

import bert
from bert import run_classifier
from bert import optimization
from bert import tokenization

import pandas as pd

from tensorflow import keras
import os
import re

from tensorflow.keras import backend as K

from bert.tokenization import FullTokenizer

bert_path = "https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1"

sess = tf.Session()

bert_module = hub.Module(
  bert_path,
  trainable=True)

# Basically this is a function to convert text into a format BERT understands
def bertInputsFromText(text):
    ...

bert_inputs = bertInputsFromText("This is a test sentence")

sentence_embedding = bert_module(inputs=bert_inputs, signature="tokens", as_dict=True)[
    "pooled_output"
]
So I guess my question boils down to: what should I use as an equivalent to bert_module?
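
One possible stand-in for bert_module, sketched here under assumptions (TF1, the modeling.py from the cloned repo, placeholder paths, max sequence length 128), is to build the graph directly and restore the BioBERT checkpoint:

import tensorflow as tf
import modeling  # modeling.py from the cloned BERT/BioBERT repo

bert_config = modeling.BertConfig.from_json_file('biobert_pubmed/bert_config.json')

input_ids   = tf.placeholder(tf.int32, [None, 128], name='input_ids')
input_mask  = tf.placeholder(tf.int32, [None, 128], name='input_mask')
segment_ids = tf.placeholder(tf.int32, [None, 128], name='segment_ids')

model = modeling.BertModel(
    config=bert_config,
    is_training=False,
    input_ids=input_ids,
    input_mask=input_mask,
    token_type_ids=segment_ids,
    use_one_hot_embeddings=False)

# [CLS]-based pooled sentence vector, shape [batch_size, 768]
pooled_output = model.get_pooled_output()

saver = tf.train.Saver()
with tf.Session() as session:
    saver.restore(session, 'biobert_pubmed/biobert_model.ckpt')
    # feed ids produced by tokenization.FullTokenizer on the input sentence
    # vec = session.run(pooled_output, feed_dict={input_ids: ..., input_mask: ..., segment_ids: ...})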

pyturn (Author) commented Apr 16, 2019

Hey @Santosh-Gupta, you can use the following function to get a sentence embedding. It's the average of the token-level embeddings.

import os
import json
import numpy as np

def get_bio_bert_embedding(sentence):
  # write the sentence to a temporary input file for extract_features.py
  with open("/input.txt", "w") as outF:
    outF.write(sentence)

  os.system('python3 extract_features.py \
   --input_file=/input.txt \
   --vocab_file=/content/biobert/biobert_pubmed/vocab.txt \
   --bert_config_file=/content/biobert/biobert_pubmed/bert_config.json \
   --init_checkpoint=/content/biobert/biobert_pubmed/biobert_model.ckpt \
   --output_file=/output.jsonl')

  with open('/output.jsonl') as f:
    d = json.load(f)

  # sum the token vectors, skipping [CLS] (index 0) and [SEP] (last index)
  sentence_vector = np.zeros(768)
  for i in range(1, len(d['features']) - 1):
    sentence_vector += np.array(d['features'][i]['layers'][0]['values'])

  # average over the real tokens (excluding [CLS] and [SEP])
  number_of_tokens = len(d['features']) - 2
  sentence_vector = sentence_vector / number_of_tokens
  return sentence_vector.tolist()
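
A quick usage sketch, assuming the paths inside the function exist on your machine:

vector = get_bio_bert_embedding("The patient was treated with aspirin.")
print(len(vector))  # 768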

phosseini commented Jun 19, 2019

> Hey @Santosh-Gupta, you can use the following function to get a sentence embedding. It's the average of the token-level embeddings.
>
> [the get_bio_bert_embedding function quoted from the comment above]

Thanks for the code. I slightly modified the script and wrote the following. Assuming we have a collection of documents stored in a text file, one document per line, the code below gives a list of embeddings (embedding_vectors) for these documents:

import os
import json
import pickle
import numpy as np

# extract_features.py is the feature extraction script from the BERT repo
os.system('python3 ' + ' extract_features.py \
            --input_file=input.txt \
            --vocab_file=vocab.txt \
            --bert_config_file=bert_config.json \
            --init_checkpoint=model.ckpt-1000000 \
            --output_file=output.json')

embedding_vectors = []
print("[progress] Processed records: ")
with open('output.json') as f:
    # each line is a json which has embedding information of a single document/description
    for line in f:
        d = json.loads(line)
        # 768 is hidden size
        doc_vector = np.zeros(768)
        
        # starting from 1 to skip [CLS] and to -1 to skip [SEP]
        for i in range (1, len(d['features'])-1):
            # feature vector of the current token
            feature_vector = np.array(d['features'][i]['layers'][0]['values'])
        
            # adding to the document vector
            doc_vector = doc_vector + feature_vector
    
        # -2 is for excluding [CLS] and [SEP] tokens
        number_of_tokens = len(d['features']) - 2
    
        # since we want to compute the average of vector representations of tokens
        doc_vector = np.divide(doc_vector, number_of_tokens)
    
        embedding_vectors.append(doc_vector)

print("[log] Saving the embedding vectors into file...")
# saving the embedding vectors in a pickle file
with open("embedding_vector.pkl", "wb") as f:
    pickle.dump(embedding_vectors, f)

n = len(embedding_vectors)
dim = len(embedding_vectors[0])
print("Number of vectors: " + str(n))
print("Vector dimension: " + str(dim))

@Aasif-Multani

You can use https://github.com/hanxiao/bert-as-service; just pass the BioBERT pre-trained files when starting the server.
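
A minimal sketch of that approach, assuming the server was started with something like bert-serving-start -model_dir /path/to/biobert_pubmed -num_worker=1, pointed at the BioBERT vocab, config, and checkpoint (paths are placeholders):

# pip install bert-serving-client
from bert_serving.client import BertClient

# connects to the locally running bert-as-service server
bc = BertClient()
vectors = bc.encode(['The patient was treated with aspirin.',
                     'BRCA1 mutations increase breast cancer risk.'])
print(vectors.shape)  # e.g. (2, 768)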

@Santosh-Gupta

The huggingface BERT makes it very easy to load BioBERT and send text to get the embeddings.

@bajajahsaas

Hi @Santosh-Gupta, can you help me use huggingface BERT to extract features (sentence embeddings)? I am able to load the BioBERT pre-trained model and convert it to a PyTorch implementation. However, the latest version of the huggingface library doesn't seem to have the extract_features.py file. Am I missing something?

vikasFid commented May 5, 2020

> Hi @Santosh-Gupta, can you help me use huggingface BERT to extract features (sentence embeddings)? I am able to load the BioBERT pre-trained model and convert it to a PyTorch implementation. However, the latest version of the huggingface library doesn't seem to have the extract_features.py file. Am I missing something?

Were you able to find a solution for this?

@Santosh-Gupta

There are some community BioBERT models now on the Hugging Face hub:

https://huggingface.co/models?search=biobert
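
For example, a minimal sketch with one of those checkpoints; the model name dmis-lab/biobert-base-cased-v1.1 and mean pooling over tokens are just one reasonable choice, so check the search link above for alternatives:

# pip install transformers torch
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
model = AutoModel.from_pretrained("dmis-lab/biobert-base-cased-v1.1")

inputs = tokenizer("The patient was treated with aspirin.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# token-level embeddings: shape [1, seq_len, 768]
token_embeddings = outputs.last_hidden_state
# one simple sentence vector: the mean over all token embeddings
sentence_embedding = token_embeddings.mean(dim=1).squeeze(0)
print(sentence_embedding.shape)  # torch.Size([768])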

Suiji12 commented Mar 9, 2024

Hello, I would like to ask a question. I am currently working on RAG. After vectorizing with a general embedding model it works fine when the amount of data is small, but when the amount of data is large there are hallucination problems. I think it is necessary to build a domain-specific embedding model. Do you have a better approach?

@chenzhwsysu57

> Hello, I would like to ask a question. I am currently working on RAG. After vectorizing with a general embedding model it works fine when the amount of data is small, but when the amount of data is large there are hallucination problems. I think it is necessary to build a domain-specific embedding model. Do you have a better approach?

What specific domain are you working in?
Hallucination problems might be mitigated by using cascaded, multi-level small models when facing a large dataset.

Suiji12 commented Jun 12, 2024

> What specific domain are you working in? Hallucination problems might be mitigated by using cascaded, multi-level small models when facing a large dataset.

The domain is knowledge about mice.

lordpba commented Jun 12, 2024

chunk size?
