Retrieving Embedding vector #23

Closed · pyturn opened this issue Apr 1, 2019 · 16 comments

Comments

pyturn commented Apr 1, 2019

Can anyone suggest how to get an embedding vector using BioBERT? What I am looking for is: if I give it text input, I want back an embedding vector for the sentence or embedding vectors for the words. Either of these will work for me.

jhyuklee (Member) commented Apr 1, 2019

Hi @pyturn,

For getting embedding vectors from BioBERT, see issue google-research/bert#60 in the BERT repository.

Thanks.

pyturn (Author) commented Apr 1, 2019

> Hi @pyturn,
>
> For getting embedding vectors from BioBERT, see issue google-research/bert#60 in the BERT repository.
>
> Thanks.

Thanks for responding. It's still not clear to me how to get the embeddings of words/sentences.

jhyuklee (Member) commented Apr 2, 2019

See this Python script (https://github.com/google-research/bert/blob/master/extract_features.py); you can easily adapt it to BioBERT, too.
For sentence embeddings, you can use the [CLS] token embedding of a BERT model trained on sentence classification.

Thank you.
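
For reference, a minimal sketch of reading those [CLS] vectors back out of the JSON Lines file that extract_features.py writes; the output file name is a placeholder and it assumes the script was run with the BioBERT vocab/config/checkpoint:

import json
import numpy as np

# one JSON object per input sentence, written by extract_features.py
cls_vectors = []
with open('output.jsonl') as f:
    for line in f:
        record = json.loads(line)
        # features[0] is the [CLS] token; layers[0] is the top requested layer
        cls_vectors.append(np.array(record['features'][0]['layers'][0]['values']))

print(len(cls_vectors), cls_vectors[0].shape)  # e.g. N sentences, (768,)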

pyturn (Author) commented Apr 4, 2019

> See this Python script (https://github.com/google-research/bert/blob/master/extract_features.py); you can easily adapt it to BioBERT, too.
> For sentence embeddings, you can use the [CLS] token embedding of a BERT model trained on sentence classification.
>
> Thank you.

Thanks @jhyuklee, I was finally able to get embedding vectors using BioBERT as well.

jhyuklee closed this as completed Apr 4, 2019

Santosh-Gupta commented Apr 13, 2019

Hello, I am trying to figure out a way to retrieve the sentence embedding in a programmatic way.

I am trying to do this in a notebook. So far I have cloned the repository and loaded the weights, but I don't know how to get the sentence/paragraph vector.

This is my code so far

import sys
import tensorflow as tf

!test -d bioBert_repo || git clone https://github.com/dmis-lab/biobert bioBert_repo
if 'bioBert_repo' not in sys.path:
    sys.path += ['bioBert_repo']

import extract_features

with tf.Session(graph=graph) as session:
    saver.restore(session, 'BioBert.ckpt')

I know this is based on the original BERT code. For regular BERT they have you use tf.hub, but I'm guessing the setup is pretty similar. This is my code for regular BERT

pip install bert-tensorflow

import tensorflow as tf
import tensorflow_hub as hub

import bert
from bert import run_classifier
from bert import optimization
from bert import tokenization

import pandas as pd

from tensorflow import keras
import os
import re

from tensorflow.keras import backend as K

from bert.tokenization import FullTokenizer

bert_path = "https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1"

sess = tf.Session()

bert_module = hub.Module(
  bert_path,
  trainable=True)

# Basically this is a function to convert text into a format BERT understands
def bertInputsFromText(text):
    ...

bert_inputs = bertInputsFromText("This is a test sentence")

sentence_embedding = bert_module(inputs=bert_inputs, signature="tokens", as_dict=True)[
    "pooled_output"
]
So I guess my question boils down to: what should I use as an equivalent to bert_module?
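
One possible stand-in for bert_module, sketched here under assumptions (TF1, the modeling.py from the cloned repo, placeholder paths, max sequence length 128), is to build the graph directly and restore the BioBERT checkpoint:

import tensorflow as tf
import modeling  # modeling.py from the cloned BERT/BioBERT repo

bert_config = modeling.BertConfig.from_json_file('biobert_pubmed/bert_config.json')

input_ids   = tf.placeholder(tf.int32, [None, 128], name='input_ids')
input_mask  = tf.placeholder(tf.int32, [None, 128], name='input_mask')
segment_ids = tf.placeholder(tf.int32, [None, 128], name='segment_ids')

model = modeling.BertModel(
    config=bert_config,
    is_training=False,
    input_ids=input_ids,
    input_mask=input_mask,
    token_type_ids=segment_ids,
    use_one_hot_embeddings=False)

# [CLS]-based pooled sentence vector, shape [batch_size, 768]
pooled_output = model.get_pooled_output()

saver = tf.train.Saver()
with tf.Session() as session:
    saver.restore(session, 'biobert_pubmed/biobert_model.ckpt')
    # feed ids produced by tokenization.FullTokenizer on the input sentence
    # vec = session.run(pooled_output, feed_dict={input_ids: ..., input_mask: ..., segment_ids: ...})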

pyturn (Author) commented Apr 16, 2019

Hey @Santosh-Gupta, you can use the following function to get a sentence embedding. It's the average of the token-level embeddings.

import os
import json
import numpy as np

def get_bio_bert_embedding(sentence):
  # write the sentence to a temporary input file for extract_features.py
  with open("/input.txt", "w") as outF:
    outF.write(sentence)

  os.system('python3 extract_features.py \
   --input_file=/input.txt \
   --vocab_file=/content/biobert/biobert_pubmed/vocab.txt \
   --bert_config_file=/content/biobert/biobert_pubmed/bert_config.json \
   --init_checkpoint=/content/biobert/biobert_pubmed/biobert_model.ckpt \
   --output_file=/output.jsonl')

  with open('/output.jsonl') as f:
    d = json.load(f)

  # sum the token vectors, skipping [CLS] (index 0) and [SEP] (last index)
  sentence_vector = np.zeros(768)
  for i in range(1, len(d['features']) - 1):
    sentence_vector += np.array(d['features'][i]['layers'][0]['values'])

  # average over the real tokens (excluding [CLS] and [SEP])
  number_of_tokens = len(d['features']) - 2
  sentence_vector = sentence_vector / number_of_tokens
  return sentence_vector.tolist()
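
A quick usage sketch, assuming the paths inside the function exist on your machine:

vector = get_bio_bert_embedding("The patient was treated with aspirin.")
print(len(vector))  # 768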

phosseini commented Jun 19, 2019

> Hey @Santosh-Gupta, you can use the following function to get a sentence embedding. It's the average of the token-level embeddings.
>
> [the get_bio_bert_embedding function quoted from the comment above]

Thanks for the code. I slightly modified the script and wrote the following. Assuming we have a collection of documents stored in a text file, one document per line, the code below gives a list of embeddings (embedding_vectors) for these documents:

import os
import json
import pickle
import numpy as np

# extract_features.py is the feature extraction script from the BERT repo
os.system('python3 ' + ' extract_features.py \
            --input_file=input.txt \
            --vocab_file=vocab.txt \
            --bert_config_file=bert_config.json \
            --init_checkpoint=model.ckpt-1000000 \
            --output_file=output.json')

embedding_vectors = []
print("[progress] Processed records: ")
with open('output.json') as f:
    # each line is a json which has embedding information of a single document/description
    for line in f:
        d = json.loads(line)
        # 768 is hidden size
        doc_vector = np.zeros(768)
        
        # starting from 1 to skip [CLS] and to -1 to skip [SEP]
        for i in range (1, len(d['features'])-1):
            # feature vector of the current token
            feature_vector = np.array(d['features'][i]['layers'][0]['values'])
        
            # adding to the document vector
            doc_vector = doc_vector + feature_vector
    
        # -2 is for excluding [CLS] and [SEP] tokens
        number_of_tokens = len(d['features']) - 2
    
        # since we want to compute the average of vector representations of tokens
        doc_vector = np.divide(doc_vector, number_of_tokens)
    
        embedding_vectors.append(doc_vector)

print("[log] Saving the embedding vectors into file...")
# saving the embedding vectors in a pickle file
with open("embedding_vector.pkl", "wb") as f:
    pickle.dump(embedding_vectors, f)

n = len(embedding_vectors)
dim = len(embedding_vectors[0])
print("Number of vectors: " + str(n))
print("Vector dimension: " + str(dim))

@Aasif-Multani

You can use https://github.com/hanxiao/bert-as-service; just pass the BioBERT pre-trained files when starting the server.
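
A minimal sketch of that approach, assuming the server was started with something like bert-serving-start -model_dir /path/to/biobert_pubmed -num_worker=1, pointed at the BioBERT vocab, config, and checkpoint (paths are placeholders):

# pip install bert-serving-client
from bert_serving.client import BertClient

# connects to the locally running bert-as-service server
bc = BertClient()
vectors = bc.encode(['The patient was treated with aspirin.',
                     'BRCA1 mutations increase breast cancer risk.'])
print(vectors.shape)  # e.g. (2, 768)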

@Santosh-Gupta

The huggingface BERT makes it very easy to load BioBERT and send text to get the embeddings.

@bajajahsaas

Hi @Santosh-Gupta, can you help me use huggingface BERT to extract features (sentence embeddings)? I am able to load the BioBERT pre-trained model and convert it to a PyTorch implementation. However, the latest version of the huggingface library doesn't seem to have the extract_features.py file. Am I missing something?

vikasFid commented May 5, 2020

> Hi @Santosh-Gupta, can you help me use huggingface BERT to extract features (sentence embeddings)? I am able to load the BioBERT pre-trained model and convert it to a PyTorch implementation. However, the latest version of the huggingface library doesn't seem to have the extract_features.py file. Am I missing something?

Were you able to find a solution for this?

@Santosh-Gupta

There are some community BioBERT models now on the Hugging Face hub:

https://huggingface.co/models?search=biobert
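
For example, a minimal sketch with one of those checkpoints; the model name dmis-lab/biobert-base-cased-v1.1 and mean pooling over tokens are just one reasonable choice, so check the search link above for alternatives:

# pip install transformers torch
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
model = AutoModel.from_pretrained("dmis-lab/biobert-base-cased-v1.1")

inputs = tokenizer("The patient was treated with aspirin.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# token-level embeddings: shape [1, seq_len, 768]
token_embeddings = outputs.last_hidden_state
# one simple sentence vector: the mean over all token embeddings
sentence_embedding = token_embeddings.mean(dim=1).squeeze(0)
print(sentence_embedding.shape)  # torch.Size([768])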

Suiji12 commented Mar 9, 2024

Hello, I would like to ask a question. I am currently working on RAG. After vectorizing with a general embedding model it works fine when the amount of data is small, but when the amount of data is large there are hallucination problems. I think it is necessary to build a domain-specific embedding model. Do you have a better approach?

@chenzhwsysu57

> Hello, I would like to ask a question. I am currently working on RAG. After vectorizing with a general embedding model it works fine when the amount of data is small, but when the amount of data is large there are hallucination problems. I think it is necessary to build a domain-specific embedding model. Do you have a better approach?

What specific domain are you working in?
Hallucination problems might be mitigated by using cascaded, multi-level small models when facing a large dataset.

Suiji12 commented Jun 12, 2024

> What specific domain are you working in? Hallucination problems might be mitigated by using cascaded, multi-level small models when facing a large dataset.

The domain is knowledge about mice.

lordpba commented Jun 12, 2024

chunk size?
