Predictions in raw data #5

Open
GuillermoJaca opened this issue Dec 1, 2020 · 6 comments

@GuillermoJaca

Hello, I am wondering how predictions on raw data can be done. This is not documented at all, and I think it's the primary use of the model.

@jhyuklee
Member

jhyuklee commented Dec 3, 2020

Hi @GuillermoJaca, what do you mean by the raw data? I think the pre-processing will depend on the type of task you want.

@GuillermoJaca
Author

I mean normal biomedical text. The issue is that there is no .predict function, so the run_ner.py file has to be customized. What is the best way to do that? Which preprocessing should I use to get the best possible performance from the model, given that my task is NER?

@mgavish

mgavish commented Dec 11, 2020

Instructions for using the repo for inference are in the README under the NER section: https://github.com/dmis-lab/biobert#user-content-named-entity-recognition-ner:~:text=You%20can%20change%20the%20arguments%20as,using%20%2D%2Ddo_train%3Dfalse%20%2D%2Ddo_predict%3Dtrue%20for%20evaluating%20test.tsv.

The bigger challenge is completing inference without using the repo, i.e., without its repo-specific functions and methods.

@abhibisht89

@GuillermoJaca For prediction you can use your fine-tuned model directly in a Hugging Face transformers pipeline; some sample code is below for your reference:

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Path to your fine-tuned BioBERT NER checkpoint
tokenizer = AutoTokenizer.from_pretrained("finetuned_model_path")
model = AutoModelForTokenClassification.from_pretrained("finetuned_model_path")

# Group sub-word pieces back into whole-word entities
nlp = pipeline(task="ner", model=model, tokenizer=tokenizer, grouped_entities=True, ignore_subwords=True)

text = "he is feeling very sick"
output = nlp(text)
print(output)

Read more on Hugging Face pipelines here:
https://huggingface.co/transformers/main_classes/pipelines.html
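
One general caveat with this approach (not specific to BioBERT): the entity names in the pipeline output come from the model config's id2label mapping, so the fine-tuned checkpoint should carry your actual NER tag names rather than generic LABEL_0/LABEL_1 entries. Continuing from the snippet above, a quick sanity check:

# The pipeline names entities via model.config.id2label, so this should print
# your NER tags (e.g. B-Disease, I-Disease, O), not generic LABEL_* entries.
print(model.config.id2label)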

@nowhyun

nowhyun commented Jan 11, 2021

@abhibisht89
Thank you for your reply.

However, if the tokenizer is specified as 'dmis-lab/biobert-v1.1', the ignore_subwords option cannot be set to True.

Is there any other way?
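
A possible workaround, depending on your transformers version: newer releases deprecate grouped_entities/ignore_subwords in favor of a single aggregation_strategy argument, which requires a fast tokenizer (use_fast=True). A minimal sketch along those lines, with finetuned_model_path as a placeholder for your checkpoint:

# Sketch assuming a recent transformers release where aggregation_strategy
# supersedes grouped_entities/ignore_subwords.
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("finetuned_model_path", use_fast=True)
model = AutoModelForTokenClassification.from_pretrained("finetuned_model_path")

# aggregation_strategy="first" merges sub-word pieces into word-level entities,
# covering what grouped_entities=True + ignore_subwords=True used to do.
nlp = pipeline(task="ner", model=model, tokenizer=tokenizer, aggregation_strategy="first")
print(nlp("The patient was diagnosed with hereditary breast cancer."))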

@cutejue

cutejue commented Mar 29, 2021

Hello, I wonder why the labels in the NER task are simple BIO tags, whereas in the raw dataset (e.g. NCBI) the labels can be SpecificDisease, Modifier, and so on.
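
For what it's worth, the single-class B/I/O tags suggest the preprocessing collapses the typed NCBI-disease annotations (SpecificDisease, DiseaseClass, CompositeMention, Modifier) into one entity class before fine-tuning. A rough sketch of that conversion (an assumption for illustration, not the repo's actual preprocessing script):

# Minimal sketch: collapse typed NCBI-disease annotations into plain B/I/O tags.
def to_bio(tokens_with_types):
    """tokens_with_types: list of (token, entity_type_or_None) pairs."""
    tags, prev = [], None
    for token, etype in tokens_with_types:
        if etype is None:
            tags.append("O")
        else:
            # New mention starts with B; continuation of the same mention gets I.
            tags.append("I" if prev == etype else "B")
        prev = etype
    return tags

print(to_bio([("breast", "SpecificDisease"), ("cancer", "SpecificDisease"), ("is", None)]))
# -> ['B', 'I', 'O']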
