
Add TokenClassificationEvaluator #167

Merged
lvwerra merged 32 commits into huggingface:main from the add-token-classification-evaluator branch on Jul 21, 2022

Conversation

fxmarty (Contributor) commented Jun 29, 2022

This PR adds a TokenClassificationEvaluator, and refactors evaluator.py into an evaluator/ subdirectory to avoid overly long files, following https://github.com/huggingface/transformers/tree/main/src/transformers/pipelines .

There are no changes to the Evaluator base class or to TextClassificationEvaluator; this is just refactoring.

This evaluator will still fail in some cases, discussed below.

I'd be very happy to hear propositions on how to improve this. Basically, most datasets on the Hub have the input column split into words, while the TokenClassificationPipeline expects a single string as input (there is no is_split_into_words option).

The hack I use as a workaround is to join all words into a single string, by default with " ".join(data["tokens"]), where data["tokens"] is something like ["I", "am", "an", "example", "."].

The problem with this approach is that BatchEncoding.word_ids from the tokenizer output may yield word indices that do not match the true word indices, making the mapping to the reference ner_tags impossible. For example, tokenizing Germany 's representative to the European Union with pipe.preprocess(), we may have ' and s getting different word ids, as in the issue I linked above.
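To make the mismatch concrete, here is a minimal sketch (the bert-base-cased tokenizer is an illustrative assumption, not part of this PR):

from transformers import AutoTokenizer

# illustrative tokenizer choice; any tokenizer whose pre-tokenizer splits
# punctuation into separate words shows the same mismatch
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

words = ["Germany", "'s", "representative", "to", "the", "European", "Union"]
enc = tokenizer(" ".join(words))

# BERT's pre-tokenizer splits "'s" into "'" and "s", so the encoding sees
# 8 words while the dataset's word list has 7 items: the word ids no longer
# line up with the reference ner_tags
print(enc.word_ids())
print(len(words))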

The solution chosen is to tokenize each word (i.e. each item from the dataset input column) separately, in order to retrieve the index of the first token of each word, as sketched below. As stated above, this approach may well break for languages such as Chinese or Japanese, where a token may be made of several words (in the sense of several items from the dataset input column, e.g. in https://huggingface.co/datasets/msra_ner ).
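A rough sketch of that per-word approach (a hypothetical helper, not the exact PR code; the tokenizer choice is again illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def first_token_indices(words):
    # tokenize each word separately and record, for each word, the index of
    # its first token in the concatenated sequence (special tokens excluded)
    indices, offset = [], 0
    for word in words:
        indices.append(offset)
        offset += len(tokenizer.tokenize(word))
    return indices

print(first_token_indices(["Germany", "'s", "representative"]))
# e.g. [0, 1, 3] if "'s" tokenizes into two tokens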

HuggingFaceDocBuilderDev commented Jun 29, 2022

The documentation is not available anymore as the PR was closed or merged.

ola13 (Contributor) commented Jun 30, 2022

Thanks for this @fxmarty! Indeed, tokenization of languages which don't break on spaces is a challenging task of its own; it comes up in Chinese, Japanese, and Korean, but also Thai and to some extent Arabic (and probably many other languages). Do you know how this issue is handled by models which do token classification on these specific languages? Do we have examples of such models? I think it might be helpful to see how it's typically handled before we decide on a solution.

lvwerra (Member) commented Jun 30, 2022

Hi @fxmarty, thanks so much for working on this. To make this PR a bit easier to review and to discuss the options, could you provide a few minimal examples for each case where the current evaluator works and where it does not?

In general it is ok if the user needs to preprocess the data into the right format, but it would also be great if some of the most popular formats worked out of the box (e.g. people will try conll).

Maybe the NER evaluator should have a config format to distinguish:

  • offset: the inputs are provided as a single string and the labels as a list of offset/span tuples.
  • IOB: the inputs are a list of words, and the labels as well.

The first format shouldn't pose an issue with the pipeline, right? For the second case, can't we use the tokenizer similarly to how it's done in the transformers NER example? After tokenize_and_align_labels you could in principle decode the tokens and be in the same format as offset (more or less), right?
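For concreteness, the two formats could look roughly like this (hypothetical field names, spans, and tags, just for illustration):

# "offset" format: a single string plus character-level spans
offset_example = {
    "text": "Germany's representative to the European Union",
    "labels": [(0, 7, "LOC"), (32, 46, "ORG")],
}

# "IOB" format: pre-split words with one tag per word
iob_example = {
    "tokens": ["Germany", "'s", "representative", "to", "the", "European", "Union"],
    "ner_tags": ["B-LOC", "O", "O", "O", "O", "B-ORG", "I-ORG"],
}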

For the cases where this fails, we could expect the user to transform the data into either of those two formats, and then it should also work. We can show how to use the evaluator with other formats in the documentation.

What do you think?

fxmarty (Contributor, Author) commented Jun 30, 2022

Thank you for your feedback!

Do you know how this issue is handled by models which do token classification on these specific languages? Do we have examples of such models? I think it might be helpful to see how it's typically handled before we decide on a solution.

So from what I looked up, I found some models for POS tagging in Japanese/Chinese, for example https://huggingface.co/KoichiYasuoka/bert-base-japanese-upos trained on https://huggingface.co/datasets/universal_dependencies . It's actually not an issue there, because the tokenizer has no multi-ideogram tokens. See:

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForTokenClassification

rawdata = load_dataset("universal_dependencies", "ja_modern")

tokenizer = AutoTokenizer.from_pretrained("KoichiYasuoka/bert-base-japanese-upos")
model = AutoModelForTokenClassification.from_pretrained("KoichiYasuoka/bert-base-japanese-upos")

print(rawdata["test"][1])

print("----------")

# encode from the pre-split word list
inp = rawdata["test"][1]["tokens"]
encoded = tokenizer.encode(inp, return_tensors="pt", is_split_into_words=True)
print(encoded)

# encode from the raw text, for comparison
inp = rawdata["test"][1]["text"]
encoded = tokenizer.encode(inp, return_tensors="pt")
print(encoded)

For other datasets (e.g. https://huggingface.co/datasets/xtreme/ or https://huggingface.co/datasets/wikiann/), there seem to be no models on the Hub for the Chinese/Japanese splits. I guess there are plenty of models at https://github.com/PaddlePaddle/PaddleNLP though.

offset: the inputs are provided as a single string and the labels as a list of offset/span tuples.

Looking at the token-classification datasets on the Hub, all of those on the first page have inputs as a list of words, and labels as well. Do you know of a popular dataset where labels are a list of offset/span tuples? I mentioned this case not being supported, but I could actually not find a relevant dataset for it. Maybe we don't need to support this case?

For the second case, can't we use the tokenizer similarly to how it's done in the transformers NER example?

Yes, this is probably the simplest actually. Avoid pipeline.preprocess() and use instead something similar to https://github.com/huggingface/transformers/blob/4f8361afe7b411ae2956d59a761264eef8db6ad8/examples/pytorch/token-classification/run_ner.py#L419-L453 . Then, use pipeline.forward() and pipeline.postprocess().

After tokenize_and_align_labels you could in principle decode the tokens and be in the same format as offset (more or less), right?

I could be mistaken, but I don't think so (see huggingface/transformers#16438 (comment)). It probably does not matter if we use the workflow I suggested above.

The only case left which could be an issue is if the model uses a tokenizer that may treat several ideograms as one token (e.g. 日本 can be a token), while the input data is a list of ideograms. In this case, we may see bad performance if we use the tokenizer with is_split_into_words=True (or the pipeline.preprocess() on each word I initially proposed), as I understand no multi-ideogram token will be generated; see the sketch below. For example, if the input is ["日", "本", "語", "じ", "ょ", "う", "ず"], 日本 will not be treated as a token. AFAIK there are no models on the Hub for such a case with token classification though.
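A quick sketch of why this breaks (the tokenizer choice is an illustrative assumption; whether 日本 is a single vocabulary entry depends on the model's vocab):

from transformers import AutoTokenizer

# illustrative tokenizer; loading it requires the Japanese extras (fugashi, ipadic)
tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")

# tokenizing the full string can, in principle, emit a multi-ideogram token...
print(tokenizer.tokenize("日本語"))

# ...but tokenizing item by item (which is what a word-split input amounts to)
# cannot, since each list item is tokenized independently
print([tok for item in ["日", "本", "語"] for tok in tokenizer.tokenize(item)])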

What do you think?

lvwerra (Member) commented Jul 1, 2022

Do you know of a popular dataset where labels are a list of offset/span tuples?

I think most labeling tools use this format for their exports - so it would be nice if we could add this format at some point. If it requires a ton of work, we can address it in a subsequent PR.

Yes, this is probably the simplest actually. Avoid pipeline.preprocess() and use instead something similar to https://github.com/huggingface/transformers/blob/4f8361afe7b411ae2956d59a761264eef8db6ad8/examples/pytorch/token-classification/run_ner.py#L419-L453 . Then, use pipeline.forward() and pipeline.postprocess().

If possible I would like to avoid calling pipeline methods explicitly (except __call__, obviously). The reason we chose the pipeline is that it is quite a generic abstraction that can be easily adapted to other frameworks. If we call methods like forward or postprocess, this will require somebody implementing another framework to do the same. Therefore it would be great if we could keep it as generic as possible.

See here: https://huggingface.co/docs/evaluate/main/en/custom_evaluator

AFAIK there are no models on the Hub for such a case with token classification though.

I don't think we need to worry about these cases too much now. Let's make sure the limitations are well documented.

fxmarty (Contributor, Author) commented Jul 1, 2022

Understood. Although I think it is not too critical to call preprocess, forward, and postprocess, as they should always be available with pipelines: https://huggingface.co/docs/transformers/v4.20.1/en/add_new_pipeline . What is critical is that I assumed the output of preprocess is always a transformers.BatchEncoding, which is not a standard with pipelines. So this is not good.

After tokenize_and_align_labels you could in principle decode the tokens and be in the same format as offset (more or less), right?

So I thought more about this later and, contrary to what I said above, I think what you suggest works fine. The message I linked only stated that you cannot retrieve the exact input text with .decode(), but I believe the decoded text will give the very same output if run through the tokenizer again. So, given your suggestions, I propose the following workflow (a self-contained script, just to give an idea, with some printing):

from datasets import load_dataset
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

data = load_dataset("conll2003")

model_name = "elastic/distilbert-base-uncased-finetuned-conll03-english"
model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

pipe = pipeline(task="token-classification", model=model, tokenizer=tokenizer, ignore_labels=[])

inp = data["validation"][15]

print("Original input:", inp["tokens"])
print("Length original input:", len(inp["tokens"]))
tokenized_inp = tokenizer(inp["tokens"], is_split_into_words=True)

print("Length tokenized input:", len(tokenized_inp["input_ids"]))

word_ids = tokenized_inp.word_ids(0)
print("word_ids:", word_ids)

# the pipeline gives no output for tokens with `None` word ids (special tokens)
word_ids_no_none = [word_id for word_id in word_ids if word_id is not None]

# the pipeline may output several labeled tokens for a single word; keep track
# of the index of each word's first token to match the true labels on words
index_tokens_word_start = []

for j, word_index in enumerate(word_ids_no_none):
    if j == 0 or word_index != word_ids_no_none[j - 1]:
        index_tokens_word_start.append(j)

print("Length index_tokens_word_start:", len(index_tokens_word_start))

# remove tokens with `None` word ids, i.e. special tokens added by the
# tokenizer, since we will tokenize again later on
token_indexes = [i for i in range(len(word_ids)) if word_ids[i] is not None]

to_decode = [tokenized_inp["input_ids"][i] for i in token_indexes]

decoded = tokenizer.decode(to_decode)
print("Input to pipeline:", decoded)

# sanity check: re-tokenizing the decoded text reproduces the original encoding
tokenized_decoded = tokenizer(decoded)

assert tokenized_decoded["input_ids"] == tokenized_inp["input_ids"]

res = pipe(decoded)

print("Length pipeline output:", len(res))

This prints:

Original input: ['By', 'stumps', 'Kent', 'had', 'reached', '108', 'for', 'three', '.']
Length original input: 9
Length tokenized input: 12
word_ids: [None, 0, 1, 1, 2, 3, 4, 5, 6, 7, 8, None]
Length index_tokens_word_start: 9
Input to pipeline: by stumps kent had reached 108 for three.
Length pipeline output: 10

What do you think, @lvwerra? Here we do not call any Pipeline methods explicitly.

To get the labels, we keep the same logic as in this PR, i.e. https://github.com/fxmarty/evaluate/blob/6ebf2bcaf3f8a8de2f027b122652681a3b3ac60b/src/evaluate/evaluator/token_classification.py#L212-L237
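For reference, the alignment itself amounts to something like this (a minimal sketch with hypothetical names, not the exact PR code; it reuses res, inp, and index_tokens_word_start from the script above):

# pick the pipeline prediction of each word's first token as the word-level label
predictions = [res[i]["entity"] for i in index_tokens_word_start]

# word-level references, converted from class ids to string labels for seqeval
label_names = data["validation"].features["ner_tags"].feature.names
references = [label_names[tag] for tag in inp["ner_tags"]]

assert len(predictions) == len(references)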

fxmarty force-pushed the add-token-classification-evaluator branch from 6ebf2bc to d8a5d88 on July 18, 2022, 08:23
fxmarty force-pushed the add-token-classification-evaluator branch from 5a3cb14 to fbfbec7 on July 18, 2022, 20:50
lvwerra (Member) left a comment

Awesome work - just a few minor comments. To be sure everything works as expected, you could run the parity test on a few datasets and on larger subsets.

PS: The pipeline_call should now also return the performance metrics, so you need to update the tests as well, since the performance metrics are then also in the returned dict (aa81085).

(resolved inline review comments on src/evaluate/evaluator/token_classification.py)
lvwerra (Member) left a comment

Hi @fxmarty, thanks for the changes - just a few nits. Did you run the parity tests on 2-3 datasets with a larger sample size, and check that the metric matches what some models report? Just to double-check that it works correctly not only on conll.

(resolved inline review comments on src/evaluate/evaluator/token_classification.py)
fxmarty and others added 3 commits July 20, 2022 17:51
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
lvwerra (Member) left a comment

Just added three suggestions to make it compatible with the now-merged device placement.

(resolved inline review comment on src/evaluate/evaluator/token_classification.py)
fxmarty (Contributor, Author) commented Jul 20, 2022

Thanks a lot for the review!

I also ran the parity test on philschmid/distilroberta-base-ner-wikiann with https://huggingface.co/datasets/wikiann/viewer/en/validation ; the results match.

I wanted to run sachaarbonel/bert-italian-cased-finetuned-pos with https://huggingface.co/datasets/xtreme/viewer/udpos.Italian/test , but it is trickier, as there is no run_pos.py script to compare to and this requires some changes to run_ner.py.

Looking for models to try out, I realized many have a bad label2id mapping (like "LABEL_0", etc.).

fxmarty and others added 5 commits July 20, 2022 18:48
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
lvwerra merged commit 37bd06c into huggingface:main on Jul 21, 2022