datasets: add possibility to use custom preprocessing function for HIPE-2022 #2708

stefan-it · 2022-04-06T13:59:18Z

Hi,

this PR adds the possibility to use a custom preprocessing function for the HIPE-2022 dataset.

This would allow, for example, custom de-hyphenation or label normalization to be implemented in the preprocessing function.


stefan-it · 2022-04-06T14:29:38Z

If the dataset has already been downloaded and prepared, please make sure that the old base path is deleted before loading it with a custom preprocessing function.

The following code shows how to use a custom preprocessing function:
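Deleting the old base path can be scripted. The following is a minimal sketch, not part of this PR: `remove_cached_dataset` is a hypothetical helper, and the folder shown in the commented-out call is only an assumption about where the prepared corpus is stored, so check the actual location for your flair version first.

```python
import shutil
from pathlib import Path


def remove_cached_dataset(base_path: Path) -> bool:
    """Delete a previously prepared dataset folder so it is rebuilt on the next load.

    Returns True if a folder was removed, False if nothing existed at base_path.
    """
    if not base_path.is_dir():
        return False
    shutil.rmtree(base_path)
    return True


# Hypothetical location (assumption); verify where your installation stores the corpus.
# remove_cached_dataset(Path.home() / ".flair" / "datasets" / "ner_hipe_2022")
```

Run it once before constructing the corpus with a new `preproc_fn`.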

from flair.datasets import NER_HIPE_2022

from pathlib import Path

def own_prepare_corpus(
    file_in: Path, file_out: Path, eos_marker: str, document_separator: str, add_document_separator: bool
):
    with open(file_in, "rt") as f_p:
        lines = f_p.readlines()

    with open(file_out, "wt") as f_out:
        # Add missing newline after header
        f_out.write(lines[0] + "\n")

        for line in lines[1:]:
            if line.startswith(" \t"):
                # Workaround for empty tokens
                continue

            line = line.strip()

            # Add "real" document marker
            if add_document_separator and line.startswith(document_separator):
                f_out.write("-DOCSTART- O\n\n")

            f_out.write(line + "\n")

            if eos_marker in line:
                f_out.write("\n")
    
    print("Own function used, awesome!!!")

corpus = NER_HIPE_2022(dataset_name="ajmc", language="en", preproc_fn=own_prepare_corpus)

stefan-it · 2022-04-07T09:22:21Z

Here's a real-world example that performs de-hyphenation of the HIPE-2020, NewsEye and SONAR datasets using a custom preprocessing function:

https://github.com/dbmdz/clef-hipe/blob/main/experiments/clef-hipe-2022/utils.py
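To give a rough idea of what a de-hyphenation step inside a preprocessing function might look like, here is a small, self-contained sketch. It is not the code from the linked `utils.py`. It assumes that a word broken at a line end is marked by a trailing `¬` on its first part, and `dehyphenate` and `HYPHEN_MARKER` are names made up for this illustration.

```python
from typing import List, Optional, Tuple

HYPHEN_MARKER = "¬"  # assumption: marks a word that was broken at a line end


def dehyphenate(rows: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge each token ending in HYPHEN_MARKER with the token that follows it.

    rows holds (token, label) pairs; a merged token keeps the label of its first part.
    """
    merged: List[Tuple[str, str]] = []
    pending: Optional[Tuple[str, str]] = None  # (token prefix, label) awaiting its continuation
    for token, label in rows:
        if pending is not None:
            # Glue the continuation onto the stored prefix and keep the prefix's label.
            token, label = pending[0] + token, pending[1]
            pending = None
        if token.endswith(HYPHEN_MARKER):
            pending = (token[: -len(HYPHEN_MARKER)], label)
        else:
            merged.append((token, label))
    if pending is not None:
        # A trailing marker without a continuation: keep the prefix unchanged.
        merged.append(pending)
    return merged
```

A custom `preproc_fn` could parse each sentence's lines into such pairs, call `dehyphenate`, and write the merged rows to `file_out`. Keeping the first part's label is a design choice for this sketch; other strategies are possible.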

alanakbik · 2022-04-10T11:57:41Z

Thanks for adding this @stefan-it!

stefan-it added 2 commits April 6, 2022 15:57

datasets: add possibility to use custom preprocessing function for HIPE-2022

4fbfbe1

datasets: fix mypy error for HIPE-2022 preprocessing function

71f231b

stefan-it added 2 commits April 6, 2022 16:49

datasets: revert self from HIPE-2022 preprocessing fn

71d03ad

datasets: fix preprocessing function handling in HIPE-2022

2f846b8

alanakbik merged commit 1fe18be into master Apr 10, 2022

alanakbik deleted the hipe-2022-preprocessing-fn branch April 10, 2022 11:57
