datasets: add possibility to use custom preprocessing function for HIPE-2022 #2708

Merged: 4 commits, Apr 10, 2022

stefan-it
Member

Hi,

this PR adds the possibility to use a custom preprocessing function for the HIPE-2022 dataset.

This would allow, for example, custom de-hyphenation or label normalization to be implemented in the preprocessing function.

@stefan-it
Member Author

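If the dataset has already been downloaded and preprocessed, and a custom preprocessing function is used afterwards, please make sure that the old base path is deleted first. Otherwise the previously preprocessed files are still present on disk.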
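A minimal sketch of that cleanup step, using only the standard library. The helper name `remove_cached_dataset` is hypothetical, and the folder to pass in depends on the `base_path` used when the corpus was first created; the location mentioned in the comment is an assumption, not something stated in this PR.

```python
import shutil
import tempfile
from pathlib import Path


def remove_cached_dataset(dataset_dir: Path) -> bool:
    """Delete a previously prepared dataset folder.

    Returns True if a folder was removed, False if there was nothing to delete.
    """
    if dataset_dir.is_dir():
        shutil.rmtree(dataset_dir)
        return True
    return False


# Assumption: with default settings the data may live somewhere below
# Path.home() / ".flair" / "datasets"; check your own base_path before deleting.
# The demo below only touches a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    stale = Path(tmp) / "ner_hipe_2022"
    (stale / "ajmc").mkdir(parents=True)
    first = remove_cached_dataset(stale)   # folder exists, gets removed
    second = remove_cached_dataset(stale)  # nothing left to remove
```

After the old folder is gone, the next instantiation of the corpus should run the preprocessing function again.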

The following code shows how to use a custom preprocessing function:

from flair.datasets import NER_HIPE_2022

from pathlib import Path

def own_prepare_corpus(
    file_in: Path, file_out: Path, eos_marker: str, document_separator: str, add_document_separator: bool
):
    with open(file_in, "rt") as f_p:
        lines = f_p.readlines()

    with open(file_out, "wt") as f_out:
        # Add missing newline after header
        f_out.write(lines[0] + "\n")

        for line in lines[1:]:
            if line.startswith(" \t"):
                # Workaround for empty tokens
                continue

            line = line.strip()

            # Add "real" document marker
            if add_document_separator and line.startswith(document_separator):
                f_out.write("-DOCSTART- O\n\n")

            f_out.write(line + "\n")

            if eos_marker in line:
                f_out.write("\n")
    
    print("Own function used, awesome!!!")

corpus = NER_HIPE_2022(dataset_name="ajmc", language="en", preproc_fn=own_prepare_corpus)
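As a further illustration of the label normalization use case mentioned above, here is a hedged sketch of a second function with the same five-parameter signature. The name `normalizing_prepare_corpus`, the `LABEL_MAP` contents, and the assumption that the label sits in the second tab-separated column are all hypothetical; the demo input is a toy three-column file, not the real HIPE-2022 layout.

```python
import tempfile
from pathlib import Path

# Hypothetical mapping for illustration only; real fixes depend on the dataset.
LABEL_MAP = {"B-PERS": "B-pers", "I-PERS": "I-pers"}


def normalizing_prepare_corpus(
    file_in: Path, file_out: Path, eos_marker: str, document_separator: str, add_document_separator: bool
):
    with open(file_in, "rt", encoding="utf-8") as f_in:
        lines = f_in.readlines()

    with open(file_out, "wt", encoding="utf-8") as f_out:
        # Write the header followed by an empty line
        f_out.write(lines[0].rstrip("\n") + "\n\n")

        for line in lines[1:]:
            line = line.rstrip("\n")

            # Skip blank lines and lines with an empty token
            if not line.strip() or line.startswith(" \t"):
                continue

            # Comment lines are copied as-is, optionally preceded by a document marker
            if line.startswith("#"):
                if add_document_separator and line.startswith(document_separator):
                    f_out.write("-DOCSTART- O\n\n")
                f_out.write(line + "\n")
                continue

            # Assumption: column 0 is the token, column 1 is the label to normalize
            columns = line.split("\t")
            if len(columns) > 1:
                columns[1] = LABEL_MAP.get(columns[1], columns[1])
            f_out.write("\t".join(columns) + "\n")

            # An empty line ends the sentence
            if eos_marker in line:
                f_out.write("\n")


# Toy demo on a temporary file
with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp) / "in.tsv", Path(tmp) / "out.tsv"
    src.write_text(
        "TOKEN\tLABEL\tMISC\n# document_id = doc1\nAjax\tB-PERS\tEndOfSentence\n",
        encoding="utf-8",
    )
    normalizing_prepare_corpus(src, dst, "EndOfSentence", "# document_id", True)
    result = dst.read_text(encoding="utf-8")
```

Such a function would be passed in the same way as above, via `preproc_fn=normalizing_prepare_corpus`.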

@stefan-it
Member Author

stefan-it commented Apr 7, 2022

Here's a real-world example that performs de-hyphenation of the HIPE-2020, NewsEye and SONAR datasets using a custom preprocessing function:

https://github.com/dbmdz/clef-hipe/blob/main/experiments/clef-hipe-2022/utils.py

@alanakbik
Collaborator

Thanks for adding this @stefan-it!

@alanakbik alanakbik merged commit 1fe18be into master Apr 10, 2022
@alanakbik alanakbik deleted the hipe-2022-preprocessing-fn branch April 10, 2022 11:57