slvnwhrl/GerSoMeTokenExtractor

Introduction

This repository provides a token extractor for German social media texts. The extractor can be used to reduce texts to a list of selected (e.g., tokens with a specific POS tag) and/or standardized (e.g., lemmatized) tokens. This can be useful, for instance, for topic modeling.

The code is based on a spaCy pipeline with custom components that are especially suited for German social media texts. The choice of components is partially motivated by this analysis: Evaluating Off-the-Shelf NLP Tools for German.

More specifically, it uses the following components:

  • SoMaJo for tokenization and sentence splitting
  • SoMeWeTa for POS tagging
  • TreeTagger for lemmatization

Installation

To install this repository, run the following commands:

git clone https://github.com/slvnwhrl/GerSoMeTokenExtractor.git
cd GerSoMeTokenExtractor
pip install .
# or pip install -e . for an editable install

Also make sure that you have installed all dependencies:

  • SoMeWeTa: you have to download a model here. Alternatively, you can train a model yourself. Please also have a look at the license of the models (e.g., the "German web and social media" model should only be used for research). A quick way to check that the model loads is sketched below.
  • TreeTagger: follow the instructions for the German model here.
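
As a quick sanity check before using the extractor, you can try loading the SoMeWeTa model directly. A minimal sketch using the someweta package's documented ASPTagger API; replace the model path with your download:

from someweta import ASPTagger

# Load the downloaded model; this fails early if the path is wrong
# or the file is unreadable.
tagger = ASPTagger()
tagger.load("path/to/the/model")

# Tag a pre-tokenized sentence to confirm the model works.
print(tagger.tag_sentence(["Das", "ist", "ein", "Test", "."]))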

Usage

To use the extractor, you simply have to import and instantiate the GerSoMeTokenExtractor class:

from ger_some_token_extractor import GerSoMeTokenExtractor

someweta_model_path = "path/to/the/model"

# This might take a couple of seconds to load all the components.
extractor = GerSoMeTokenExtractor(
    someweta_model_path,
    lowercase=True,
    lemmatize=True,
    remove_pos=["$,", "$(", "$."]
)

sentences = [
    "Das ist ein Test... Cool, oder?;)",
    "Guten Nacht\n\n\n🪐💫⭐️",
]

# This processes a single text.
print(extractor.extract_text(sentences[0]))
# Output: ['test', 'cool', ';)']

# This processes a list of texts.
for tokens in extractor.extract_texts(sentences):
    print(tokens)
# Output: ['test', 'cool', ';)'], ['gut', 'nacht', '🪐', '💫', '⭐️']
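
With split_sentences=True (described in the parameter list below), the extractor returns one list of tokens per sentence instead of a single flat list. A minimal sketch; the commented output is illustrative, not an actual run:

splitting_extractor = GerSoMeTokenExtractor(
    someweta_model_path,
    split_sentences=True,
    lowercase=True,
)

# One token list per sentence, roughly:
# [['das', 'ist', 'ein', 'test', '...'], ['cool', ',', 'oder', '?', ';)']]
print(splitting_extractor.extract_text(sentences[0]))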

The GerSoMeTokenExtractor can be instantiated with a couple of parameters:

  • someweta_model_path (str): Path to the model for the SoMeWeTa component.
  • split_sentences (bool): Whether sentences should be split (resulting in multiple lists, i.e., one list of tokens per sentence).
  • remove_stopwords (bool): Whether stopwords are removed from the output. Uses spaCy's default stopwords for German (case insensitive).
  • lowercase (bool): Whether the returned tokens are lowercased. Defaults to False.
  • lemmatize (bool): Whether the returned tokens are lemmatized. Defaults to False.
  • remove_tokens (list[str], optional): A list of tokens that should be excluded (case sensitive).
  • remove_pos (list[str], optional): A list of POS tags; tokens tagged with any of these tags are excluded.

Note that the available POS tags depend on the model used. For instance, SoMeWeTa's German web and social media model uses the Tiger schema.
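
For example, to keep only content-bearing tokens you could filter out function-word tags. A sketch using Tiger/STTS tags, valid only if your model uses this tag set:

# Hypothetical filter: drop punctuation ("$,", "$(", "$."), articles (ART),
# conjunctions (KON), and prepositions (APPR), all Tiger/STTS tags.
content_extractor = GerSoMeTokenExtractor(
    someweta_model_path,
    lemmatize=True,
    remove_stopwords=True,
    remove_pos=["$,", "$(", "$.", "ART", "KON", "APPR"],
)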

Credits

If you use this library, please also make sure to give credit to the authors of the libraries on which the custom spaCy components in this library are built: SoMaJo, SoMeWeTa, and TreeTagger. Also make sure that you respect the respective licenses.
