Pipeline to cleanup and normalise content of txt files #210

Closed

ktagowski opened this issue Feb 10, 2022 · 1 comment

Collaborator

ktagowski commented Feb 10, 2022 (edited)

We want to prepare cleanup and normalisation methods for any noisy text, for example text produced by OCR.

Clean up issues such as: excessive newlines; sentences split across several lines; stray encoding symbols; words split into chunks.
Consider checking whether the content is written in one language or several, and optionally remove content that is not in the main language. By default, everything that is not in Polish should be removed.
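
A minimal sketch of what such a pipeline could look like, using only the Python standard library. None of this comes from the project: `clean_text`, `keep_language`, and `toy_detect` are hypothetical names made up for illustration. `toy_detect` is a deliberately naive stand-in that counts a few common Polish words; a real pipeline would presumably plug a proper language-identification library into the `detect` parameter. Rejoining words is only handled for the hyphen-at-line-end case, which is an assumption about what "words split into chunks" covers.

```python
import re
import unicodedata
from typing import Callable

# Control characters, soft hyphens, zero-width characters, BOM and the
# Unicode replacement character: typical debris after OCR or bad decoding.
_DEBRIS = re.compile(
    r"[\u0000-\u0008\u000b\u000c\u000e-\u001f\u007f\u00ad\u200b-\u200d\ufeff\ufffd]"
)

# Assumption: a tiny list of frequent Polish words, used only by toy_detect.
_PL_HINTS = {"nie", "się", "jest", "że", "oraz", "który", "dla", "przez", "jak", "ale"}


def clean_text(text: str) -> str:
    """Normalise noisy text (hypothetical helper, not project code)."""
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = _DEBRIS.sub("", text)
    # Rejoin words hyphenated across a line break: "czy-\nsty" -> "czysty".
    # Caveat: this also merges genuine hyphenated compounds broken at a line end.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Blank lines separate paragraphs; single newlines inside a paragraph
    # are treated as wrapped sentences and collapsed to spaces.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)


def toy_detect(paragraph: str) -> str:
    """Naive placeholder detector: 'pl' if enough common Polish words occur."""
    words = re.findall(r"\w+", paragraph.lower())
    hits = sum(w in _PL_HINTS for w in words)
    return "pl" if words and hits / len(words) >= 0.2 else "other"


def keep_language(text: str, detect: Callable[[str], str], lang: str = "pl") -> str:
    """Keep only the paragraphs whose detected language equals `lang`."""
    paragraphs = text.split("\n\n")
    return "\n\n".join(p for p in paragraphs if detect(p) == lang)
```

Usage would be along the lines of `keep_language(clean_text(raw), toy_detect)`. Cleaning runs first so that the detector sees whole paragraphs rather than fragments of wrapped lines. Filtering per paragraph rather than per file is a design choice of this sketch, so that a mostly Polish document does not lose everything because of one foreign passage.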

Please sync with @laugustyniak or @pedrito87 when cleaning texts after OCR.

ktagowski mentioned this issue

Add keyword extraction task to library #196

Open

4 tasks

Collaborator

laugustyniak commented Feb 13, 2023

Dropping due to inactivity and direction shifts

laugustyniak closed this as not planned
