This repository contains the solution to the first task of the Elimination Test - the first stage of BEST Coding Marathon 2023.
In the internet age, comments and chat messages are increasingly filtered for offensive content. Many websites do this to maintain a certain level of culture, in part because advertisers do not want their brand to be associated with "inappropriate environments". Nevertheless, the filters currently in use are far from perfect: to limit the number of "false positives", their creators often settle for limited solutions. Furthermore, internet users who want to bypass censorship make matters more difficult by using tricks such as replacing certain letters with similar-looking characters, or writing in a deliberately ungrammatical way. As a result, the seemingly trivial problem of removing offensive content becomes significantly more difficult (for example, a chat filter introduced some time ago in one of the popular games did not allow the word night because of its alleged similarity to a very unpleasant word).
The goal of the task is to implement a filter that replaces letters of Polish swear words with asterisks (*).
As our input is a sentence, we first split it into words using the split() method. Then we apply some preprocessing to each word.
We apply the following preprocessing steps in the given order:
- Turn all letters into lowercase, e.g.
  KuRwA -> kurwa
- Replace sequences of characters with their equivalents, e.g.
  kurvva -> kurwa
  shmata -> szmata
  jebanom -> jebaną
- Replace single characters with their equivalents, e.g.
  qvrw@ -> kurwa
  d21wk4 -> dziwka
  $pi3®dal@© -> spierdalac
- Remove all non-alphabetic characters, e.g.
  kurwa! -> kurwa
  d.z,i.w^k.#a?? -> dziwka
  &spier%%dal;ac -> spierdalac
- Collapse repeated characters into one, e.g.
  kurwwwwwwwwwwa -> kurwa
  dziiiwkkaaaa -> dziwka
  spieeerdaalac -> spierdalac
- Lemmatize the words to their base form so that inflections are removed, e.g.
  kurwami -> kurwa
  dziwce -> dziwka
  hujowi -> huj
  spierdalając -> spierdalać
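The first five steps can be sketched in plain Python as below. This is a minimal illustration, not the repository's actual code: the names SEQUENCE_MAP, CHAR_MAP and preprocess are hypothetical, and the mapping tables are tiny assumed samples chosen only to cover the examples above. Lemmatization is left out because it depends on a language model.

```python
import re

# Hypothetical sample tables (assumptions); the real project tables are likely larger.
SEQUENCE_MAP = {"vv": "w", "sh": "sz"}
CHAR_MAP = {
    "q": "k", "v": "u", "@": "a", "4": "a", "3": "e",
    "$": "s", "1": "i", "2": "z", "®": "r", "©": "c",
}
CHAR_TABLE = str.maketrans(CHAR_MAP)


def preprocess(word: str) -> str:
    """Normalize one word; lemmatization is intentionally omitted."""
    # 1. Lowercase.
    word = word.lower()
    # 2. Replace multi-character sequences first, so "vv" becomes "w"
    #    before the single-character step could turn it into "uu".
    for sequence, replacement in SEQUENCE_MAP.items():
        word = word.replace(sequence, replacement)
    # 3. Replace single look-alike characters.
    word = word.translate(CHAR_TABLE)
    # 4. Drop everything that is not a letter (isalpha keeps Polish letters).
    word = "".join(ch for ch in word if ch.isalpha())
    # 5. Collapse runs of the same character into a single character.
    return re.sub(r"(.)\1+", r"\1", word)


print(preprocess("d21wk4"))          # dziwka
print(preprocess("d.z,i.w^k.#a??"))  # dziwka
```

Note that step 5 also collapses legitimate double letters, so the blacklist would need to be stored in the same collapsed form for comparisons to line up.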
After preprocessing, we can check if the word is a swear word.
Algorithm for checking if a word is a swear word:
For every blacklisted word:
- If the Levenshtein distance between the word and the blacklisted word is less than 2 (that is, at most one edit apart), the word is a swear word.
- If the blacklisted word is a substring of the word, the word is a swear word.
- If the similarity between the word and the blacklisted word is greater than 0.95, the word is a swear word.
If the word is a swear word, we replace all letters with asterisks. Otherwise, we leave the word unchanged.
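The three rules can be sketched as follows. This is an illustrative sketch under stated assumptions, not the project's implementation: levenshtein and is_swear_word are hypothetical names, and the similarity score is passed in as an optional callable so that the example runs without a language model installed.

```python
from typing import Callable, Iterable, Optional


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter word in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def is_swear_word(
    word: str,
    blacklist: Iterable[str],
    similarity: Optional[Callable[[str, str], float]] = None,  # assumed hook
    threshold: float = 0.95,
) -> bool:
    """Apply the three rules to an already preprocessed word."""
    for bad in blacklist:
        if levenshtein(word, bad) < 2:   # rule 1: at most one edit apart
            return True
        if bad in word:                  # rule 2: blacklisted word is a substring
            return True
        if similarity is not None and similarity(word, bad) > threshold:
            return True                  # rule 3: similarity above the threshold
    return False


print(levenshtein("kitten", "sitting"))  # 3
```

The substring rule is the most aggressive of the three: a short blacklist entry will match any longer word that happens to contain it.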
Explanations:
The Levenshtein distance is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. We use the similarity function from the spaCy library, which returns a score that is typically between 0 and 1, where 1 means that the words are identical. We set the threshold to 0.95 because we want to avoid false positives.
After filtering, we need to join the words back into a sentence. We do this by joining the words with a space character.
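The overall split, filter and join flow might look like the sketch below. The names mask_letters and filter_sentence are hypothetical, and the predicate passed in stands for the preprocessing and matching steps described earlier. Leaving punctuation unmasked is an assumption based on the statement that only letters are replaced.

```python
from typing import Callable


def mask_letters(word: str) -> str:
    """Replace every letter with an asterisk, leaving other characters as they are."""
    return "".join("*" if ch.isalpha() else ch for ch in word)


def filter_sentence(sentence: str, is_swear_word: Callable[[str], bool]) -> str:
    """Split on whitespace, mask the flagged words, and join with single spaces."""
    words = sentence.split()
    return " ".join(mask_letters(w) if is_swear_word(w) else w for w in words)


# Placeholder predicate for demonstration only (an assumption, not the real check).
def demo_check(word: str) -> bool:
    return word.lower().strip("!?.,") == "badword"


print(filter_sentence("this is a  badword!", demo_check))  # this is a *******!
```

Because split() without arguments collapses runs of whitespace, the output does not preserve the original spacing between words.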
- Clone the repository with
  git clone https://github.com/kjedrasz2137/EliminationTest.git
  or download the zip file
- Install dependencies with
  pip install -r requirements.txt
- Run the program with
  python src/main.py
Type your text in the input field and click the button. The result will be displayed in the output field. Please be aware that this can take a while, depending on the length of the text. Additionally, you can add your own words to the list of swear words by appending them to the blacklist field. Please note that the words must be separated by commas.