
VM01 - Detecting Prejudice and Bias in Hotel Reviews

This organization contains all repositories for the MSE project VM01, "Detecting Prejudice and Bias in Hotel Reviews".

Abstract

The amount of prejudice and hate speech found online is increasing rapidly and is not expected to slow down soon. Such harmful statements can be found anywhere, from social media posts to reviews. Being exposed to such prejudice can lead to serious societal problems. Currently, online platforms rely heavily on human content moderation to mitigate this problem. However, depending on individuals to moderate this ever-growing volume of online content is not feasible. Thus, these platforms look to machine learning for solutions that help them address these problems.

This project aims to detect prejudicial statements in hotel reviews that refer to a nationality. For this purpose, a dataset of 1.4M hotel reviews was provided. However, since the data includes no labels indicating whether a review contains a prejudiced statement, it cannot be used directly to train a supervised classifier.

A pipeline was designed to detect prejudiced reviews by breaking the overall task into smaller sub-tasks. The workflow first checks whether a review contains a nationality reference, using a Bi-LSTM model trained on a supervised nationality-detection dataset; this model achieved validation and test F1 scores of around 90%. A second model then decides whether a review contains any form of hateful statement, which is done by fine-tuning a large language model (LM) on hate speech detection challenges such as GermEval-2018, HASOC-2019 and GermEval-2021. To increase the performance on these hate speech datasets and improve generalization to the hotel domain, a domain-specific LM called HotelBERT was trained on the hotel review dataset. The fine-tuned HotelBERT model outperformed the respective challenge winner on two of the three datasets. The resulting pipeline indicated that it might be able to detect harmful statements at the intersection of hate speech and prejudice. However, thoroughly evaluating its performance would require a labelled prejudice dataset, which is not available at the time of writing.
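The following sketches illustrate how such a pipeline could be assembled. They are minimal examples under assumed hyperparameters, frameworks, and model names, not the project's actual implementation. First, stage one as a Bi-LSTM binary classifier for nationality references, here in Keras (vocabulary size, sequence length, and layer widths are placeholders):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE = 20_000  # assumed vocabulary size
MAX_LEN = 128        # assumed maximum review length in tokens

# Bi-LSTM binary classifier: does the review reference a nationality?
nationality_model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, 128),
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # P(nationality referenced)
])
nationality_model.compile(optimizer="adam", loss="binary_crossentropy",
                          metrics=["accuracy"])  # F1 computed separately on val/test
```

Stage two could then be fine-tuned with the Hugging Face Trainer API. Here "bert-base-german-cased" merely stands in for the domain-adapted HotelBERT checkpoint, and train_texts/train_labels are assumed to hold examples from the GermEval/HASOC challenges:

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")  # placeholder for HotelBERT
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-german-cased", num_labels=2)  # 0 = other, 1 = hateful

class HateSpeechDataset(torch.utils.data.Dataset):
    """Wraps tokenized challenge data for the Trainer."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

args = TrainingArguments(output_dir="hotelbert-hatespeech",
                         num_train_epochs=3,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=HateSpeechDataset(train_texts, train_labels))
trainer.train()
```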

The task of prejudice detection is hard to model because it is highly subjective, which makes obtaining a high-quality dataset challenging. However, the task can be broken down into sub-tasks with existing datasets to better model the problem without relying too heavily on an annotated prejudice dataset, as sketched below.
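Put together, the decomposition amounts to a conjunction of the two sub-task models: a review is flagged as a prejudice candidate only if it both references a nationality and is classified as hateful. A minimal sketch, reusing the two models from the examples above (the tokenization helper, the 0.5 threshold, and the meaning of class 1 are all assumptions):

```python
def is_prejudice_candidate(review: str) -> bool:
    """Flag reviews at the intersection of nationality references and hate speech."""
    # Stage 1: skip reviews without a nationality reference.
    # encode() is an assumed helper mapping text to the Bi-LSTM's input ids.
    if nationality_model.predict(encode(review))[0, 0] < 0.5:
        return False
    # Stage 2: hate speech classification with the fine-tuned model.
    inputs = tokenizer(review, return_tensors="pt", truncation=True)
    logits = model(**inputs).logits
    return logits.argmax(-1).item() == 1  # class 1 = hateful (assumed)
```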
