Study and characterization of sarcasm in Reddit messages - Data Mining course project, Computer Science Engineering.

Nicolas-Francisco/Sarcastic-messages-analysis

Sarcastic messages analysis

Context

Reddit is a social network where users upload all kinds of content and interact through the comments on each post. The site is organized into subreddits, which are forums dedicated to a specific topic, such as science, politics, or music. By convention on Reddit, a user who writes a comment that should be read sarcastically appends the tag "/s" to it, which removes any ambiguity and signals that the comment should not be taken seriously.
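The "/s" convention can be turned into a label with a few lines of code. The following is a minimal sketch, not the project's own preprocessing: the function name `split_sarcasm_tag`, the regular expression, and the sample comments are assumptions made for illustration.

```python
import re
from typing import Tuple

# Assumed pattern: a "/s" at the very end of the comment, preceded by
# whitespace (or the start of the text) and optionally followed by
# trailing whitespace or punctuation.
SARCASM_TAG = re.compile(r"(?:^|\s)/s[\s.!?]*$", re.IGNORECASE)


def split_sarcasm_tag(comment: str) -> Tuple[str, bool]:
    """Return (text_without_tag, is_sarcastic) for a single comment."""
    match = SARCASM_TAG.search(comment)
    if match is None:
        return comment.strip(), False
    # Drop the tag so a model cannot simply learn to look for "/s".
    return comment[: match.start()].strip(), True


# Hypothetical sample comments.
examples = [
    "Great, another Monday. /s",
    "I really enjoyed this album.",
    "See https://example.com/s for details",
]
for text, is_sarcastic in map(split_sarcasm_tag, examples):
    print(is_sarcastic, "|", text)
```

Removing the tag before any modelling matters here, because otherwise the label would leak into the text being classified.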

Sarcastic messages analysis

Using the Sarcasm on Reddit and 1 million reddit comments datasets in this context, the main objectives of this project are:

  • to discover whether there are patterns that identify sarcastic comments from their text, and to predict the nature of a comment (sarcastic or not).
  • to determine how context-dependent the sarcasm patterns are.
  • to address one of the main problems of natural language processing: interpreting sarcasm when there are no paraverbal cues and only the text itself is available.

Along with the objectives already mentioned, the purpose of this project is to answer the following questions:

  • Is it possible to predict whether a message is sarcastic from its content alone? What is the best way to predict it?
  • Are the sarcasm patterns constant within different domains? Does the topic influence the classification process?
  • Do the patterns differ greatly between different domains?

The main file of the project, Hito3.ipynb, and its web page, Hito3.html, are contained in the Sarcastic-messages-analysis > Hito 3 folder of this repository.

About

This project was carried out through successive Hitos (Spanish for "milestones"). Each Hito adds progress to the development of the project, such as the study of the datasets used, or the use and improvement of data mining techniques (classification, clustering, etc.).

Hito 1

Contains the first approach to the study of sarcastic comments: an exploration of the text data that covers data cleansing, hot words and bag of words, the most frequent subreddits, etc. The data exploration was done using RMarkdown and Python's pandas library.

The main file Hito1.Rmd and its web page Hito1.html are contained in the Sarcastic-messages-analysis > Hito 1 folder.
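As a rough illustration of this kind of exploration, the sketch below builds a bag of words and counts the most frequent subreddits. It uses only the Python standard library rather than the RMarkdown and pandas code found in Hito1.Rmd, and the sample records and the field names `subreddit` and `comment` are hypothetical, not the datasets' actual schema.

```python
import re
from collections import Counter

# Hypothetical sample records; the field names are assumptions.
comments = [
    {"subreddit": "politics", "comment": "Oh sure, that will totally work."},
    {"subreddit": "music", "comment": "This album is great."},
    {"subreddit": "politics", "comment": "Sure, great plan."},
]


def tokenize(text):
    """Lowercase the text and keep only alphabetic tokens (basic cleansing)."""
    return re.findall(r"[a-z']+", text.lower())


# Bag of words: token counts accumulated over all comments.
bag_of_words = Counter(
    token for row in comments for token in tokenize(row["comment"])
)

# Most frequent subreddits: one count per comment.
subreddit_counts = Counter(row["subreddit"] for row in comments)

print(bag_of_words.most_common(3))
print(subreddit_counts.most_common(1))
```

Comparing such counts between the sarcastic and non-sarcastic comments is one simple way to surface candidate hot words.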

Hito 2

In Hito 2 the first main questions were formulated, and the first experiment with data mining methods was run using Python's scikit-learn library.

The main file Hito2.ipynb and its web page Hito2.html are contained in the Sarcastic-messages-analysis > Hito 2 folder.

Hito 3

This is the latest version of the project. It contains all of the experiments, which use the sentence transformers library and Python's scikit-learn library, along with the final conclusions.

The main file Hito3.ipynb and its web page Hito3.html are contained in the Sarcastic-messages-analysis > Hito 3 folder.

Authors

Credits

  • Professors: Felipe Bravo and Bárbara Poblete

Computer Sciences Department
Faculty of Physical and Mathematical Sciences
University of Chile
