Fraudulent-News-Detection

This project aims to help detect fraudulent news, which has been on the rise on social media and online news platforms.

Specifically, the project consists of the following parts:

Preliminaries:
- Exploratory analysis: examination of data and exclusion of non-useful records
- Pre-processing: removal of stop words (e.g., "the", "a", "an", and "in")
- Text visualization: bar plots/word clouds of frequent words for each feature column
- Sentiment analysis: creation of sentiment features (i.e., positive, negative, neutral) using polarity scores
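The pre-processing and sentiment steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the stop-word list is just the four examples given above, and the `sentiment_label` helper with its 0.05 threshold is an assumption about how a polarity score might be turned into the three categories.

```python
import re

# Tiny illustrative stop-word list (the four examples from the README);
# a real run would use a fuller list from an NLP library.
STOP_WORDS = {"the", "a", "an", "in"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [tok for tok in tokens if tok not in stop_words]


def sentiment_label(polarity, threshold=0.05):
    """Map a polarity score in [-1, 1] to a sentiment category.

    The +/-0.05 cut-off is an assumed convention, not taken from the project.
    """
    if polarity >= threshold:
        return "positive"
    if polarity <= -threshold:
        return "negative"
    return "neutral"


print(remove_stop_words("The senator spoke in a crowded hall"))
# ['senator', 'spoke', 'crowded', 'hall']
print(sentiment_label(0.6), sentiment_label(-0.3), sentiment_label(0.0))
# positive negative neutral
```

The polarity score itself would come from a sentiment analyzer; only the mapping from score to category is shown here.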
Modeling:
We first built logistic regression models using different sets of features:
- Title Sentiments
- Text Sentiments
- Title Sentiments & Text Sentiments
- Title (transformed into embedding vectors using Word2Vec embedding model with skip-gram method)
- Text (transformed into embedding vectors using Word2Vec embedding model with skip-gram method)
- Title & Text (concatenated Title and Text vector representations)
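As a rough sketch of how the Title & Text feature could be assembled: each document is reduced to one vector and the two vectors are joined end to end. The toy 3-dimensional vectors below stand in for a trained Word2Vec model, and averaging the word vectors is an assumption, since the README does not say how word vectors are pooled into a document vector.

```python
# Toy 3-dimensional word vectors standing in for a trained Word2Vec model.
# These values are made up purely for illustration.
TOY_VECTORS = {
    "election": [0.2, 0.4, 0.0],
    "fraud": [0.6, 0.0, 0.2],
    "report": [0.1, 0.1, 0.1],
}
DIM = 3


def document_vector(tokens, vectors=TOY_VECTORS, dim=DIM):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]


def title_text_features(title_tokens, text_tokens):
    """Concatenate the title vector and the text vector into one feature row."""
    return document_vector(title_tokens) + document_vector(text_tokens)


features = title_text_features(["election", "fraud"], ["report", "unknown"])
print(len(features))  # 6
```

The concatenated row has twice the embedding dimension, so a classifier sees title and text information side by side.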
  
  Since the combined Title & Text feature performed well, we used it to train non-linear models, specifically a Decision Tree and a Neural Network (MLP).
  
  Lastly, we tried a different approach: whereas the models mentioned so far were based on the skip-gram word embedding method, we wanted to see whether performance differs with another word embedding technique, Continuous Bag of Words (CBOW).
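To make the difference between the two embedding methods concrete, the sketch below builds the training examples each one learns from. It is an illustration only: the helper names and the window size of 1 are hypothetical, and an actual Word2Vec implementation adds sampling and optimization details not shown here.

```python
def context_window(tokens, i, window):
    """Return the tokens within `window` positions of index i, excluding i."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def skipgram_pairs(tokens, window=1):
    """Skip-gram: each center word predicts each of its context words."""
    return [(center, ctx)
            for i, center in enumerate(tokens)
            for ctx in context_window(tokens, i, window)]


def cbow_examples(tokens, window=1):
    """CBOW: the context words together predict the center word."""
    return [(context_window(tokens, i, window), center)
            for i, center in enumerate(tokens)]


tokens = ["fake", "news", "spreads"]
print(skipgram_pairs(tokens))
# [('fake', 'news'), ('news', 'fake'), ('news', 'spreads'), ('spreads', 'news')]
print(cbow_examples(tokens))
# [(['news'], 'fake'), (['fake', 'spreads'], 'news'), (['news'], 'spreads')]
```

Skip-gram produces one example per (center, context) pair, while CBOW produces one example per position with the whole context as input.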
Model Comparison: As the dataset was fairly balanced (i.e., 22,850 fake records and 21,416 real records), we used accuracy as the evaluation metric. Among the models trained with skip-gram and the model trained with CBOW, the highest-performing models on the validation dataset were the logistic regression and the MLP with the combined Title & Text feature constructed using skip-gram. Further, we introduced a new dataset from Kaggle containing fake and real news records to test the models' performance, where the logistic regression classifier achieved the highest accuracy.
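The reasoning behind choosing accuracy can be checked with a short calculation: with the class counts reported above, always predicting the larger class would score only about 52%, so accuracy is not inflated by class imbalance. The `accuracy` helper and the example labels below are hypothetical and serve only to illustrate the metric.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Class counts reported in the README.
n_fake, n_real = 22850, 21416
majority_baseline = max(n_fake, n_real) / (n_fake + n_real)
print(round(majority_baseline, 3))  # 0.516

# Hypothetical labels and predictions (1 = fake, 0 = real).
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```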

Below are the folders and files created for this project:

Data: This folder contains the data we used in the project
- Fake.csv: ISOT data containing records of fraudulent news
- True.csv: ISOT data containing records of real news
- kaggle_dataset.csv: data containing both fraudulent and real news records, used for testing the model performance
Fake_News_code.ipynb: This is the notebook that performs everything from the preliminaries to model comparison
Fake_News_code.html: HTML export of the notebook for a quick view

For the complete methodologies, results, and discussion, please see Model Analysis to Provide Aid in Detecting Fake News.pdf.

Sayaka-K/Fraudulent-News-Detection
