Introduction

This project is based on the first part of the Fake News Challenge (FNC-1), where the goal is "to explore how AI...might be able to combat the fake news problem." Identifying fake news is a complex task that can be broken down into several steps. A potential first step is comparing how myriad news organizations cover the same topic or, more precisely, stance detection. The goal of this project is to classify the relationship between a headline and a body of text as agree, disagree, discuss, or unrelated.

As we are working with string data, there is a lot of pre-processing and feature engineering to do.

Data Preprocessing

removing punctuation
lower casing all text
removing stop words
tokenizing
stemming
creating n-grams
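
The steps above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the notebook code: the stop word list is a toy one, `crude_stem` is a hypothetical stand-in for a real stemmer (such as the ones NLTK provides), and the function names are assumptions made for this example.

```python
import re

# Toy stop word list for illustration only; a real list (e.g. NLTK's) is much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "on"}


def crude_stem(token):
    """Strip a few common suffixes. Stand-in for a real stemmer, not Porter."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Remove punctuation, lower-case, tokenize, drop stop words, stem."""
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    tokens = text.lower().split()         # lower-case and split on whitespace
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [crude_stem(t) for t in tokens]


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = preprocess("The police are investigating the reported claims!")
bigrams = ngrams(tokens, 2)
```

Note that the sketch tokenizes before removing stop words, since the filter works on individual tokens.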

Feature Engineering

basic n-gram count ratios
TF-IDF vectorization
SVD
word embeddings with Word2Vec using the Google News Corpus pre-trained weights
sentiment features to assign polarity, using NLTK's VADER sentiment analyzer
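
As an illustration of the first feature in the list, here is one plausible reading of an n-gram count ratio: the fraction of headline n-grams that also occur in the body. The exact definition used in the notebooks may differ; `ngram_overlap_ratio` and the sample tokens are assumptions made for this sketch.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_overlap_ratio(headline_tokens, body_tokens, n):
    """Fraction of headline n-grams that also appear in the body (assumed definition)."""
    headline_grams = ngrams(headline_tokens, n)
    if not headline_grams:
        return 0.0  # headline shorter than n tokens
    body_grams = set(ngrams(body_tokens, n))
    hits = sum(1 for gram in headline_grams if gram in body_grams)
    return hits / len(headline_grams)


# Made-up, already-preprocessed tokens for illustration.
headline = ["police", "find", "mass", "grave"]
body = ["officials", "say", "police", "find", "no", "mass", "grave", "at", "site"]

# One feature per n-gram size: unigrams, bigrams, trigrams.
features = [ngram_overlap_ratio(headline, body, n) for n in (1, 2, 3)]
```

Here every headline unigram appears in the body, two of the three bigrams do, and no trigram does, so the ratio drops as n grows.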

Modeling

Completed models include:

AdaBoost
Gradient Boosting
LSTM Deep Learning

Dependencies

SciPy stack (pandas, NumPy, Matplotlib)
Scikit-Learn
NLTK
Gensim
Keras

I completed all work on a Google Compute Engine (GCE) VM. You can get up and running quickly with a Jupyter-connected VM on GCE by following my post on Medium.
Executing the preprocessing and feature engineering in one session will need ~800GB, though all data will be pickled for later use. I would, however, advise using NumPy's .npz format instead, as it has quicker read/write times.
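
A minimal sketch of saving and reloading arrays as .npz with `np.savez` and `np.load`. The array names, shapes, and file name are placeholders, not the project's actual data.

```python
import os
import tempfile

import numpy as np

# Placeholder data: 100 samples, 50 features, 4 stance classes.
features = np.random.rand(100, 50).astype(np.float32)
labels = np.random.randint(0, 4, size=100)

# Write both arrays into a single .npz archive, keyed by keyword name.
path = os.path.join(tempfile.mkdtemp(), "features.npz")
np.savez(path, features=features, labels=labels)

# Arrays are read from the archive when accessed by key.
with np.load(path) as data:
    loaded_features = data["features"]
    loaded_labels = data["labels"]
```

`np.savez_compressed` takes the same arguments and trades some write time for a smaller file.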

Future Work

Improve model performance
- Different preprocessing
- Model parameters
- Explore feature importance: which features help most, and why?
Model visualizations
Build Convolutional Neural Network
Begin real world application
- Scrape news sites, archive data
- Build a working record of sources and reliability
