Skip to content

Using word embedding to find duplicate questions from StackOverflow. Week 3 course of Natural Language Processing course from Coursera.

Notifications You must be signed in to change notification settings

vgp314/nlp-duplicate-questions-stackoverflow

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 

Repository files navigation

nlp-duplicate-questions-stackoverflow

Problem

  • Using word embedding to find duplicate questions from StackOverflow.
  • Week 3 course of Natural Language Processing course from Coursera.

Solution

To solve the problem, you will use two different models of embeddings:

  • Pre-trained word vectors from Google which were trained on a part of Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases. GoogleNews-vectors-negative300.bin.gz will be downloaded in download_week3_resources().
  • Representations using StarSpace on StackOverflow data sample. You will need to train them from scratch.

Libraries

In this task you will you will need the following libraries:

StarSpace — a general-purpose model for efficient learning of entity embeddings from Facebook Gensim — a tool for solving various NLP-related tasks (topic modeling, text representation, ...) Numpy — a package for scientific computing. scikit-learn — a tool for data mining and data analysis. Nltk — a platform to work with human language data.

Data

  • Questions from StackOverflow provided by Coursera
  • Pre-trained word vectors from Google which were trained on a part of Google News dataset (about 100 billion words) [GoogleNews-vectors-negative300.bin.gz]

About

Using word embedding to find duplicate questions from StackOverflow. Week 3 course of Natural Language Processing course from Coursera.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published