nlp-duplicate-questions-stackoverflow

Problem

Use word embeddings to find duplicate questions on StackOverflow.
This project comes from week 3 of the Natural Language Processing course on Coursera.

Solution

To solve the problem, you will use two different embedding models:

Pre-trained word vectors from Google, trained on part of the Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases. GoogleNews-vectors-negative300.bin.gz is downloaded by download_week3_resources().
Representations trained with StarSpace on a sample of StackOverflow data. You will need to train them from scratch.
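
As a rough illustration of how either set of embeddings can be used to find duplicates, the sketch below averages the word vectors of each question and ranks candidate questions by cosine similarity to the query. This is one common approach and not necessarily what week3_Embeddings.ipynb does. The function names (question_to_vec, cosine_similarity, rank_candidates) and the 3-dimensional toy_embeddings dictionary are hypothetical, chosen only for illustration; with the real data the lookup would hold the 300-dimensional Google News vectors or the StarSpace vectors, and NumPy arrays would replace the plain lists.

```python
import math


def question_to_vec(question, embeddings, dim):
    """Average the vectors of the words that have an embedding.

    Returns a zero vector when no word of the question is known.
    """
    vectors = [embeddings[w] for w in question.lower().split() if w in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def rank_candidates(question, candidates, embeddings, dim):
    """Sort candidates from most to least similar to the question."""
    q_vec = question_to_vec(question, embeddings, dim)
    scored = [
        (cosine_similarity(q_vec, question_to_vec(c, embeddings, dim)), c)
        for c in candidates
    ]
    return [c for _, c in sorted(scored, key=lambda pair: pair[0], reverse=True)]


# Toy 3-dimensional embeddings (made up for this example, not real data).
toy_embeddings = {
    "python": [1.0, 0.0, 0.0],
    "list": [0.9, 0.1, 0.0],
    "sort": [0.8, 0.2, 0.0],
    "java": [0.0, 1.0, 0.0],
    "exception": [0.0, 0.9, 0.1],
}

ranked = rank_candidates(
    "sort python list",
    ["java exception", "python sort"],
    toy_embeddings,
    dim=3,
)
print(ranked)  # the candidate closest to the query comes first
```

Averaging ignores word order, so it is only a simple baseline, but it lets both embedding models be compared with the same ranking code.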

Libraries

In this task you will need the following libraries:

StarSpace: a general-purpose model from Facebook for efficient learning of entity embeddings.
Gensim: a tool for solving various NLP-related tasks (topic modeling, text representation, ...).
NumPy: a package for scientific computing.
scikit-learn: a tool for data mining and data analysis.
NLTK: a platform for working with human language data.

Data

Questions from StackOverflow, provided by Coursera.
Pre-trained word vectors from Google, trained on part of the Google News dataset (about 100 billion words) [GoogleNews-vectors-negative300.bin.gz].
