Word2Vec

About Project:

Word Embeddings

A word embedding is a vector representation of a word, learned so that words with similar meanings have similar vectors. The simplest way to represent words is one-hot encoding. In this technique, each word is a vector whose dimension equals the number of distinct words in the corpus (the vocabulary size), filled with 0s except for a single 1 at the position that identifies the word. The disadvantage of this technique is cost: multiplying such a vector by a weight matrix is an expensive operation in which almost all of the individual products are zero. To solve this problem, we use embeddings.
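To make the one-hot representation concrete, here is a minimal sketch in plain Python. The five-word vocabulary and the helper name `one_hot` are made up for illustration and do not come from the notebook.

```python
def one_hot(index, vocab_size):
    """Return a list of zeros with a single 1 at position `index`."""
    vec = [0] * vocab_size
    vec[index] = 1
    return vec

# Hypothetical toy vocabulary, for illustration only.
vocab = ["coffee", "tea", "water", "heart", "mind"]
word_to_index = {word: i for i, word in enumerate(vocab)}

tea_vector = one_hot(word_to_index["tea"], len(vocab))
print(tea_vector)  # [0, 1, 0, 0, 0]
```

With a realistic vocabulary of tens of thousands of words, each vector would have that many entries and only one of them would be non-zero.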

An embedding is just a fully connected layer whose weights are learned; these weights are called the embedding weights. We skip the multiplication into the embedding layer and instead read the hidden layer values directly from the weight matrix. We can do this because multiplying a one-hot encoded vector by a matrix returns the row of the matrix corresponding to the index of the active input unit.

Instead of doing the matrix multiplication, we use the weight matrix as a lookup table. We encode the words as integers; for example, "heart" is encoded as 958 and "mind" as 18094. Then, to get the hidden layer values for "heart", we just take the 958th row of the embedding matrix. This process is called an embedding lookup, and the number of hidden units is the embedding dimension.

There is nothing magical going on here. The embedding lookup table is just a weight matrix. The embedding layer is just a hidden layer. The lookup is just a shortcut for the matrix multiplication. The lookup table is trained just like any weight matrix.
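The claim that the lookup is a shortcut for the matrix multiplication can be checked with a small sketch. The 4 x 3 weight matrix and the helper names `one_hot` and `vec_mat_mul` are hypothetical and chosen only for illustration; a real embedding matrix would be much larger and learned during training.

```python
def one_hot(index, vocab_size):
    """Return a list of zeros with a single 1 at position `index`."""
    vec = [0] * vocab_size
    vec[index] = 1
    return vec

def vec_mat_mul(vec, matrix):
    """Multiply a row vector of length V by a V x D matrix (list of rows)."""
    n_cols = len(matrix[0])
    return [
        sum(vec[r] * matrix[r][c] for r in range(len(matrix)))
        for c in range(n_cols)
    ]

# Hypothetical embedding weights: 4 words, embedding dimension 3.
embedding = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

word_index = 2
via_matmul = vec_mat_mul(one_hot(word_index, len(embedding)), embedding)
via_lookup = embedding[word_index]  # the embedding lookup: just read one row

print(via_matmul == via_lookup)
```

The multiplication touches every entry of the matrix, while the lookup reads a single row, which is why the lookup is preferred.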

Word2Vec

Word2Vec is a statistical method for efficiently learning word embeddings from a text corpus. Compared with one-hot encoding, it produces much more compact representations: each word is mapped to a dense vector, and these vectors also carry semantic information about the words.

Words that show up in similar contexts, such as "coffee", "tea", and "water", will have vectors near each other. Unrelated words will be farther apart, and relationships between words can be represented by distances in the vector space.
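One common way to measure how near two word vectors are is cosine similarity; this is an assumption here, since the text above does not name a specific measure. The three-dimensional vectors below are invented for illustration and are not trained embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors, only to illustrate the comparison.
toy_vectors = {
    "coffee": [0.9, 0.8, 0.1],
    "tea": [0.8, 0.9, 0.2],
    "heart": [0.1, 0.2, 0.9],
}

sim_close = cosine_similarity(toy_vectors["coffee"], toy_vectors["tea"])
sim_far = cosine_similarity(toy_vectors["coffee"], toy_vectors["heart"])
print(sim_close > sim_far)  # True for these toy vectors
```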

There are two architectures for implementing Word2Vec:

CBOW (Continuous Bag-Of-Words)
Skip-Gram

1. CBOW (Continuous Bag-Of-Words):

The embedding is learnt by predicting the current word based on its context.
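As a rough sketch of what CBOW training examples look like, the snippet below pairs each word with the words around it. The function name `cbow_examples`, the fixed `window` parameter, and the sample sentence are assumptions made for illustration; only the data preparation is shown, not the model.

```python
def cbow_examples(tokens, window=2):
    """Build (context_words, target_word) examples for CBOW-style training."""
    examples = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        # Words before and after position i, excluding the target itself.
        context = tokens[start:i] + tokens[i + 1:i + 1 + window]
        if context:
            examples.append((context, target))
    return examples

tokens = "the quick brown fox jumps".split()
for context, target in cbow_examples(tokens, window=1):
    print(context, "->", target)
```

Each example asks the model to predict the target word from its context words.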

2. Skip-Gram

The Skip-Gram model learns the embedding by predicting the surrounding words given the current word.
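The Skip-Gram direction can be sketched the same way: every word is paired with each of its neighbours, one pair per neighbour. The function name `skipgram_pairs` and the fixed `window` size are assumptions for illustration, and the notebook may prepare its training data differently.

```python
def skipgram_pairs(tokens, window=2):
    """Build (center_word, context_word) pairs for Skip-Gram-style training."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the quick brown fox jumps".split()
for center, context in skipgram_pairs(tokens, window=1):
    print(center, "->", context)
```

Here each pair asks the model to predict one surrounding word from the current word, which is the reverse of the CBOW setup.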

Both CBOW and Skip-Gram have their own pros and cons, but Skip-Gram generally performs well, so we will implement Skip-Gram. To see the implementation, go to the Skip-Gram.ipynb notebook.

Surya-Prakash-Reddy/Word2Vec