Word Embedding is just a vector representation of a word and words with similar meaning have nearly same representation.
Inorder to represent words, we generally use One-Hot encoding. In this technique, we will have a vector of dimension number
of words in the corpus or text which is filled with 0 except one location where we will have 1, which represents the word.
The disadvantage of this technique is we will have costly matrix multiplications which will result in mostly zero
valued hidden outputs. To solve this problem, we use embeddings.
Embeddings are just a fully connected layer that is learnt and it's weights are called embedding weights. We skip the
multiplication into the embedding layer by instead directly grabbing the hidden layer values from the weight matrix. Why??
We can do this because the multiplication of a one-hot encoded vector with a matrix returns the row of the matrix corresponding
the index of the input unit.
Instead of doing the matrix multiplication, we use the weight matrix as a lookup table. We encode the words as integers,
for example "heart" is encoded as 958, "mind" as 18094. Then to get hidden layer values for "heart", you just take the 958th
row of the embedding matrix. This process is called an embedding lookup and the number of hidden units is the
embedding dimension.
There is nothing magical going on here. The embedding lookup table is just a weight matrix. The embedding layer is just a
hidden layer. The lookup is just a shortcut for the matrix multiplication. The lookup table is trained just like any weight matrix.
Word2Vec is a statistical method for efficiently learning a word embedding from a text corpus. The Word2Vec algorithm finds much more
efficient representations by finding vectors that represent the words. These vectors also contain semantic information about the words.
Words that show up in similar contexts, such as "coffee", "tea", and "water" will have vectors near each other. Different words
will be further away from one another, and relationships can be represented by distance in vector space.
There are two architectures for implementing Word2Vec:
- CBOW (Continuous Bag-Of-Words)
- Skip-Gram
The embedding is learnt by predicting the current word based on its context.
Both CBOW and Skip-Gram has its own pros and cons, but Skip-Gram generally performs well. So, we will implement Skip-Gram. To see the implementation of the Skip-Gram, go to Skip-Gram.ipynb notebook or click here