Word Embedding in NLP
What is Word Embedding?
Word Embedding is a term used in NLP (Natural Language Processing) for representing words in a vectorized format. Words that are closer together in the vector space are expected to have similar meanings, because the embedding captures both syntactic and semantic information.
In such a vector space, we can observe that man-king and woman-queen are closely related to each other, but there is a noticeable distance between Paris and man, because these words have different meanings: Paris relates to a location or place, while man-king and woman-queen relate to gender.
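As a quick illustration, the classic king/queen relationship can be reproduced with pre-trained vectors. The snippet below is a minimal sketch assuming gensim and its downloader are available along with the "glove-wiki-gigaword-50" model; the exact similarity values and nearest neighbours depend on the pre-trained model used.

```python
import gensim.downloader as api

# Load small pre-trained GloVe vectors via gensim's downloader (assumed available).
model = api.load("glove-wiki-gigaword-50")

# Words with related meanings sit close together in the vector space.
print(model.similarity("man", "king"))    # relatively high
print(model.similarity("man", "paris"))   # noticeably lower

# The classic analogy: king - man + woman is expected to land near "queen".
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```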
Why do we need Word Embedding?
Machine Learning algorithms cannot process words or text directly; they can only process numbers. So, to feed data to an algorithm, it must be in numerical format. To achieve this, we convert words or text into a one-hot format (0 or 1) or into a vector format.
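For intuition, here is a minimal sketch of one-hot encoding with a tiny made-up vocabulary (the vocabulary and word list are hypothetical, purely for illustration):

```python
import numpy as np

# Toy vocabulary; each word gets its own index.
vocab = ["i", "like", "eating", "apple"]
word_to_index = {word: idx for idx, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("apple"))  # [0 0 0 1]
```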
Types of Word Embedding
There are two types of Word Embedding:
- Frequency Based
- Prediction Based
Frequency Based Word Embedding
Count Vector: This vectorizer learns the vocabulary from all documents and then counts the number of times each word appears in each document. For example, if we have d documents and t different words in our vocabulary, then the size of the count vector matrix will be d*t. Let's take the following two documents:
Document 1: “I like eating Apple”
Document 2: “I work in Apple”
From these two documents, our vocabulary is as follows:
{ I, like, work, eating, in, Apple}
so d = 2, t = 6
Now, we count the number of times each word occurs in each document. Across Document 1 and Document 2 combined, "I" and "Apple" appear twice (once in each document), while "like", "work", "in", and "eating" each appear once,
so the count vector matrix is:

|            | I | like | work | eating | in | Apple |
|------------|---|------|------|--------|----|-------|
| Document 1 | 1 | 1    | 0    | 1      | 0  | 1     |
| Document 2 | 1 | 0    | 1    | 0      | 1  | 1     |
From the count vector matrix alone we cannot conclude much, since it is based purely on counts. We cannot find the context of "Apple" (the company versus the fruit). Also, such a matrix suffers from sparsity, and the model might lose out on some important but less frequent features. To address this, we have the next vectorizer, called TF-IDF.
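The same matrix can be produced with scikit-learn's CountVectorizer. This is a minimal sketch assuming scikit-learn 1.x is installed; note that the columns follow scikit-learn's alphabetically sorted, lowercased vocabulary rather than the order in the table above.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = ["I like eating Apple", "I work in Apple"]

# token_pattern is relaxed so single-character words like "I" are kept
# (scikit-learn's default pattern drops tokens shorter than two characters).
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
count_matrix = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())  # vocabulary, lowercased and sorted alphabetically
print(count_matrix.toarray())              # one row per document, one column per word
```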
TF-IDF (Term Frequency-Inverse Document Frequency): It consists of two parts, TF (Term Frequency) multiplied by IDF (Inverse Document Frequency). TF is the number of times a term appears in a document (often normalized by the document length), and IDF is typically log(N / n_t), where N is the total number of documents and n_t is the number of documents containing the term. The main intuition of TF-IDF is that words appearing frequently in one document but rarely in other documents provide extra insight into that document and can help our model learn from this additional information. In short, words that are common across all documents are penalized in TF-IDF. The resulting values are relative weights expressed as floating-point numbers rather than raw counts.
In the resulting TF-IDF matrix, we can see the relative importance of each word, which helps when making inferences about a sentence.
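Here is a minimal sketch with scikit-learn's TfidfVectorizer on the same two documents (same tokenizer tweak and library assumption as above); the exact weights depend on scikit-learn's smoothed IDF formula and normalization.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["I like eating Apple", "I work in Apple"]

vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
tfidf_matrix = vectorizer.fit_transform(documents)

# Words shared by both documents ("i", "apple") receive lower weights than
# words unique to a single document ("like", "eating", "work", "in").
print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))
```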
Co-occurrence Matrix with Context Window: When words occur together, they might share a similar context. For example: "Apple is a fruit. Mango is a fruit."
Apple and Mango share a similar context, i.e., they are both fruits.
- Co-occurrence: for a given corpus, the co-occurrence of a pair of words, say w1 and w2, is the number of times they appear together within a Context Window.
- Context Window: a context window is specified by a size (a number of words) and a direction (to the left and/or right of a word).
Let's take another example:
- I am learning Machine Learning
- I am learning Deep Learning
- I like NLP
Let the window size be 1. This means that the context words for each word are one word to the left and one word to the right. The context words for each word are listed below (a short code sketch after the list shows how such counts can be computed):
- I = am (2 times), like (1 time)
- like = NLP (1 time)
- am = learning (2 times)
- learning = Machine (1 time), Deep (1 time)
- Machine = learning (1 time)
- Deep = learning (1 time)
- NLP = like (1 time)
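A minimal sketch of building such co-occurrence counts in plain Python follows. It counts one neighbour on each side of every word and lowercases the tokens, which is an assumption on top of the hand-worked list above (lowercasing merges "Learning" with "learning", so some counts may differ slightly):

```python
from collections import defaultdict

corpus = [
    "I am learning Machine Learning",
    "I am learning Deep Learning",
    "I like NLP",
]
window_size = 1  # one word to the left and one to the right

# co_occurrence[(w1, w2)] = number of times w2 appears within w1's context window
co_occurrence = defaultdict(int)
for sentence in corpus:
    tokens = sentence.lower().split()
    for i, word in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                co_occurrence[(word, tokens[j])] += 1

print(dict(co_occurrence))
```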
Prediction Based Word Embedding
CBOW and Skip-Gram: These are the two techniques used in Word2vec, one of the most popular algorithms in the word embedding space, developed by Tomas Mikolov, a researcher at Google. With the frequency-based techniques mentioned above, we do not capture any semantic meaning from the words of the corpus. However, most NLP tasks, such as sentiment classification and sarcasm detection, require the semantic meaning of a word and its semantic relationships with other words.
Word2vec is capable of capturing semantic information from words using CBOW and Skip-Gram.
CBOW (Continuous Bag of Words): CBOW takes the context words as input and predicts the center word within the window. Let's take an example:
Here the sentence is "Pineapples are spikey and yellow", and we have to predict the middle word: "Pineapples are ____ and yellow".
Input: Pineapples, are, and, yellow
Output: spikey
The CBOW model takes the context words as input, sends them to a hidden layer (the embedding layer), and from there predicts the target word.
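As a minimal sketch, gensim's Word2Vec can be trained in CBOW mode by setting sg=0; the toy corpus below is far too small to learn meaningful embeddings, and the parameter names assume gensim 4.x.

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens (far too small for real training).
sentences = [
    ["pineapples", "are", "spikey", "and", "yellow"],
    ["mangoes", "are", "sweet", "and", "yellow"],
]

# sg=0 selects CBOW: the context words are combined to predict the center word.
cbow_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(cbow_model.wv["spikey"][:5])           # first few dimensions of a learned vector
print(cbow_model.wv.most_similar("yellow"))  # nearest neighbours in this tiny space
```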
Skip-Gram: Skip-Gram takes the center word of the window as input and predicts the context words (neighbouring words) as outputs.
Here the sentence is "Pineapples are spikey and yellow", and we have to predict the surrounding words: "___ __ spikey __ __".
Input: spikey
Output: Pineapples, are, and, yellow
The Skip-Gram model takes each word in the corpus as input, sends it to a hidden layer (the embedding layer), and from there predicts the context words.
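The same gensim sketch switches to Skip-Gram by setting sg=1 (again just an illustrative assumption on a toy corpus, with gensim 4.x parameter names):

```python
from gensim.models import Word2Vec

sentences = [
    ["pineapples", "are", "spikey", "and", "yellow"],
    ["mangoes", "are", "sweet", "and", "yellow"],
]

# sg=1 selects Skip-Gram: the center word is used to predict each context word.
skipgram_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(skipgram_model.wv.most_similar("spikey"))
```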