Data Representation in NLP

Shivangi Singhal
13 min read · Jun 8, 2019


What is Vectorization?

We all know that computers understand binary language, in the form of 0s and 1s, and cannot understand words naturally. Encoding words into numeric form solves this problem.

The process of converting textual information into numbers is called Vectorization. It is also known as feature extraction.

There are two broad ways to convert text into numbers: Sparse Vector Representations and Dense Vector Representations.

Note: The code for this blog is available on GitHub. To learn how to preprocess data before building its representation, go to this blog.

Sparse Vector Representations

(1) Bag of Words (BoW)

Suppose I have a text document. Cut this document into words (i.e. perform word tokenization) and remove all punctuation. Now imagine putting every new word (one that has not been seen before) into a bag. After doing this for every word, you will have a bag full of unique (non-repeated) tokens. This is what we call the Bag of Words (BoW) for a document.

BoW Explanation

Bag of Words (BOW) is an algorithm that counts how many times a word appears in a document. Those word counts allow us to compare documents and gauge their similarities for applications like search, document classification, and topic modeling.

It is so named because it is concerned only with whether (and how often) a word occurs, not with where the word is placed in the document (i.e. its order). The intuition behind this approach is that similar documents contain similar words.
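The counting idea above can be sketched in a few lines of plain Python. This is only a minimal illustration: the two example documents and the helper names `tokenize` and `bow_vector` are made up for this sketch, and real pipelines usually do more careful tokenization.

```python
from collections import Counter
import string


def tokenize(text):
    """Lowercase the text, strip ASCII punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


# Two toy documents (illustrative only).
docs = [
    "It was the best of times, it was the worst of times.",
    "It was the age of wisdom.",
]

# The "bag": every unique token seen across the documents, in sorted order.
vocabulary = sorted({token for doc in docs for token in tokenize(doc)})


def bow_vector(text):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


vectors = [bow_vector(doc) for doc in docs]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each document becomes a vector as long as the vocabulary, and word order plays no role in the result.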

Tf-Idf numerical example

NOTE: In the above approach, each word or token is called a “gram”.

A better approach will be to create a vocabulary of grouped words. This will change the scope of the vocabulary and will allow the bag-of-words to capture a little bit more meaning from the document.

Creating a vocabulary of two-word pairs is called a bigram model.

A vocabulary of triplets of words is called a trigram model.

For example, the bigrams for “It was the best of times” are as follows:

  • “it was”
  • “was the”
  • “the best”
  • “best of”
  • “of times”

Similarly, the trigrams for “It was the best of times” are as follows:

  • “it was the”
  • “was the best”
  • “the best of”
  • “best of times”

The general approach is called the n-gram model, where n refers to the number of grouped words.
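The bigram and trigram lists above can be reproduced with a small helper. This is a minimal sketch; the function name `ngrams` is chosen for illustration, and it assumes simple whitespace tokenization.

```python
def ngrams(text, n):
    """Return all runs of n consecutive tokens in the text, as strings."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "It was the best of times"
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)

print(bigrams)
print(trigrams)
```

Setting n to 1 gives back the individual words, which is the plain bag-of-words vocabulary.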


Advantages of BoW:

  • Simple to understand and implement.


Disadvantages of BoW:

(1) Sparsity: The vocabulary consists of all unique words present in the given data. If the data is large and contains many unique words, each document vector becomes very long and mostly zeros (sparse).

Sparsity increases both space and time complexity. It also makes it harder for models to extract the small amount of useful information from such a large representational space.

(2) Lack of word order: The bag-of-words model is concerned only with the occurrence of a word, not with where it is placed in the document (i.e. its order). This leads to a loss of contextual information and, in turn, of the meaning of words in the document (semantics).

For example, the model cannot distinguish (a) the same words arranged differently (“this is interesting” vs “is this interesting”) or recognize (b) synonyms (“old bike” vs “used bike”), and much more.

In scikit-learn, this task is performed using CountVectorizer().
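As a rough sketch of how this looks in practice, the snippet below uses scikit-learn's `CountVectorizer` on two toy sentences. The sentences are made up for illustration, and the default settings are assumed (lowercasing, and tokens of at least two characters).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents (illustrative only).
docs = [
    "It was the best of times",
    "It was the worst of times",
]

# Unigram bag of words with the default settings.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)   # sparse matrix of shape (n_docs, vocab_size)
vocab = vectorizer.vocabulary_       # dict mapping token -> column index
dense = X.toarray()                  # dense copy, fine for a tiny example

print(sorted(vocab, key=vocab.get))
print(dense)

# The same class can build a bigram vocabulary via ngram_range.
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
bigram_vectorizer.fit(docs)
print(sorted(bigram_vectorizer.vocabulary_))
```

The matrix is stored in sparse form because most entries are zero once the vocabulary grows.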

(2) Tf-idf Vectorization

Tf-Idf is shorthand for term frequency-inverse document frequency. So, two things: term frequency and inverse document frequency.

Term frequency (TF) is basically the output of the BoW model. For a specific document, it measures how important a word is by looking at how frequently it appears in that document: if a word appears many times, it is assumed to be important.

Inverse document frequency (IDF) is used to weight words by how rare they are across all documents. Words that occur rarely in the corpus have a high IDF score. This matters because certain terms, such as “I” and “a”, may appear many times but carry little importance. Multiplying TF by IDF therefore weighs down the frequent terms while scaling up the rare ones.
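To make the two factors concrete, here is a small worked sketch in plain Python. It assumes one common textbook variant (TF as count divided by document length, IDF as the natural log of N over document frequency); libraries often use smoothed or normalized variants, so their numbers may differ. The toy corpus and function names are illustrative only.

```python
import math
from collections import Counter

# A toy corpus of three tokenized documents (illustrative only).
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]


def term_frequency(term, doc):
    """Fraction of the document's tokens that are this term."""
    return Counter(doc)[term] / len(doc)


def inverse_document_frequency(term, docs):
    """log(N / number of documents containing the term); 0.0 if none contain it."""
    containing = sum(1 for doc in docs if term in doc)
    if containing == 0:
        return 0.0
    return math.log(len(docs) / containing)


def tf_idf(term, doc, docs):
    return term_frequency(term, doc) * inverse_document_frequency(term, docs)


for term in ["the", "cat", "mat"]:
    print(term, round(tf_idf(term, docs[0], docs), 4))
```

In this corpus “the” occurs in every document, so its IDF is log(1) = 0 and its score vanishes, while “mat” occurs in only one document and gets the highest score of the three.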

(3) Hashing Vectorization


Advantages:

  • it uses very little memory and scales to large datasets, as there is no need to store a vocabulary dictionary in memory
  • it is fast to pickle and un-pickle as it holds no state besides the constructor parameters
  • it can be used in a streaming (partial fit) or parallel pipeline as there is no state computed during fit.


Disadvantages:

  • there is no way to compute the inverse transform (from feature indices to string feature names), which can be a problem when trying to introspect which features are most important to a model.
  • there can be collisions: distinct tokens can be mapped to the same feature index. However in practice this is rarely an issue if n_features is large enough (e.g. 2 ** 18 for text classification problems).
  • no IDF weighting as this would render the transformer stateful.
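The idea behind these trade-offs is the hashing trick: each token is hashed directly to a column index, so no vocabulary is ever stored. Below is a minimal sketch in plain Python. It is an assumption-laden illustration, not scikit-learn's implementation: it uses MD5 purely to get a deterministic hash, a tiny `n_features`, and a made-up helper name `hashed_vector`.

```python
import hashlib


def hashed_vector(text, n_features=16):
    """Map each token to a column via a hash; no vocabulary is stored anywhere."""
    vector = [0] * n_features
    for token in text.lower().split():
        # MD5 is used here only as a deterministic hash (an assumption of this sketch).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % n_features
        vector[index] += 1
    return vector


vector = hashed_vector("the cat sat on the mat")
print(vector)
```

Because the mapping is one-way, there is no way to recover the token from a column index, and two different tokens may land in the same column. A larger `n_features` makes such collisions less likely.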

Dense Vector Representations

Word Embeddings

Suppose you open Google, search for a news article on the Tokyo 2020 Olympics, and get hundreds of search results in return. We humans can deal with text quite intuitively, but millions of documents are generated every single day, and we cannot have humans perform every desired task on them. It is neither scalable nor effective.

The desired tasks are: Clustering, Classification, Sentiment Analysis and so on.

So, we have to make computers perform such tasks on text data. But computers are generally inefficient at handling and processing strings or text directly, and this is a problem!

We know a computer can match two strings and tell you whether they are the same or not. But how do we make a computer tell us about football or Ronaldo when we search for Messi? Or how do we make a computer understand that “Apple” in “Apple is a tasty fruit” is a fruit that can be eaten and not a company?

The answer to the above questions lies in creating a representation for words that captures their meanings, semantic relationships and the different types of contexts they are used in.

All of this is achieved using Word Embeddings.

Definition: Word Embeddings are numerical representations of text that computers can handle.

Where is Word Embedding used?

Word embeddings help in feature generation, document clustering, text classification, and other natural language processing tasks. Let us list these applications and discuss each briefly.

  • Compute similar words: Word embeddings are used to suggest words similar to a given query word. They can also suggest dissimilar words, as well as the most common words.
  • Create a group of related words: They are used for semantic grouping, which places things with similar characteristics together and dissimilar things far apart.
  • Feature for text classification: Text is mapped into arrays of vectors, which are fed to the model for training as well as prediction. Text-based classifier models cannot be trained on raw strings, so this converts the text into a machine-trainable form. The semantic information in the embeddings further helps text-based classification.
  • Document clustering is another application where word embeddings are widely used.
  • Natural language processing: There are many tasks where word embeddings are useful and outperform hand-built feature extraction, such as part-of-speech tagging, sentiment analysis, and syntactic analysis.

Why use this approach:

Sparse vector representations (used, for example, as the input to Latent Semantic Analysis) rely on the Bag of Words concept, where words are represented in the form of encoded vectors. The result is a sparse vector whose dimension is equal to the size of the vocabulary. If a word occurs in the dictionary, it is counted; otherwise it is not. But the Bag of Words method has these disadvantages:

  • It ignores the order of the words; for example, ‘this is bad’ = ‘bad is this’.
  • It ignores the context of words. Suppose I write “He loved books. Education is best found in books”. It would create two vectors, one for “He loved books” and another for “Education is best found in books”, and treat them as almost independent of each other (they overlap only on the literal word “books”), but in reality they are closely related.

To overcome these limitations, Word Embeddings were developed.

Word2Vec is one approach to implementing them.

Word2Vec Representations

Word2Vec places words in the feature space in such a way that their location is determined by their meaning, i.e. words with similar meanings are clustered together, and the distance between two words is itself meaningful.

What is word2vec?

Word2vec is a technique/model for producing word embeddings for better word representation. It captures a large number of precise syntactic and semantic word relationships. It is a shallow, two-layer neural network.

A shallow neural network has only one hidden layer between input and output (whereas a deep neural network contains multiple hidden layers between input and output).

Shallow Neural Network

What does word2vec do?

Word2vec represents words in a vector space. Words are represented as vectors, placed in such a way that words with similar meanings appear close together and dissimilar words are situated far away from each other.

Keeping similar words together and dissimilar words apart in this way is what is meant by capturing a semantic relationship.
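Closeness between word vectors is commonly measured with cosine similarity. The sketch below uses hand-picked, hypothetical 3-dimensional vectors purely for illustration; real word2vec embeddings are learned from data and typically have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: near 1.0 = similar, near 0.0 = unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, hand-picked for illustration only (not learned).
embeddings = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.85, 0.15, 0.05],
    "banana": [0.0, 0.2, 0.95],
}

sim_close = cosine_similarity(embeddings["car"], embeddings["automobile"])
sim_far = cosine_similarity(embeddings["car"], embeddings["banana"])

print(round(sim_close, 3), round(sim_far, 3))
```

With these made-up vectors, “car” and “automobile” point in nearly the same direction, while “car” and “banana” are almost perpendicular.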

Neural networks do not understand text; they understand only numbers. Word Embeddings provide a way to convert text into numeric vectors.

Word2vec reconstructs the linguistic context of words. But what is linguistic context? Generally, when we communicate with someone, the other person tries to figure out the objective of our sentence.

For example, in “What is the temperature of India”, the user wants to know the temperature of India, and that is the context. In short, the main objective of a sentence is its context.

The words or sentences surrounding a piece of spoken or written language (discourse) help determine its meaning.

Word2Vec learns vector representations of words from their contexts.

Advantages of Word2Vec

High-quality embeddings can be learned quite efficiently, especially compared with neural probabilistic language models. That means low space and time complexity to generate a rich representation.

How is word2vec different from Sparse Vector Representations (SVR), or why do we need Word2Vec, GloVe and FastText when we have BoW and Tf-Idf?

The main difference is that tf-idf produces a vector per document, whose entries are weights for the terms in that document, whereas Word2Vec focuses on representing each individual word/term by its own dense vector.


How does Tf-Idf work?

Similarity here reflects how commonly words occur in the corpus; it does not mean the words are similar in meaning.

In Word2Vec,

we assume that if two words occur at the same “position” in two sentences, then they are closely related, either semantically or syntactically. Word2Vec vectors are a more advanced representation of terms and can determine the meaning of a word by looking at its company (its context).

For example, in a big text corpus, there are two sentences:

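Sentence 1: “BMW is a German car manufacturer”

Sentence 2: “BMW is a German automobile manufacturer”

Here “car” and “automobile” occur in the same position with the same surrounding words, so Word2Vec learns similar vectors for them.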

Word2vec Architecture

There are two architectures used by word2vec:

  1. Continuous Bag of Words (CBOW)
  2. Skip-gram

Why are these architectures or models important from a word representation point of view?

Learning word representation is essentially unsupervised, but targets (labels) are needed to train the model. Both Skip-gram and CBOW convert unsupervised representation to supervised form for model training.

In both models, a window of predefined length is moved along the corpus, and in each step the network is trained with the words inside the window.
Whereas the CBOW model is trained to predict the word in the center of the window based on the surrounding words, the Skip-gram model is trained to predict the contexts based on the central word. Once the neural network has been trained, the learned linear transformation in the hidden layer is taken as the word representation.
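The sliding window described above can be sketched as a function that turns a token list into training examples for both models. This is a minimal illustration; the function name `training_pairs` and the window size are choices made for this sketch, and a real implementation would add steps such as subsampling of frequent words.

```python
def training_pairs(tokens, window=2):
    """Build (context, target) examples for CBOW and (center, context) pairs for Skip-gram."""
    cbow, skipgram = [], []
    for i, center in enumerate(tokens):
        # Words up to `window` positions to the left and right of the center word.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        cbow.append((context, center))                   # predict center from its context
        skipgram.extend((center, c) for c in context)    # predict each context word from center
    return cbow, skipgram


tokens = "bmw is a german car manufacturer".split()
cbow_pairs, skipgram_pairs = training_pairs(tokens, window=1)

print(cbow_pairs[4])
print(skipgram_pairs[:4])
```

The raw text carries no labels, yet every window position yields an input and a target, which is how both models turn an unsupervised problem into a supervised one.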

Word2vec provides an option to choose between CBOW (Continuous Bag of Words) and Skip-gram. This choice is passed as a parameter when training the model. One can also choose between negative sampling and a hierarchical softmax layer (an advanced concept; if curious, Google it).

Continuous Bag of Words Model

The final output vector of the neural network has a size equal to the length of the vocabulary. Each entry gives the probability that the corresponding vocabulary word is the target word, given the context words.
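A forward pass of CBOW can be sketched with NumPy as follows. This is a simplified illustration under stated assumptions: the weights are random and untrained, the embedding size of 4 is arbitrary, the names `W_in`, `W_out` and `cbow_forward` are made up, and a full softmax is used rather than negative sampling or hierarchical softmax.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["bmw", "is", "a", "german", "car", "manufacturer"]
word_to_id = {word: i for i, word in enumerate(vocab)}
V, D = len(vocab), 4   # vocabulary size and (arbitrary) embedding size

# Untrained weights: the rows of W_in are the word embeddings being learned.
W_in = rng.normal(scale=0.1, size=(V, D))
W_out = rng.normal(scale=0.1, size=(D, V))


def cbow_forward(context_words):
    """Average the context embeddings, then apply a softmax over the vocabulary."""
    ids = [word_to_id[word] for word in context_words]
    hidden = W_in[ids].mean(axis=0)               # hidden activation, shape (D,)
    scores = hidden @ W_out                       # one score per vocabulary word, shape (V,)
    exp_scores = np.exp(scores - scores.max())    # shift by the max for numerical stability
    return exp_scores / exp_scores.sum()


probs = cbow_forward(["german", "manufacturer"])
print(dict(zip(vocab, probs.round(3))))
```

Since the weights are untrained, the probabilities come out close to uniform. Training would adjust `W_in` and `W_out` so that the true center word receives a high probability.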

Advantages of CBOW:

  1. Being probabilistic in nature, it is generally expected to perform better than deterministic methods.
  2. It is low on memory. It does not have the huge RAM requirements of a co-occurrence matrix, which needs to store three huge matrices.

Disadvantages of CBOW:

  1. CBOW takes the average of the contexts of a word (when computing the hidden activation). For example, Apple can be both a fruit and a company, but CBOW averages both contexts and places the word somewhere between the cluster for fruits and the cluster for companies.
  2. Training a CBOW from scratch can take forever if not properly optimized.

Skip Gram Model:

Which model to choose?

CBOW is several times faster to train than Skip-gram and gives slightly better accuracy for frequent words, whereas Skip-gram works well with a small amount of training data and represents even rare words or phrases well.

Implementation via Gensim

Problems encountered:

Till now, we took a corpus, generated word embeddings for its words using word2vec, and used them to find similar words, find dissimilar words, perform dimensionality reduction, and so on. But a major drawback of this kind of learning is that looking up a word that does not belong to the corpus the model was trained on raises an error.

In other words, if my corpus is ‘abc’, then words belonging to this corpus can be used for analysis, but words outside it cannot.

This calls for the following solutions:

(a) use pre-trained embeddings

(b) use customised embeddings

Pre-trained Word Embeddings

Pre-trained models are the simplest way to start working with word embeddings. A pre-trained model is a set of word embeddings that have been created elsewhere that you simply load onto your computer and into memory.

The advantage of these models is that they can leverage massive datasets that you may not have access to, built using billions of different words, with a vast corpus of language that captures word meanings in a statistically robust manner.

Example training datasets include the entire corpus of Wikipedia text, the Common Crawl dataset, and the Google News dataset.

Using a pre-trained model removes the need for you to spend time obtaining, cleaning, and processing (intensively) such large datasets.

Pre-trained models are also available in languages other than English, opening up multi-lingual opportunities for your applications.


The disadvantage of pre-trained word embeddings is that the words contained within may not capture the minute details of language in your specific application domain. For example, Wikipedia may not have great word exposure to particular aspects of legal doctrine or religious text, so if your application is specific to a domain like this, your results may not be optimal due to the generality of the downloaded model’s word embeddings.

Pre-trained models in Gensim

Gensim doesn’t come with built-in models, so to load a pre-trained model into Gensim, you first need to find and download one.

Pre-trained models that can be downloaded and used include Google’s Word2Vec, FastText, Godin, and GloVe. See [7] for more details.

Note: A popular pre-trained option is the Google News dataset model, containing 300-dimensional embeddings for 3 million words and phrases.

Google Word2Vec

It is a technique based on a two-layer neural network. Google’s Word2Vec takes a large body of text as input (in this scenario, Google’s data) and converts its words into a vector space. The Google Word2Vec model is pre-trained on the Google News dataset.

Link to pre-trained Google Word2Vec model :

Custom Word Embeddings

Training your own word embeddings for a specific problem domain (e.g. fake news, healthcare) can lead to better performance than pre-trained models.

The Gensim library provides a simple API to the Google word2vec algorithm, which is a go-to algorithm for beginners.

To train your own model, the main challenge is getting access to a training dataset. Computation is not a big problem, as you will be able to process a large model on a powerful laptop in hours rather than days.


References

[1] Definition of Sparse Vector Representations

[2] Word2Vec vs Sparse Vector Representations

[3] Continuous Bag of Words Model

[4] Skip Gram Model

[5] NLTK Corpora

[6] Reference for word embeddings

[7] List of Pre-trained Word Embeddings

[8] Different applications of the Gensim Word2Vec implementation, such as finding most similar words, dissimilar words and so on

[9] Overview of GloVe and FastText