Processing Text data in Natural Language Processing

Shivangi Singhal
13 min read · Jun 7, 2019


Hello Everyone!

This blog is exclusively for beginners who want to start their journey in NLP. I have tried to explain things as simply as possible to give you a smooth start.

To start with, the first and most important step in dealing with text is pre-processing. It consists of a series of steps that make the data ready for computers to read and analyze. After successfully completing this first step, the next step involves extracting features from the text for further analysis. All this will be explained here, so let’s get started!!!

This blog introduces you to the theoretical concepts; for the practical version, please refer to the Python code on GitHub.

What is NLP?

A field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages.

Data Pre-Processing

It is also termed data cleaning or data wrangling. It is the process of converting data from its initial raw form into another format, so that the prepared data can be used for further analysis.

Why do we need data pre-processing?

  • Data Cleaning: this helps in removing noise and resolving inconsistencies in the data.
  • Data Integration: means merging data from multiple sources. This process must be carefully implemented in order to avoid redundancies and inconsistencies in the resulting data set. Conflicts within the data are resolved.
  • Data Transformation: data is normalized, aggregated and generalized.

The different pre-processing steps include:

(1) Lowercasing

(2) Tokenization

(3) Punctuation removal

(4) Expanding contractions

(5) Stop-word removal

(6) Handling alpha-numeric characters

(7) Stemming and Lemmatization

Lowercasing

It is the first step that should be done before any other pre-processing task. It helps maintain consistency in the expected output.


(a) solves the sparsity issue: there can be situations where instances in the text represent the same word but are written using different cases (i.e. using only capital letters like MANGO, only lowercase like mango, or a combination of lower and upper case like ManGo, manGo, MAnGo). Such words are all mapped to the same lowercase form.

For example: MANGO, ManGo, mangO and MAnGo are all mapped to ‘mango’.

(b) speeds up the search process: suppose you are looking for the word ‘uk’ in a document but cannot fetch any results. This may be because the word was represented as ‘UK’ in the document. To avoid such issues and speed up the search process, lowercasing is useful.

Disclaimer: Though lowercasing is a useful step to perform, there are situations where we want to preserve capitalization (e.g. ‘US’ vs ‘us’); it should be avoided then.
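As a minimal sketch (plain Python, no external libraries), lowercasing a list of tokens looks like this:

```python
def normalize_case(tokens):
    # map every surface form (MANGO, ManGo, MAnGo, ...) to one lowercase form
    return [t.lower() for t in tokens]

print(normalize_case(["MANGO", "ManGo", "mangO", "MAnGo"]))
# → ['mango', 'mango', 'mango', 'mango']
```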


Tokenization

A process by which a large quantity of text is divided into smaller parts is called tokenization. We know that a series of characters arranged in a semantic form leads to the formation of a word (e.g. ‘o’, ‘r’, ‘a’, ’n’, ‘g’, ‘e’, ‘s’ form ‘oranges’). Similarly, a series of words forms a sentence (e.g. ‘I’, ‘love’, ‘oranges’ makes the sentence: I love oranges).

These small units of a sentence, which carry semantic information on their own, are called tokens.

Note: Larger chunks of data can be tokenized into sentences, sentences can be tokenized into words, etc.

The two types are:
* Sentence Tokenization
* Word Tokenization

Word Tokenization: the process of splitting a sentence into words using the space character as a delimiter. The only drawback is that it also splits multi-word expressions like ‘New York’ into ‘New’ and ‘York’.

Sentence Tokenization: the process of splitting a document into sentences using punctuation marks (like the full stop (.), question mark (?), exclamation mark (!) and so on) as delimiters. These delimiters help detect where a sentence ends.
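A rough sketch of both tokenizers using only the standard library (real tokenizers such as NLTK’s handle many more edge cases, e.g. abbreviations like ‘Dr.’):

```python
import re

def sentence_tokenize(text):
    # split after sentence-ending punctuation followed by whitespace
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def word_tokenize(sentence):
    # naive whitespace split; multi-word expressions like 'New York' come apart
    return sentence.split()

doc = "I love oranges. Do you?"
sentences = sentence_tokenize(doc)   # ['I love oranges.', 'Do you?']
words = word_tokenize(sentences[0])  # ['I', 'love', 'oranges.']
```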


Contractions (Appos)

Words that are written with an apostrophe are termed contractions. For example: don’t, I’ll, can’t etc. Since we aim to standardize our text, it makes sense to expand these contractions, i.e. don’t → ‘do not’, can’t → ‘cannot’ and I’ll → ‘I will’.
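A sketch of dictionary-based expansion; the small CONTRACTIONS mapping here is illustrative, and a real system would use a much fuller dictionary:

```python
import re

# illustrative subset only
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "i'll": "i will"}

def expand_contractions(text):
    text = text.replace("\u2019", "'")  # normalize curly apostrophes first
    def repl(match):
        word = match.group(0)
        return CONTRACTIONS.get(word.lower(), word)
    return re.sub(r"\w+'\w+", repl, text)

print(expand_contractions("I don't know"))  # → I do not know
```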

Stop Words

In NLP, words that carry little meaning are termed stop words. These are the words that occur most commonly in a document (e.g. a, an, the, in etc.). Such words are filtered out before further processing of the text, since they contribute little to its overall meaning.

Removing stop words helps in reducing: (a) storage space in the database, and (b) processing time.

NLTK library has a list of stopwords stored in 16 different languages. You can find them in the nltk_data directory [2].
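A sketch of stop-word filtering with a small illustrative stop-word set (NLTK’s English list is much longer):

```python
# illustrative subset; NLTK's English stop-word list has ~180 entries
STOP_WORDS = {"a", "an", "the", "in", "is", "of", "to"}

def remove_stop_words(tokens):
    # drop tokens whose lowercase form is a stop word
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["The", "cat", "in", "the", "hat"]))
# → ['cat', 'hat']
```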

Alpha-numeric Characters

Alphanumeric characters are those drawn from the combined set of the 26 alphabetic characters, A to Z, and the 10 Arabic numerals, 0 to 9.

Non-alphanumeric characters include punctuation marks, operators and whitespace, for example: ! @ # % & * ( ) [ ] { } < > = + - / | \ ~ ` ' " , . ? : ; ^ _
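Stripping non-alphanumeric characters is a one-liner with a regular expression; this sketch keeps letters, digits and spaces:

```python
import re

def strip_non_alphanumeric(text):
    # replace every character that is not a letter, digit or whitespace
    cleaned = re.sub(r'[^A-Za-z0-9\s]', ' ', text)
    # collapse the runs of whitespace the substitution leaves behind
    return re.sub(r'\s+', ' ', cleaned).strip()

print(strip_non_alphanumeric("Hello, world!! #2019"))  # → Hello world 2019
```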

Morphological Normalization

This type of normalization is needed when there are multiple representations of a single word. For example: play, player, playing and played are all mapped to ‘play’. Though such words mean different things, contextually they are all similar.

This step converts all the variants of a word into their normalized form (also known as the stem/lemma).

Normalization is an important step for feature engineering with text as it converts the high dimensional features (n different features) to the low dimensional space (1 feature), which is considered as an ideal task for any ML model.

Inflection: In grammar, inflection is the modification of a word to express different grammatical categories such as tense, case, voice, aspect, person, number, gender, and mood. An inflection expresses one or more grammatical categories with a prefix, suffix or infix, or another internal modification such as a vowel change. For example: playing, plays, played- play

Sentence example: ‘the boy’s car has different colors’ → ‘the boy car has differ color’

Applications: Stemming and Lemmatization are widely used in tagging systems, indexing, SEO, web search results, and information retrieval. For example, searching for fish on Google will also return fishes and fishing, as fish is the stem of both words.


Stemming

It is the process of reducing inflected words to their root forms, mapping a group of words to the same stem even if the stem itself is not a valid word in the language.

The stem (root) is the part of the word to which you add inflectional (changing/deriving) affixes such as -ed, -ize, -s, de- and mis-.

Disadvantage: stemming a word or sentence may produce words that are not actual words, e.g. ‘daily’ is converted to ‘dai’ when stemmed, which makes no sense.

Stems are created by removing the suffixes or prefixes used with a word. Note: Removing suffix from a word is termed as suffix stripping.
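Suffix stripping can be sketched in a few lines; this toy stemmer is far cruder than real algorithms like Porter’s, which apply ordered rules with measure conditions:

```python
# suffixes tried in order; longest first so 'ing' wins over 's'
SUFFIXES = ("ing", "ed", "er", "s")

def simple_stem(word):
    for suffix in SUFFIXES:
        # only strip if a reasonably long stem remains
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([simple_stem(w) for w in ["playing", "played", "player", "plays"]])
# → ['play', 'play', 'play', 'play']
```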

Issues with stemming: stemming is usually based on heuristics and is far from perfect. In fact, it commonly suffers from two issues in particular: overstemming and understemming.

(1) Overstemming occurs when too much of a word is cut off. This can result in nonsensical stems, where all the meaning of the word is lost or muddled. Or it can result in words being resolved to the same stem, even though they probably should not be.

Take the four words university, universal, universities, and universe. A stemming algorithm that resolves these four words to the stem “univers” has overstemmed. While it might be nice to have universal and universe stemmed together and university and universities stemmed together, all four do not fit. A better resolution might have the first two resolve to “univers” and the latter two resolve to “universi.” But enforcing rules that make that so might result in more issues arising.

(2) Understemming is the opposite issue. It occurs when we have several words that actually are forms of one another. It would be nice for them all to resolve to the same stem, but unfortunately, they do not.

This can be seen if we have a stemming algorithm that stems the words data and datum to “dat” and “datu.” And you might be thinking, well, just resolve these both to “dat.”

A computer program that stems words is called a stemmer. NLTK has stemmers for both English and non-English languages.

How stemming works: stemming algorithms are typically rule-based. You can view them as heuristic processes that more or less lop off the ends of words. A word is run through a series of conditionals that determine how to cut it down.

Different stemming algorithms: for the English language, we have PorterStemmer, LancasterStemmer and SnowballStemmer.

For non-English languages, we have SnowballStemmers (for Danish, Dutch, English, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish and Swedish), ISRIStemmer (for Arabic) and RSLPStemmer (for Portuguese).


Lemmatization

The word ‘lemma’ means the canonical form, dictionary form, or citation form of a set of words. In lemmatization, the root word is called the lemma.

For lemmatization to resolve a word to its lemma, it needs to know its part of speech. That requires extra computational linguistics power such as a part of speech tagger. This allows it to do better resolutions (like resolving is and are to “be”).

Difference between Stemming and Lemmatization is as follows:

Since a lemma is the base form of all its inflectional forms, whereas a stem isn’t, this causes a few issues:

(a) the stem can be the same for the inflectional forms of different lemmas. This translates into noise in our search results. In fact, it is very common to find word forms that are instances of several lemmas.

(b) the same lemma can correspond to forms with different stems, and we need to treat them as the same word. For example, in Greek, a typical verb has different stems for perfective forms and for imperfective ones. With stemming algorithms we would not be able to relate them to the same verb, but with lemmatization it is possible to do so.
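At its core, lemmatization is a dictionary lookup keyed by part of speech. The tiny LEMMAS table below is purely illustrative; real lemmatizers consult a full lexicon such as WordNet:

```python
# toy (word, POS) → lemma table; a real lemmatizer uses a full dictionary
LEMMAS = {
    ("are", "VERB"): "be",
    ("is", "VERB"): "be",
    ("feet", "NOUN"): "foot",
}

def lemmatize(word, pos):
    # fall back to the lowercased word when no lemma is known
    return LEMMAS.get((word.lower(), pos), word.lower())

print(lemmatize("are", "VERB"))  # → be
```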

Object Standardization

If a document contains words or phrases that are not in standard lexical dictionary form, such words are not recognized by search engines and models. The process of converting such words into their standard forms is called object standardization. E.g.: acronyms (rt → retweet, dm → direct message), hashtags with attached words, and colloquial slang.

How to do this task?

In object standardization, the task at hand plays an important role. With the help of regular expressions and manually prepared data dictionaries, this type of noise can be fixed.
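A sketch with a hand-made data dictionary; the SLANG entries are illustrative, and real systems curate much larger ones:

```python
# manually prepared lookup dictionary (illustrative subset)
SLANG = {"rt": "retweet", "dm": "direct message", "luv": "love"}

def standardize(text):
    # replace each known slang token with its standard form
    return " ".join(SLANG.get(w.lower(), w) for w in text.split())

print(standardize("rt I luv this"))  # → retweet I love this
```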


Collocations

Some combinations of words (phrases) in English make more sense when they co-occur than when they occur individually; such phrases are termed collocations. For example, in a hospital, ‘CT scan’ makes more sense than ‘CT’ and ‘scan’ separately.

The two most common types of collocation are:

(a) bigrams: having two adjacent words together, eg: ‘CT scan’, ‘machine learning’, ‘social media’

(b) Trigrams: three adjacent words taken together, e.g. ‘out of business’, ‘game of thrones’.

Why collocations are important:

a) Keyword extraction: identifying the most relevant keywords in documents to assess what aspects are most talked about
b) Bigrams/Trigrams can be concatenated (e.g. social media -> social_media) and counted as one word to improve insights analysis, topic modeling, and create more meaningful features for predictive models in NLP problems

How to find collocations in a document:

For a given sentence, there can be many combinations of bi-grams and tri-grams, but not every bi-gram is useful. We need a filtering method that picks only the relevant bi-grams and tri-grams.

There are different ways to filter out useful and relevant collocations such as: frequency counting, Pointwise Mutual Information (PMI), and hypothesis testing (t-test and chi-square).
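As an example, PMI for adjacent word pairs can be computed directly from raw counts. This is a sketch; real pipelines also filter by frequency, since PMI alone over-rewards rare pairs:

```python
import math
from collections import Counter

def pmi_bigrams(tokens):
    """Score adjacent word pairs by pointwise mutual information:
    PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) )."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / (n - 1)             # bigram probability
        p_x = unigrams[x] / n              # unigram probabilities
        p_y = unigrams[y] / n
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

scores = pmi_bigrams("new york is a city new york is big".split())
```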

Text to features: Feature Engineering on Text Data

The above documentation gives an overview of pre-processing raw textual data. Now we will move ahead and learn how to extract features from such processed data for further analysis.

Various method to construct textual features are as follows: Syntactical Parsing, Entities extraction, Statistical features, and Word Embeddings.

(1) Syntactical Parsing (Dependency Parsing)

It is the task of extracting a dependency parse of a sentence that represents its grammatical structure and defines the relationships between “head” words and the words that modify those heads. Dependency grammar describes the structure of a sentence as a graph (tree): nodes (V) represent words and edges (E) represent dependencies.

Example showcasing dependency parsing and POS Tagging

(2) Part-of-speech Tagging

The part of speech explains how a word is used in a sentence. There are eight main parts of speech — nouns, pronouns, adjectives, verbs, adverbs, prepositions, conjunctions and interjections.

  • Noun (N)- Daniel, London, table, dog, teacher, pen, city, happiness, hope
  • Verb (V)- go, speak, run, eat, play, live, walk, have, like, are, is
  • Adjective(ADJ)- big, happy, green, young, fun, crazy, three
  • Adverb(ADV)- slowly, quietly, very, always, never, too, well, tomorrow
  • Preposition (P)- at, on, in, from, with, near, between, about, under
  • Conjunction (CON)- and, or, but, because, so, yet, unless, since, if
  • Pronoun(PRO)- I, you, we, they, he, she, it, me, us, them, him, her, this
  • Interjection (INT)- Ouch! Wow! Great! Help! Oh! Hey! Hi!

Most POS are divided into sub-classes. POS Tagging simply means labeling words with their appropriate Part-Of-Speech.

POS tagging is a supervised learning solution. It uses features like the previous word, next word, is first letter capitalized etc.

NLTK has a function to get POS tags, and it works on tokenized text.

The most popular tag set is Penn Treebank tagset. Most of the already trained taggers for English are trained on this tag set. Complete list is available @[8].

POS tagging is used for many important purposes in NLP:

(1) Word sense disambiguation: Some language words have multiple meanings according to their usage. For example, in the two sentences below:

I. “Please book my flight for Delhi”

II. “I am going to read this book in the flight”

“Book” is used in different contexts, and the part-of-speech tags for the two cases are different: in sentence I, the word “book” is used as a verb, while in II it is used as a noun. (The Lesk Algorithm is also used for similar purposes.)

(2) Improving word-based features: when only bare words are used as features, a learning model cannot distinguish the different contexts of a word. For example:

Sentence -“book my flight, I will read this book”

Tokens — (“book”, 2), (“my”, 1), (“flight”, 1), (“I”, 1), (“will”, 1), (“read”, 1), (“this”, 1)

But if the part-of-speech tag is linked with each token, the context is preserved, making stronger features. For example:

Tokens with POS — (“book_VB”, 1), (“my_PRP$”, 1), (“flight_NN”, 1), (“I_PRP”, 1), (“will_MD”, 1), (“read_VB”, 1), (“this_DT”, 1), (“book_NN”, 1)
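Building such features is simple token concatenation; this sketch reuses the tags from the example above:

```python
from collections import Counter

def pos_features(tagged):
    # count word_TAG tokens so 'book' as a verb and as a noun stay distinct
    return Counter(f"{word}_{tag}" for word, tag in tagged)

tagged = [("book", "VB"), ("my", "PRP$"), ("flight", "NN"), ("I", "PRP"),
          ("will", "MD"), ("read", "VB"), ("this", "DT"), ("book", "NN")]
features = pos_features(tagged)
# 'book_VB' and 'book_NN' are now separate features with count 1 each
```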

(3) Normalization and Lemmatization: POS tags are the basis of lemmatization process for converting a word to its base form (lemma).

(4) Efficient stopword removal: POS tags are also useful in the efficient removal of stopwords.

For example, some tags always mark the low-frequency / less important words of a language, e.g.: (IN — “within”, “upon”, “except”), (CD — “one”, “two”, “hundred”), (MD — “may”, “must” etc.)

Entity Extraction

Topic Modelling & Named Entity Recognition are the two key entity detection methods in NLP.

(a) Topic Modelling

It is a process to automatically identify topics present in a text object and to derive hidden patterns present in the text corpus. This helps in better decision making.

Topics can be defined as “a repeating pattern of co-occurring terms in a corpus”. A good topic model should suggest words like “health”, “doctor”, “patient”, “hospital” for a topic — Healthcare, and “farm”, “crops”, “wheat” for a topic — “Farming”.

Note: LDA model is used to perform Topic Modelling.

(b) Named Entity Extraction

Named entity recognition (NER) is the task of tagging entities in text with their corresponding type.

Named Entity Recognition, also known as entity extraction classifies named entities that are present in a text into pre-defined categories like “individuals”, “companies”, “places”, “organization”, “cities”, “dates”, “product terminologies” etc. It adds a wealth of semantic knowledge to your content and helps you to promptly understand the subject of any given text.

Text Matching

It is the task of finding out how similar two documents are. There are generally two ways to perform this task:

(a) Edit distance: also called Levenshtein distance. It computes the edit distance between two words/strings. The algorithm is based on dynamic programming.

Edit Distance formula

If the two characters match, simply take the diagonal element of the matrix and place it in the current cell.

If they do not match, find the minimum of the left, top and diagonal cells, add 1 to it, and place the result in the current cell.

Here (i, j) indexes the current cell, and (i-1, j-1), (i-1, j) and (i, j-1) are its diagonal, top and left neighbours.
Edit distance for the strings ‘strength’ and ‘trend’: the last cell of the matrix holds the edit distance, i.e. the minimum number of single-character edits needed to turn one string into the other.
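The recurrence above translates directly into a dynamic-programming implementation:

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b via dynamic programming."""
    m, n = len(a), len(b)
    # dp[i][j] = edits needed to turn a[:i] into b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # delete all i characters
    for j in range(n + 1):
        dp[0][j] = j                      # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]            # match: take diagonal
            else:
                dp[i][j] = 1 + min(dp[i - 1][j - 1],   # substitution
                                   dp[i - 1][j],       # deletion
                                   dp[i][j - 1])       # insertion
    return dp[m][n]

print(edit_distance("strength", "trend"))  # → 4
```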

(b) Cosine Similarity: Cosine similarity calculates similarity by measuring the cosine of angle between two vectors.

The cosine similarity is advantageous because even if the two similar documents are far apart by the Euclidean distance (due to the size of the document), chances are they may still be oriented closer together.

The smaller the angle, the higher the cosine similarity.

Cosine Similarity Formula
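The formula cos(θ) = (u · v) / (|u| |v|) can be implemented directly for two dense vectors:

```python
import math

def cosine_similarity(u, v):
    # dot product of the two vectors
    dot = sum(a * b for a, b in zip(u, v))
    # Euclidean norms of each vector
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1, 1], [2, 2]))  # → 1.0 (same direction)
```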


References

(1) Tutorial on Regular Expressions:

(2) Stop words in 16 languages:

(3) Stemming and Lemmatization:

(4) Stemming and Lemmatization:


(6) Collocations:

(7) POS Tagging:

(8) Alphabetical list of part-of-speech tags used in the Penn Treebank Project

(9) Dependency parsing: