Processing Text data in Natural Language Processing

Shivangi Singhal
13 min read · Jun 7, 2019

Hello Everyone!

This blog is for beginners who want to start their journey in NLP. I have tried to explain things as simply as possible to give you a smooth start.

The first and most important step in dealing with text is pre-processing. It consists of a series of steps that make the data ready for computers to read and analyze. Once pre-processing is complete, the next step is to extract features from the text for further analysis. All this will be explained here, so let’s get started!

This blog introduces the theoretical concepts; for the practical version, please refer to the Python code on GitHub.

What is NLP?

NLP is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages.

Data Pre-Processing

It is also called data cleaning or data wrangling. It is the process of converting data from its initial raw form into another format so that the prepared data can be used for further analysis.

Why do we need data pre-processing?

  • Data Cleaning: removes noise and resolves inconsistencies in the data.
  • Data Integration: merges data from multiple sources. This process must be implemented carefully to avoid redundancies and inconsistencies in the resulting data set. Conflicts within the data are resolved.
  • Data Transformation: the data is normalized, aggregated and generalized.
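To make the three ideas above concrete, here is a minimal sketch using only the Python standard library. The two input lists and the helper names `clean` and `transform` are made up for illustration; real projects would typically apply richer rules at each stage.

```python
import re

# Two hypothetical sources of raw review text (made up for illustration).
source_a = ["  Great   phone!!  ", "Battery is OK"]
source_b = ["battery is ok", "Screen cracked :("]

def clean(text):
    """Data cleaning: collapse repeated whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

def transform(text):
    """Data transformation: normalize case so variants compare equal."""
    return text.lower()

# Data integration: merge both sources, dropping records that become
# identical after cleaning and normalization (a redundancy).
merged = []
seen = set()
for raw in source_a + source_b:
    record = transform(clean(raw))
    if record not in seen:
        seen.add(record)
        merged.append(record)

print(merged)
```

Here "Battery is OK" and "battery is ok" come from different sources but describe the same record, so only one copy survives the merge.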

The pre-processing steps covered here are:

(1) Lowercasing

(2) Tokenization

(3) Punctuation removal

(4) Contraction expansion

(5) Stop word removal

(6) Alphanumeric character removal

(7) Stemming and lemmatization

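Converting to lowercase

Converting the text to lowercase is usually the first step, done before any other pre-processing task. It keeps the output consistent: words such as “Text”, “TEXT” and “text” are treated as the same word instead of three different ones.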
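As a small illustration of why this matters, the sketch below uses Python's built-in `str.lower()`. The sample tokens and sentence are made up for the example.

```python
# Hypothetical tokens that differ only in letter case.
raw_tokens = ["NLP", "Nlp", "nlp", "Text", "TEXT"]

# Lowercase every token with the built-in str.lower().
lowered = [token.lower() for token in raw_tokens]

# Before lowercasing, every spelling counts as a distinct word;
# afterwards only the two underlying words remain.
print(len(set(raw_tokens)))
print(sorted(set(lowered)))

# The same call works on a whole sentence.
sentence = "Natural Language Processing is FUN"
print(sentence.lower())
```

Note that lowercasing discards information that some tasks rely on (for example, "US" the country versus "us" the pronoun), so whether to apply it depends on the problem at hand.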