Text Summarization in Natural Language Processing

Sakshi Kulkarni
10 min read · May 13, 2022

Authors: Sakshi Kulkarni; Pranesh Kulkarni; Shubham Deshmukh; Tejas Rajuskar

Introduction:

Natural language processing (NLP) is a subset of artificial intelligence that examines computers’ interactions with human languages and analyses vast amounts of natural language data. Unlike classification and regression, which produce a single output value, the most challenging NLP tasks, such as translation, summarization, and conversation, require a full-text output. Text summarization is the technique of compressing long pieces of text without losing the semantic structure of the original. As a result, this challenging task of text summarization can be applied in areas such as text classification, news summarization, headline generation, and so on.

In the era of big data, when we visit a website with an excessive number of articles, most of them are unlikely to be of interest to us. With the help of natural language processing, we can automate text summarization, producing a fluent summary that preserves the meaning of the original document without any human help.

Types of Text Summarization:

Text summarization methods are classified as follows:

Fig.1 Different Types of Text Summarization

Based on the input type, it is classified as:

  • Single Document: Single-document summarization reduces the length of one text while keeping the essential information. It relies on the cohesiveness and infrequent repetition of facts within a single document to generate the summary.
  • Multi-Document: Multi-document summarization converts a set of documents into a short piece of text by preserving the key information and filtering out the unnecessary. Drawing on several documents increases the likelihood of redundant and repeated content that must be removed.

Based on its purpose, it is classified as:

  • Domain-specific: Domain knowledge is used in domain-specific summarization. Domain-specific summarizers can combine specific context, knowledge, and language to create a precise summary of the content. For example, models can be paired with healthcare terminology to improve learning and summarize scientific texts.
  • Query-based: The major focus of query-based summarization is natural language queries. It is similar to search engine behaviour: the summary is built from the parts of the text that answer the query. For example, when we type a question into Google’s search box, it often returns web pages or articles that answer that question.
  • Generic: In contrast to domain-specific or query-based summarizers, generic summarizers make no assumptions about the content or domain of the text to be analyzed and treat all inputs in the same way. They simply condense the content of the original article.

Based on output type, it is classified as:

  • Extractive: In extraction-based summarization, the summary is formed by pulling the relevant sentences or phrases from the original text and combining them into a subset that represents the most important features of the piece. Extractive methods typically work by weighting the most important sentences and using those scores to assemble the summary.
  • Abstractive: In this output type, models generate their own sentences and phrases that are not found in the original text, producing a more coherent summary in alternative words. Abstractive methods select keywords based on semantic comprehension, even when those words do not occur in the source, and use advanced natural language techniques to read and examine the content and generate a new, shorter text that conveys the most significant information from the original.

Text Summarization Models:

Text summarization models can be divided into the same two categories: extractive and abstractive. The extractive approach was developed first; its main objective is to identify the significant sentences and compose a summary from exact words and sentences of the original text. The abstractive approach came later: by learning an internal language representation and paraphrasing the original text’s meaning, it produces more human-like summaries.

The paragraph of input text is given below.

Natural Language Processing (NLP) makes it possible for computers to understand the human language. Behind the scenes, NLP analyses the grammatical structure of sentences and the individual meanings of words, then uses algorithms to extract meaning and deliver output. In other words, it makes sense of human language so that it can automatically perform different tasks. Probably the most popular examples of NLP in action are virtual assistants like Google Assist, Siri, and Alexa. NLP understands written and spoken text like “Hey Siri, where is the nearest gas station?” and transforms it into numbers, making it easy for machines to understand. Another well-known application of NLP is chatbots. They help support teams solve issues by understanding common language requests and responding automatically. There are many other everyday apps you use where you’ve probably encountered NLP without even noticing. When writing an email, consider text recommendations, offering to translate a Facebook post written in a different language, or filtering unwanted promotional emails into your spam folder.

The original paragraph is summarized below using natural language processing. Several text summarization approaches, including the TextRank, LexRank, Latent Semantic Analysis, Seq2Seq, OpenAI’s GPT-3, and BART methods, are discussed in this blog.

1. TextRank

TextRank is an unsupervised extractive text summarization approach. It works on any piece of text and does not require any previous data for training. It is a graph-based natural language processing ranking method, based on Google’s PageRank algorithm, that selects the most relevant sentences in a text. The stages of the TextRank algorithm are represented in Figure 2.

Fig.2 Flowchart of TextRank Algorithm, Source: https://www.analyticsvidhya.com/blog/2018/11/introduction-text-summarization-textrank-python/

The first stage is to combine multiple texts into one article. The text is then divided into individual sentences, and a vector representation is found for each sentence. Next, the similarity between each pair of sentence vectors is calculated and stored in a matrix. For sentence-rank computation, the similarity matrix is converted into a graph with sentences as nodes and similarity scores as edges. Finally, the summary is composed of a set of the top-ranked sentences.

The TextRank algorithm code and generated results for a given paragraph are shown below:

TextRank text summarization algorithm code
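
A rough sketch of such an implementation, assuming NLTK for sentence splitting and stop words and networkx for the PageRank step; the preprocessing choices here are illustrative and may differ from the original code.

```python
import nltk
import numpy as np
import networkx as nx
from nltk.corpus import stopwords
from nltk.cluster.util import cosine_distance

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

def sentence_similarity(s1, s2, stop_words):
    # Bag-of-words vectors over the two sentences' shared vocabulary.
    w1 = [w.lower() for w in nltk.word_tokenize(s1) if w.isalnum() and w.lower() not in stop_words]
    w2 = [w.lower() for w in nltk.word_tokenize(s2) if w.isalnum() and w.lower() not in stop_words]
    vocab = list(set(w1) | set(w2))
    v1 = np.array([w1.count(w) for w in vocab], dtype=float)
    v2 = np.array([w2.count(w) for w in vocab], dtype=float)
    if not v1.any() or not v2.any():
        return 0.0
    return 1 - cosine_distance(v1, v2)

def textrank_summary(text, top_n=3):
    stop_words = set(stopwords.words("english"))
    sentences = nltk.sent_tokenize(text)
    # Pairwise similarity matrix: sentences become nodes, similarities edge weights.
    sim = np.zeros((len(sentences), len(sentences)))
    for i in range(len(sentences)):
        for j in range(len(sentences)):
            if i != j:
                sim[i, j] = sentence_similarity(sentences[i], sentences[j], stop_words)
    # PageRank over the similarity graph scores each sentence.
    scores = nx.pagerank(nx.from_numpy_array(sim))
    ranked = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)
    return " ".join(s for _, s in ranked[:top_n])

paragraph = "..."  # paste the input paragraph shown above
print("Summarize Text:", textrank_summary(paragraph))
```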

Summarize Text: In other words, it makes sense of human language so that it can automatically perform different tasks. Natural Language Processing (NLP) makes it possible for computers to understand the human language. Probably the most popular examples of NLP in action are virtual assistants like Google Assist, Siri, and Alexa

2. LexRank

LexRank is an unsupervised machine learning approach that, like TextRank, ranks sentences to summarize a given text, but it is somewhat more sophisticated. The primary principle of the LexRank model is to surface the sentences most representative of the text for the reader. Using cosine similarity over vector representations, it finds the smallest cosine distances between sentences and groups the most similar ones together. Each sentence is then scored by the centrality of its node in the similarity graph, and that value is distributed among its neighbours.

The following are the results using the LexRank algorithm code for a given paragraph:

LexRank text summarization algorithm code
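
A minimal sketch using the sumy library’s ready-made LexRank summarizer; the tokenizer language and sentence count are illustrative choices.

```python
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

paragraph = "..."  # paste the input paragraph shown above
parser = PlaintextParser.from_string(paragraph, Tokenizer("english"))

# Score sentences by graph centrality and keep the top three.
summarizer = LexRankSummarizer()
summary = summarizer(parser.document, sentences_count=3)
print("Summarize Text:", " ".join(str(sentence) for sentence in summary))
```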

Summarize Text: ‘Natural Language Processing (NLP) makes it possible for computers to understand the human language. Behind the scenes, NLP analyses the grammatical structure of sentences and the individual meanings of words, then uses algorithms to extract meaning and deliver output. In other words, it makes sense of human language so that it can automatically perform different tasks. Probably the most popular examples of NLP in action are virtual assistants like Google Assist, Siri, and Alexa. NLP understands written and spoken text like “Hey Siri, where is the nearest gas station?” and transforms it into numbers, making it easy for machines to understand. Another well-known application of NLP is chatbots. They help support teams solve issues by understanding common language requests and responding automatically. There are many other everyday apps you use where you’ve probably encountered NLP without even noticing. When writing an email, consider text recommendations, offering to translate a Facebook post written in a different language, or filtering unwanted promotional emails into your spam folder.’

3. Latent Semantic Analysis (LSA)

Latent Semantic Analysis (LSA), also known as Latent Semantic Indexing (LSI), uses a bag-of-words (BoW) approach to build a frequency matrix of the keywords in a document collection, where the rows are keywords and the columns are documents. LSA then identifies latent concepts by performing singular value decomposition (SVD) on this document-term matrix. SVD is a matrix factorization method that represents a matrix as the product of three matrices:

M = U Σ V*

where M is the m×n document-term matrix, U is an m×m matrix of left singular vectors, Σ is an m×n diagonal matrix of non-negative real singular values, V is an n×n matrix of right singular vectors, and V* is the transpose of V.
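
As a quick illustration on a toy keyword-document matrix (the values are invented for the example), NumPy’s SVD recovers exactly this factorization:

```python
import numpy as np

# Toy frequency matrix: rows are keywords, columns are documents.
M = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 2.0]])

U, S, Vt = np.linalg.svd(M, full_matrices=False)
print(U.shape, S.shape, Vt.shape)           # (4, 3) (3,) (3, 3)
print(S)                                    # singular values, largest first
print(np.allclose(M, U @ np.diag(S) @ Vt))  # True: M = U Σ V*
```

The largest singular values mark the dominant latent concepts in the collection.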

Latent Semantic Analysis breaks the input down into a low-dimensional space, which lets it retain the meaning of a given text while summarizing. Under one interpretation of this spatial decomposition, the singular vectors capture recurring word combinations and sequences in the data, and the magnitude of each singular value indicates that pattern’s relevance in the document. For a given paragraph, the output using Latent Semantic Analysis is shown below:

LSA text summarization algorithm code
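
A minimal sketch using sumy’s LSA summarizer, analogous to the LexRank sketch above; the sentence count is again an illustrative choice.

```python
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer

paragraph = "..."  # paste the input paragraph shown above
parser = PlaintextParser.from_string(paragraph, Tokenizer("english"))

# SVD of the term-sentence matrix selects sentences covering the main latent topics.
summary = LsaSummarizer()(parser.document, sentences_count=3)
print("Summarize Text:", " ".join(str(sentence) for sentence in summary))
```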

Summarize Text: Probably the most popular examples of NLP in action are virtual assistants like Google Assist, Siri, and Alexa. NLP understands written and spoken text like “Hey Siri, where is the nearest gas station?” They help support teams solve issues by understanding common language requests and responding automatically.

4. Seq2Seq (Sequence to Sequence)

A Seq2Seq model is a supervised machine learning approach for problems involving sequential data. Popular applications of sequential data include sentiment classification and machine translation. In named entity recognition, for example, a sequence of words is the input, and the output is a sequence of labels, one for each word.

The encoder and decoder are the two primary elements of sequence-to-sequence modeling. The encoder is a Long Short-Term Memory (LSTM) network that reads the entire input sequence, with one word transferred into the encoder at each time step; at each step it processes the data and retains the significant information from the input sequence. The decoder is also an LSTM network: it reads the entire target sequence word by word and predicts the same sequence offset by one time step, trained to predict the next word in the sequence given the previous word.
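
A minimal Keras sketch of this encoder-decoder architecture; the vocabulary size and layer widths are illustrative assumptions, and a real summarizer would also need training data, attention, and an inference loop.

```python
from tensorflow.keras import layers, Model

vocab_size, embed_dim, latent_dim = 10000, 128, 256  # assumed sizes

# Encoder: reads the source text one token per time step and keeps its final state.
enc_inputs = layers.Input(shape=(None,))
enc_emb = layers.Embedding(vocab_size, embed_dim)(enc_inputs)
_, state_h, state_c = layers.LSTM(latent_dim, return_state=True)(enc_emb)

# Decoder: predicts the target sequence one step behind, starting from the
# encoder's final state, so each word is predicted from the previous one.
dec_inputs = layers.Input(shape=(None,))
dec_emb = layers.Embedding(vocab_size, embed_dim)(dec_inputs)
dec_outputs, _, _ = layers.LSTM(
    latent_dim, return_sequences=True, return_state=True
)(dec_emb, initial_state=[state_h, state_c])
probs = layers.Dense(vocab_size, activation="softmax")(dec_outputs)

model = Model([enc_inputs, dec_inputs], probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```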

5. OpenAI’s GPT-3

The OpenAI team developed a series of deep learning and natural language processing-based models known as the Generative Pre-trained Transformer (GPT). OpenAI researchers built GPT-1, GPT-2, and GPT-3 as successively more complicated models that produce increasingly human-like text. GPT-2 has been used for news writing and coding, and it can also keep track of the relationships between the individuals named in a text.

GPT-3 differs from other text summarization models in that it does not require any modification or fine-tuning to complete the operations mentioned below. Developers can steer the model with plain instructions through OpenAI’s “text in, text out” API.

GPT-3 is a transformer-based NLP model, trained by 2020 with 175 billion parameters on roughly 45 TB of text collected from the internet. It supports translation, answering inquiries, composing poems, completing tasks, and activities that require reasoning along the way, such as unscrambling words.
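
A sketch of how such a summary might be requested through OpenAI’s “text in, text out” completions API as it existed at the time of writing; the engine name, prompt suffix, and token limit are illustrative assumptions.

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder; use your own key

paragraph = "..."  # paste the input paragraph shown above

# Appending "Tl;dr:" prompts the model to complete the text with a summary.
response = openai.Completion.create(
    engine="text-davinci-002",
    prompt=paragraph + "\n\nTl;dr:",
    max_tokens=80,
    temperature=0.3,
)
print("Summarize Text:", response.choices[0].text.strip())
```

The following is the output of this transformer-based GPT-3 text summarization approach: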

Summarize Text: NLP is a branch of artificial intelligence that deals with the interaction between computers and human (natural) languages. NLP is used to build applications that can automatically understand and respond to human language, such as chatbots and virtual assistants. NLP is also used for tasks such as text classification

6. BART

BART (Bidirectional and Auto-Regressive Transformers), developed by Facebook, combines a bidirectional encoder, as in BERT, with a left-to-right autoregressive decoder, as in GPT, in a standard Seq2Seq architecture. By processing a sequence all at once and mapping the connections between words regardless of where they sit in the text, such language models can execute a wide range of NLP tasks; as a result, the same word can have different vectors in its embedding depending on context, and the models have hundreds of millions of parameters to train. The output of the original paragraph summarized using the BART model is shown below.

BART text summarization algorithm code
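
A minimal sketch using Hugging Face’s transformers pipeline with a pre-trained BART checkpoint; the model name and length limits are illustrative choices.

```python
from transformers import pipeline

# facebook/bart-large-cnn is a BART checkpoint fine-tuned for summarization.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

paragraph = "..."  # paste the input paragraph shown above
result = summarizer(paragraph, max_length=80, min_length=20, do_sample=False)
print("Summarize Text:", result[0]["summary_text"])
```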

Summarize Text: ‘Natural Language Processing (NLP) makes it possible for computers to understand the human language’ ‘Probably the most popular examples of NLP in action are virtual assistants like Google Assist, Siri, and Alexa’ ‘NLP analyses the grammatical structure of sentences and the individual meanings of words’

Conclusion:

This blog analyzed six popular text summarization approaches, ranging from unsupervised methods such as TextRank, LexRank, and LSA to supervised methods such as the word-embedding-based Seq2Seq model and the pre-trained GPT-3 and BART models. The advantages of automatic text summarization extend to resolving immediate, practical issues: by generating text summaries automatically, content editors save time and effort that would otherwise be spent manually creating article summaries. The following are some significant advantages of text summarization:

  • Instantly effective: Reading a full article, analyzing it, and extracting the important concepts from the raw text takes a lot of time. Automatic text summarization allows individuals to summarize a piece of information within seconds, saving the reader time and minimizing the effort required to locate the necessary details.
  • Increased productivity: Text summarization improves productivity by allowing the user to review the contents of a text as accurate, brief, and precise information. By reducing the amount of text, the tool frees up the user’s time and boosts productivity by letting the user concentrate on important tasks.
  • Includes all critical information: Automated software does not skip important details the way the human eye can. Every reader wishes to extract the most relevant information from a piece of writing, and with automatic text summarization the user can easily obtain all of the key information in a document.

References:

1. https://broutonlab.com/blog/summarization-of-medical-texts-machine-learning

2. https://www.machinelearningplus.com/nlp/text-summarization-approaches-nlp-example/

3. https://www.impelsys.com/an-overview-of-text-summarization-in-natural-language-processing/

4. https://medium.com/luisfredgs/automatic-text-summarization-with-machine-learning-an-overview-68ded5717a25

5. https://towardsdatascience.com/a-quick-introduction-to-text-summarization-in-machine-learning-3d27ccf18a9f

6. https://www.analyticsvidhya.com/blog/2018/11/introduction-text-summarization-textrank-python/

7. https://towardsdatascience.com/text-summarization-with-nlp-textrank-vs-seq2seq-vs-bart-474943efeb09

8. https://www.analyticsvidhya.com/blog/2021/09/latent-semantic-analysis-and-its-uses-in-natural-language-processing/
