Workshop

The NLP Canvas!


Presenting at: ICDAR 2021
Venue: Lausanne, Switzerland
Date: September 5-10, 2021

About

NLP techniques to build a Text Summarization System

Have you ever wondered how machines understand natural language? How “Google Translate” works? How “Siri”, a robotic voice, responds to your voice commands? Or how a piece of software understands a text document and automatically summarizes it or extracts the relevant sentences? The answer to all these questions lies in this workshop, where we explore the astounding domain of Natural Language Processing (NLP). We will unveil the core concepts of NLP by building on your own intuition of how you understand natural human language. In this tutorial, we will present the nitty-gritty details and the pipeline of NLP through a real-world example: text summarization. The tutorial also includes exercises for building intuition in the NLP domain and a code walkthrough for a text summarization task. Summarization models can be trained and applied across a range of domains and applications, and can save a great deal of time otherwise spent reading long documents.

Outline

Combination of theory and practical examples

The session will combine some basic theory with hands-on examples covering the building blocks of NLP. At the end, participants will train a text summarization model and watch it in action.
The tutorial will run for 2.5 hours in total.
  • What is Machine Learning?
  • What is Natural Language Processing?
  • Understanding the NLP Pipeline!
  • Preprocessing
    Cleaning
       ➢ Regular Expression
    Chunking
       ➢ Paragraph Detection
       ➢ Sentence Boundary Detection
       ➢ Sentence to Words

    Feature Engineering for ML techniques
    ■ Words: Meanings, Synonyms, Antonyms, Part of Speech (Verb, Adverb, Cardinal), etc.
    ■ Named Entity Recognition, Dependency Parsing, Coreference Resolution, etc.
    ■ Word Normalization: Lemmatization and Stemming
    ■ Keyword recognition
    ■ WordNet, Synsets, Stanford CoreNLP Parser, spaCy, NLTK

    Vector representation and Word Embeddings
    ■ Why is a vector or embedding representation required?
    ■ Bag of Words, n-gram Model
    ■ Skip-gram model
    ■ Count Vectorizer
    ■ Term Frequency-Inverse Document Frequency (TF-IDF) Vectorizer
    ■ Hashing Vectorizer
    ■ Automatic feature selection and vector representation in DL techniques
       ➢ Word2Vec
       ➢ GloVe
    ■ Transformer (encoder) based embedding

    ■ Sample code walkthrough (word vectorization)

  • scikit-learn (sklearn) library to the rescue!
  • Data and ML cookstart!!
    ■ Corpus
    ■ Training, Testing and Validation Phase
    ■ K-fold Cross Validation

    Training the Machine Learning Model
    ■ Evaluation Metrics

    How to improve model performance?
    ■ Error Analysis

    Code walkthrough
    ■ Model for spam/non-spam detection

  • Transformer model
  • Understanding transformers
    ■ Overview of architecture
    ■ Attention mechanism
    ■ Pre-training techniques: MLM, BERT
    ■ Fine-tuning for downstream tasks

  • Code for text summarization
  • Build a text summarization model
    We’ll discuss the two main types of summarization: extractive and abstractive. After covering the overall pre-training procedure and the workings of the T5 language model, we’ll walk through fine-tuning a large language model for abstractive summarization on a custom dataset. Participants will run their models and generate summaries using Google Colab.

  • Questions and Answers (15 mins)
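
As a small taste of the extractive approach named in the outline, here is a minimal sketch in plain Python. It is not the workshop's T5-based code: it simply scores each sentence by the average frequency of its words and keeps the top-scoring ones. The function names (`split_sentences`, `tokenize`, `summarize`), the regex-based sentence splitter, and the tiny stopword list are all illustrative assumptions.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption; real systems use fuller lists).
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "and",
             "in", "it", "on", "for", "with", "that", "this"}


def split_sentences(text):
    """Naive sentence boundary detection: split after ., ! or ? plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def tokenize(sentence):
    """Lowercase the sentence and keep alphabetic tokens that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower())
            if w not in STOPWORDS]


def summarize(text, num_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    # Word frequencies over the whole document.
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        tokens = tokenize(sentence)
        # Average word frequency, so long sentences are not favoured automatically.
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)


if __name__ == "__main__":
    doc = ("Summarization shortens a document. "
           "Extractive summarization selects sentences from the document. "
           "The weather was pleasant.")
    print(summarize(doc, num_sentences=1))
```

An extractive summary only reuses sentences already present in the input, whereas an abstractive model such as T5 generates new text, which is why the latter needs a trained language model rather than a scoring rule like this one.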

Team

Dhara Kotecha, Tutorial Presenter, dhara@infocusp.in
Falak Shah, Tutorial Co-Presenter, falak@infocusp.in
Nisarg Vyas, Tutorial Advisor, nisarg@infocusp.in
Vinish Lonhare, UI Designer, vinish@infocusp.in
Akash Shah, UI Developer, akash@infocusp.in