Training a Custom Named Entity Recogniser

Named Entity Recognition (NER) is a technique in Natural Language Processing where the goal is to extract information such as names of people, locations, and businesses, or anything else with a proper name, from unstructured text. It is an important area of research in machine learning and natural language processing (NLP) because it can be used to answer many real-world questions, such as:
  • Does a given text contain the name of a person?
  • Does the text also provide that person's current location?
  • Which products were mentioned in an invoice?
  • Were specific products mentioned in complaints or reviews?
spaCy has a built-in NER model that you can use in your NLP tasks, but that model is not very useful when you need to extract a very specific type of data from text. When you want to extract custom named entities, you need to create your own NER model, which will be the focus of this blog. I will show you how you can create your own named entity recogniser for your needs.

Here, I will use some examples of bank transaction SMS messages. Our goal is to extract the mode of the transaction (whether the amount was credited or debited), the amount itself, and the account number. Before that, let's try spaCy's built-in NER model on one example so that we understand the need for a custom model.

Example text: "A/c XXXX credited by Rs. 1,000 Total Bal: Rs.  12,024.79 CR Clr Bal: Rs. 12,024.79 CR. Call xxxxx if txn not done by you to block account/card.-xxx"

spaCy's built-in NER:


As you can see, it does not perform well on this type of text, where we need to find a specific kind of information in the given text.

Step 1: Gather Training Data

Just like any other machine learning task, to create an NER model you need to gather training data.
Here we need the training data in a specific format in order to train the NER model. The format looks like this:
training_data=[
  ('A/c 3XXXXX9280 debited by Rs. 1,000 Total Bal: Rs.  12,024.79 CR Clr Bal: Rs. 12,024.79 CR. Call 1800221911 if txn not done by you to block account/card.-CBoI',
  {
    'entities': [
      (4,14,'account no'),
      (15,35,'debited amount'),
    ]
  }),
  ('A/c 3XXXXX9280 credited for Rs. 5,000 Total Bal: Rs.  13,024.79 CR Clr Bal: Rs. 13,024.79 CR. Call 1800221911 if txn not done by you to block account/card.-CBoI',
  {
    'entities': [
      (4,14,'account no'),
      (15,37,'credited amount'),
    ]
  })
]

Here, each training example is a tuple containing the raw text of the SMS and a dictionary with a list of entities found in that text.
Each entity in the list is a tuple containing the character offsets where the entity starts and ends in the text, along with the entity label.
In the first example, the account number starts at index 4 and ends at index 14, and has the label 'account no'.

How did I get the data in this format?
    Well, I used Python's built-in regex module (re) to find the span of a given sub-string within the main string, and created a tuple out of it. You can write similar code yourself or just do it manually.
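As a concrete sketch of that approach, the helper below (my own illustration, not part of spaCy) locates each sub-string with re.search and builds the (start, end, label) tuples:

```python
import re

def make_example(text, spans):
    """Build a spaCy-style training tuple from (substring, label) pairs.

    `spans` is a list of (substring, label); the character offsets of the
    first occurrence of each substring are located with re.search.
    """
    entities = []
    for substring, label in spans:
        match = re.search(re.escape(substring), text)
        if match:
            entities.append((match.start(), match.end(), label))
    return (text, {'entities': entities})

sms = ('A/c 3XXXXX9280 debited by Rs. 1,000 Total Bal: Rs.  12,024.79 CR '
       'Clr Bal: Rs. 12,024.79 CR. Call 1800221911 if txn not done by you '
       'to block account/card.-CBoI')
example = make_example(sms, [('3XXXXX9280', 'account no'),
                             ('debited by Rs. 1,000', 'debited amount')])
print(example[1])  # {'entities': [(4, 14, 'account no'), (15, 35, 'debited amount')]}
```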

Step 2: Installing spaCy

If you already have spaCy installed you can skip this step; if not, let's install it.
To install spaCy, open your command prompt or terminal and type these two commands:
pip install -U spacy  # installs spaCy
python -m spacy download en_core_web_sm  # downloads spaCy's English language model

The first command installs spaCy as a normal Python package, and the second downloads spaCy's pretrained English language model, which is the heart of spaCy.

Step 3: Importing the libraries and loading the language model

Once you have spaCy installed, let's move on by importing it, along with a couple of other modules that will be needed for our task.
import spacy
from spacy import displacy

displacy is a module that helps us visualize the entities.
Next, we load the English language model using spacy.load():
nlp = spacy.load('en_core_web_sm')

This English model we just loaded already has a predefined pipeline built into it, consisting of a tagger, a parser, and ner. You can check this with print(nlp.pipe_names), which prints the names of the pipeline components. For now, we are only interested in the ner component, so we will get that pipe and modify it.
ner = nlp.get_pipe("ner")

Once we have the 'ner' pipe we can train it on our custom data, but before that we want to disable training of every part of the pipeline except ner.
disable_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']

nlp.disable_pipes(*disable_pipes)  # pipes stay disabled until restored
optimizer = nlp.resume_training()

These lines disable all pipes except ner and fetch an optimizer for resuming training. This is important because we are fine-tuning a pretrained model and don't want to retrain every single part of the pipeline; the disabled pipes remain off throughout the training loop.

Step 4: Training our own NER model

We start off by adding the entity labels from our training data to the ner pipe.
for _, annotations in training_data:
    for ent in annotations.get('entities'):
        ner.add_label(ent[2])

Now the fun part, let's start the training:
import random

for itn in range(100):
    print("Starting iteration " + str(itn))
    random.shuffle(training_data)
    losses = {}

    for text, annotation in training_data:
        nlp.update(
            [text],
            [annotation],
            drop=0.5,
            sgd=optimizer,
            losses=losses
        )
    print(losses)

new_model = nlp

The training process can take a while.
  • At each iteration, we shuffle the training data so that the model doesn't learn anything from the order of the examples.
  • The model is updated by calling nlp.update, which steps through the words of each input text; the model attempts to predict the correct label for each word, and its weights are adjusted according to whether the prediction was right or wrong.
  • A dropout rate of 0.5 helps prevent overfitting by randomly dropping features during training, so that the model is less likely to simply memorize the training examples.
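The loop above updates on one example at a time; with a larger dataset you would typically feed nlp.update small batches instead. spaCy ships a spacy.util.minibatch helper for chunking the shuffled data; here is a minimal sketch of how it splits a list, using plain integers as stand-ins for the training examples:

```python
from spacy.util import minibatch

data = list(range(10))  # stand-in for training_data
for batch in minibatch(data, size=4):
    print(batch)  # chunks of up to 4 items
```

In a batched loop you would pass each batch's texts and annotations to nlp.update in one call, keeping the same drop, sgd, and losses arguments.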

Step 5: Saving the model to disk

Once the training process is complete, you can save the model to disk.
# Output directory
from pathlib import Path
output_dir = Path('/path/nlpmodel')

# Saving the model to the output directory
if not output_dir.exists():
  output_dir.mkdir()
nlp.meta['name'] = 'ner_model'  # rename model
nlp.to_disk(output_dir)
print("Saved model to", output_dir)

Step 6: Loading the saved model

If you saved your model to the output directory, you can load it again using spacy.load().
# Load the saved model and predict
print("Loading from", output_dir)
nlp_updated = spacy.load(output_dir)
doc = nlp_updated("A/c 2XXXXX9499 debited by Rs. 1,000 Total Bal: Rs.  12,024.79 CR Clr Bal: Rs. 12,024.79 CR. Call 1800221011 if txn not done by you to block account/card.-xxxx" )
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])

This will print the entities found in the text, labelled with the custom entity types we trained the model on.

Step 7: Visualize the output

To get a nice visualization, spaCy comes with a really good module called displacy, which can render these entities nicely. Let's take a look at it.

from spacy import displacy
displacy.render(doc, style='ent', jupyter=True)



Thanks for reading, I hope you were able to create your own NER model using this. 
