Named Entity Recognition using roBERTa Base Large Language Model

Christian Grech
4 min read · Jan 24, 2024


Named Entity Recognition (NER) has potential in extracting entities from large raw text bodies. Image generated using OpenDalleV1.1

In today’s fast-paced world, extracting valuable information from vast datasets is crucial. Named Entity Recognition (NER) is a fundamental task in natural language processing that involves identifying entities such as people, locations, and organizations within a text. This project leverages the power of the roBERTa-Base-Multinerd Large Language Model (LLM) for NER, specifically targeting company names within news articles.

The primary objective is to process news articles from the news_articles-new.jsonl dataset, identifying and linking potential companies mentioned in these articles. The results are captured in a file named news_articles-linked.jsonl, following the format of the news_articles-gold dataset, where companies are listed under the ‘annotation’ field. The news_articles-gold dataset already has companies identified and is used to evaluate the performance of the model chosen in this project.

(1) Example from the news_articles-gold dataset (2) Example from the news_articles-new dataset, where the companies are not extracted
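Since each dataset is stored as JSON Lines, the articles can be loaded one object per line. The helper below is an illustrative sketch of that loading step, not the project's actual code:

```python
import json

# Each line of a .jsonl file holds one JSON object (one article),
# so the dataset can be parsed line by line.
def read_jsonl(path):
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]
```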

The model chosen incorporates company identifiers from the company_collection, and any new companies not part of the collection are added with an empty string as the value, creating a dictionary like {'Apple':'apple.com', 'MalWart': ''}. For additional details on the roBERTa-Base-Multinerd LLM, refer to the Hugging Face model documentation here.
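A minimal sketch of this linking step, assuming the collection is a simple name-to-identifier mapping (`link_companies` and the sample data here are illustrative, not the project's actual code):

```python
# Look up each predicted company name in the collection; names not in the
# collection are added with an empty string as their identifier.
def link_companies(predicted, collection):
    """predicted: iterable of company names; collection: {name: identifier}."""
    return {name: collection.get(name, '') for name in predicted}

collection = {'Apple': 'apple.com'}  # assumed shape of company_collection
linked = link_companies(['Apple', 'MalWart'], collection)
# linked == {'Apple': 'apple.com', 'MalWart': ''}
```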

The code and data are publicly available on GitHub. This article will go through the main Python script main.py.

Technologies used: Python, HuggingFace, Natural Language Processing, Named Entity Recognition.

Performance Metric

The project employs accuracy as the performance metric for evaluating the model. Accuracy is defined as the ratio of correct predictions to the total number of annotations in the news_articles-gold dataset. This metric is chosen because alternative predictions for companies not in the company collection file may still be valid.
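The metric can be sketched as follows; the function name and the per-article data shapes are assumptions made for this example:

```python
# Accuracy as defined above: correctly predicted companies divided by the
# total number of gold annotations.
def accuracy(predicted, gold):
    """predicted, gold: lists of per-article sets of company names."""
    correct = sum(len(p & g) for p, g in zip(predicted, gold))
    total = sum(len(g) for g in gold)
    return correct / total if total else 0.0

# One article: two of the four annotated companies were predicted.
score = accuracy([{'Apple', 'IBM'}], [{'Apple', 'IBM', 'Tesla', 'Shell'}])
# score == 0.5
```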

Prerequisites

Ensure the presence of the HUGGINGFACEHUB_API_TOKEN in the .env file in the project folder, using the following format: HUGGINGFACEHUB_API_TOKEN=xxx. Edit the file from the terminal with:

nano .env

Activate the virtual environment. If not available, create a new virtual environment using the provided requirements.txt file:

python3 -m venv venv
. venv/bin/activate
python3 -m pip install -r requirements.txt

Finally run the Python script:

python main.py

Wait for the script to finish, and a new news_articles-linked.jsonl file will be generated.

Method

In the first case, the model is evaluated using the news_articles-gold dataset, to obtain the percentage accuracy of the method.

  • Data Preprocessing

The preprocessing carried out in this project is minimal: removing stopwords (using the stopword list from the nltk Python package) and stripping the whitespace before and after each word.

# Function to remove stopwords from the article text
from nltk.corpus import stopwords

stop_words = stopwords.words('english')

def clean_text(df):
    # Map a regex matching any stopword (as a whole word) to an empty string
    stopword_pattern = {'|'.join(r'\b{}\b'.format(w) for w in stop_words): ''}
    return df.assign(text_cleaned=lambda df_:
                     df_.text.replace(stopword_pattern, regex=True)
                             .str.replace(r'\s+', ' ', regex=True)
                             .str.strip())
  • Querying model using the HuggingFace API

A config object is used to read the API token stored in the .env file. Subsequently, this API token is added to the request header as shown below:

from configobj import ConfigObj

conf = ConfigObj('.env')
API_URL = "https://api-inference.huggingface.co/models/jayant-yadav/roberta-base-multinerd"
headers = {"Authorization": "Bearer " + conf['HUGGINGFACEHUB_API_TOKEN']}

The payload is prepared by iterating through all the articles and sending each article's text to the model one by one.

# Iterate over all articles, querying the model one at a time
for idx, row in df_text.iterrows():
    output = query({"inputs": row['text']})  # Model output

The following is the function used to query the model and extract the detected organizations from the result.

import requests

# Query function to reach the API and extract the detected organizations
def query(payload):
    try:
        response = requests.post(API_URL, headers=headers, json=payload)
        keyValList = ['ORG']  # Keep only detected organizations
        # Keep predictions with a confidence score above 0.2
        result = [d for d in response.json()
                  if d['entity_group'] in keyValList and d['score'] > 0.2]
        return result
    except (requests.RequestException, ValueError, KeyError):
        print('Unable to get data.')
        return []
  • Matching predicted company names with those in the collection

To match the predicted company names with those in the collection, company names are lower-cased, stripped of whitespace, and have their punctuation removed, improving the chances of a match.
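The normalization just described can be sketched like this; the function names and the sample collection are illustrative, not the project's actual code:

```python
import string

# Lower-case, strip surrounding whitespace, and drop punctuation
# before comparing a predicted name against the collection.
def normalize(name):
    return name.lower().strip().translate(
        str.maketrans('', '', string.punctuation))

def match_company(predicted_name, collection):
    # Return the identifier from the collection, or '' for a new company.
    lookup = {normalize(k): v for k, v in collection.items()}
    return lookup.get(normalize(predicted_name), '')
```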

Evaluation Results

Upon evaluation, the model achieves 64% accuracy against the provided annotations. Notably, it also extracts over 70 new company names, suggesting that predictions outside the collection may still prove valuable.

New Company Suggestions

The method is then repeated using the news_articles-new dataset. The model suggests several companies not present in the company collection. Some notable examples include 'TenEleven Ventures', 'TCL Capital', 'Takasago Thermal Engineering Co.', 'University of Pennsylvania', 'Abingworth'. Exploring these suggestions could uncover valuable insights.

Methodology Rationale

Given the time constraints, utilizing a pre-trained model on a related dataset is a pragmatic approach. Challenges such as unclean data are addressed by preprocessing company names: stripping whitespace, converting to lowercase, and removing punctuation marks.

Areas for Improvement

  • Enhance handling of API errors for seamless data retrieval.
  • Investigate companies not predicted from the gold file and address any underlying issues.
  • Explore more robust pre-trained models based on larger datasets.
  • Consider fine-tuning the model once the collected companies list is substantial, requiring ample computational resources.

This project demonstrates the potential of leveraging advanced language models for NER tasks, paving the way for improved information extraction from unstructured text data.


Written by Christian Grech

Christian Grech is a Software Engineer / Data Scientist working on the development of atom-sized quantum sensors in diamonds.
