Hierarchical Sorting of Legal Text Using NLP

Christian Grech
6 min read · Jan 20, 2024
NLP can help Legal companies streamline the review of legal documents. Image generated using OpenDALLEv1.1

In the ever-evolving landscape of Legal Tech, one of the key challenges faced by companies revolves around the intricate task of reconstructing hierarchical structures within regulatory documents after parsing. The ability to establish clear parent-child relationships between paragraphs is pivotal for organizing data in a way that not only benefits our customers but also fuels various machine learning applications.

In response to this challenge, I have developed a robust system capable of organizing a given list of paragraphs into a hierarchy by defining precise parent-child relationships between them. In this article, I will walk you through the process, share insights into the dataset used, and discuss the solution’s performance.

Tech used in this project: Python, Machine Learning, Classifiers, HTML, Natural Language Processing, Recursive functions

The code and data are publicly available on GitHub. This article walks through the main Jupyter notebook, Hierarchy.ipynb.

Dataset Description

The dataset, stored in data.json, comprises 785 paragraphs arranged in reading order. Each paragraph is uniquely identified by an ‘id’, and the ‘parent_id’ field holds a reference to the parent paragraph. If the ‘parent_id’ is null, it indicates that the paragraph has no parent. Additionally, the ‘html’ field provides the HTML representation of the paragraph.

Example from the data.json file
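For illustration, records with this schema look roughly like the following (hypothetical values, not taken from the actual file):

```python
import json

# Hypothetical records illustrating the schema described above; the real
# data.json holds 785 such paragraphs in reading order.
records = [
    {"id": 1, "parent_id": None, "html": "<h1>General Provisions</h1>"},
    {"id": 2, "parent_id": 1, "html": "<p>(a) Scope of application</p>"},
]
print(json.dumps(records[1], indent=2))
```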

The real test comes with test_data.json, which contains 48 paragraphs from a legal document similar to the one used for data.json. As in the training set, ‘id’ is the unique identifier and ‘html’ is the HTML representation, but the crucial ‘parent_id’ field is absent: the model’s task is to predict it from the learned hierarchical relationships.

Example from the test_data.json file

Preprocessing

The first step involves importing data from the two JSON files and preprocessing it for further analysis.

import pandas as pd

# Import the two datasets
df = pd.read_json('data.json')
test_df = pd.read_json('test_data.json')

The HTML text is sliced to its first 25 characters, since we only need the HTML tags enclosed in the < > symbols.

# Slice the html code, as we are only interested in the HTML tags
df['short_html'] = df['html'].str.slice(0, 25)

A recursive function is used to define the hierarchical level of each text, with 0 being the root level for texts without a parent id, and the number increasing going down the hierarchy. The hierarchical level will be the target column of our model.

# Create a dictionary to store the hierarchical level for each paragraph
hierarchical_levels = {}

# Initialize the hierarchical level for the root paragraphs
for i, row in df.iterrows():
    if row['parent_id'] is None:
        hierarchical_levels[row['id']] = 0

# Recursively update the hierarchical level for child paragraphs (Source: Bard)
def update_hierarchical_levels(paragraph_id):
    if paragraph_id not in hierarchical_levels:
        parent_id = df[df['id'] == paragraph_id]['parent_id'].values[0]
        # Resolve the parent's level first so the lookup below cannot fail
        update_hierarchical_levels(parent_id)
        hierarchical_levels[paragraph_id] = hierarchical_levels[parent_id] + 1

for i, row in df.iterrows():
    if row['parent_id'] is not None:
        update_hierarchical_levels(row['id'])

# Add the hierarchical level to the DataFrame
df['hierarchical_level'] = df['id'].apply(lambda id: hierarchical_levels[id])

Let’s see an example of the dataframe at this stage:

An example of the dataframe with the absolute hierarchical levels defined
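As a quick sanity check, the same recursive logic can be exercised on a toy set of paragraphs (hypothetical ids, not from the dataset), without needing pandas:

```python
# Toy paragraphs: id -> parent_id, where None marks a root paragraph
parents = {1: None, 2: 1, 3: 2, 4: 1, 5: None}

levels = {}

def level_of(pid):
    # Recursively resolve a paragraph's depth from its chain of parents
    if pid not in levels:
        parent = parents[pid]
        levels[pid] = 0 if parent is None else level_of(parent) + 1
    return levels[pid]

for pid in parents:
    level_of(pid)

print(levels)  # {1: 0, 2: 1, 3: 2, 4: 1, 5: 0}
```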

HTML tags are extracted as features to identify patterns within paragraphs and headers. This step is crucial in training a model that understands the inherent structure of legal documents. The vectorizer is therefore applied with a token pattern able to extract tags such as <div>, <p>(a), <p>(i), etc. This is critical, as the tag determines whether the text is a header or part of a list.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word', lowercase=False, token_pattern=r'\W+[A-Za-z\d]*\W\d?')
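To see what this token pattern actually captures, it can be fitted on a couple of short hypothetical snippets (scikit-learn assumed available; these are not data from the project):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same token pattern as in the notebook, applied to two hypothetical snippets
demo_vectorizer = CountVectorizer(analyzer='word', lowercase=False,
                                  token_pattern=r'\W+[A-Za-z\d]*\W\d?')
demo_vectorizer.fit(["<div>Chapter 1", "<p>(a) a list item"])
print(sorted(demo_vectorizer.vocabulary_))  # tokens such as '<div>', '<p>' and '(a)' appear
```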

Model Training

A training-validation split is performed to ensure the model’s performance can be accurately evaluated. The model takes as input the vectorized html text and predicts the hierarchical level.

Multiple classifiers are trained, and their performance is evaluated using the validation dataset. The goal is to identify the best-performing classifier for the given task. The eight models trained are “Nearest Neighbors”, “Gaussian Process”, “Decision Tree”, “Random Forest”, “Neural Net”, “AdaBoost”, “Naive Bayes” and “XGBoost”. Below are the performances based on the validation dataset.
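A minimal sketch of such a comparison loop, on hypothetical features and with only two of the eight classifiers (this is not the notebook's actual training code):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical vectorized features and hierarchical-level targets
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 10))
y = (X[:, 0] + X[:, 1]).astype(int)  # toy levels 0-2 derived from the features

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

for name, clf in [("Nearest Neighbors", KNeighborsClassifier()),
                  ("Decision Tree", DecisionTreeClassifier(random_state=0))]:
    clf.fit(X_train, y_train)
    pred = clf.predict(X_val)
    print(name, accuracy_score(y_val, pred), balanced_accuracy_score(y_val, pred))
```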

Accuracy and Balanced Accuracy for each trained model

The XGBoost Classifier emerges as the top performer. It exhibits sound performance metrics: an Accuracy of 0.96 and a Balanced Accuracy of 0.85. These metrics underscore the model’s proficiency in accurately predicting the hierarchical relationships between paragraphs. The confusion matrix based on the validation dataset is shown below.

Confusion matrix for the XGBoost model

The model is then retrained on the full dataset and applied to the paragraphs in test_data.json to predict their hierarchical levels, from which the ‘parent_id’ relationships are reconstructed.
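Since the classifier outputs a hierarchical level per paragraph, the ‘parent_id’ values still have to be derived. One possible post-processing step (a sketch, not necessarily the notebook’s exact approach) walks the paragraphs in reading order and attaches each one to the most recent earlier paragraph one level above it:

```python
def parents_from_levels(ids, levels):
    """Derive parent ids: each paragraph's parent is the most recent
    earlier paragraph whose level is exactly one less than its own."""
    last_at_level = {}
    parent_ids = []
    for pid, level in zip(ids, levels):
        parent_ids.append(last_at_level.get(level - 1))  # None for roots
        last_at_level[level] = pid
    return parent_ids

# Reading order: root, child, grandchild, second child, new root
print(parents_from_levels([1, 2, 3, 4, 5], [0, 1, 2, 1, 0]))
# [None, 1, 2, 1, None]
```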

Generating HTML file

An HTML file displaying the hierarchy with visible indentations is generated as shown below. The test dataframe is iterated, and each paragraph is indented in proportion to its predicted hierarchical level.

# Generate HTML with indentations (indentation is 4em times the level number)
def generate_html(node, level=0):
    html = f"<div style='margin-left:{level*4}em'>{node}</div>"
    return html

# Generate HTML for the entire test hierarchy
test_html = ""
for i, row in test_df.iterrows():
    test_html += generate_html(row['html'], int(row['level']))

# Save test.html
with open('test.html', 'w') as f:
    f.write(test_html)

Improvements

While the model performs well, there is always room for enhancement. Key areas for improvement include:

  • Using Relative Levels: Switching from absolute hierarchical levels to levels relative to the root would help with documents whose hierarchies run deeper than those seen during training.
  • Finetuning Feature Extraction: Refining the extraction of HTML tags and exploring additional features could further improve the model’s understanding of the document structure.
  • Optimizing Model Parameters: Adjusting the parameters of the XGBoost model and exploring other algorithms might lead to even better results.
  • Predicting Relative Relationships: Instead of predicting numerical levels, exploring methods to predict relative relationships between paragraphs could provide more nuanced and context-aware results.

Applications

The application of this product in the realm of Regulatory Legal Tech holds significant potential for streamlining compliance processes, enhancing document organization, and ultimately contributing to the efficiency of regulatory compliance within the industry. Here are some notable applications:

  • Regulatory Compliance Management:

The product can be employed to automatically organize and structure complex regulatory documents, ensuring a clear hierarchy of rules, standards, and compliance requirements.

  • Automated Compliance Audits:

By utilizing hierarchical sorting, companies can automate the creation of audit trails for compliance checks. This helps in easily tracking changes, ensuring adherence to the latest regulations, and expediting the auditing process.

  • Accelerated Document Review:

The hierarchical sorting system enables faster and more accurate document reviews. This is especially crucial in heavily regulated industries, where compliance timelines are often critical.

Conclusion

In conclusion, the developed system offers a powerful solution for organizing legal documents hierarchically. As we continue to refine and enhance our approach, we aim to provide Legal Tech companies with a tool that not only meets but exceeds their expectations in navigating the complex landscape of regulatory documents.


Christian Grech

Christian Grech is a Data Scientist working on the implementation of machine learning techniques for the operation of the European X-Ray Free Electron Laser.