Unleashing the Power of Large Language Models: A Retrieval-Augmented Generation Project

Christian Grech
6 min read · Jan 23, 2024


Retrieval-Augmented Generation has the potential to serve as a virtual assistant for large amounts of text data. Image generated using OpenDalleV1.1

Introduction

In the dynamic world of natural language processing, staying at the forefront of cutting-edge technologies is essential for innovation. In this blog post, we’ll delve into the development of a question-answering system based on LangChain and GPT-3.5, using scientific publications from arXiv.org as a rich source of information.

The code and data are publicly available on GitHub. This article will go through the main Python script main.py.

Technologies used in this project: Python, LangChain, GPT-3.5, APIs, Retrieval-Augmented Generation.

What is Retrieval-Augmented Generation?

Retrieval-augmented generation (RAG) is a natural language processing (NLP) technique that combines the strengths of information retrieval and generative models to enhance the quality and relevance of generated content. In this approach, a retriever collects data from a pre-existing knowledge base or dataset. The retrieved information is then used as context or input for a generative model, such as a language model, to produce coherent and contextually informed output. In this case, we will use data from arXiv.org as the dataset, and GPT-3.5 will use this data as context for questions and answers about the Llama-2 model.

Prerequisites

Make sure you have installed the required Python package:

  • langchain for orchestration

pip install langchain
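The rest of the script also relies on a few companion packages for the OpenAI models, the Chroma vector store, environment loading, JSON parsing, and the prompt hub. The exact set below is an assumption based on the imports used later in this article, so adjust it to your own setup:

# Assumed companion packages for the code in this article
pip install openai chromadb python-dotenv jq langchainhub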

Additionally, define your relevant environment variables in a .env file in your root directory. To obtain an OpenAI API Key, you need an OpenAI account and then “Create new secret key” under API keys.

OPENAI_API_KEY="<YOUR_OPENAI_API_KEY>"

Then, add the following lines to the main script to load the environment variables.

import dotenv

# Load the variables defined in .env (e.g. OPENAI_API_KEY) into the environment
dotenv.load_dotenv()

Data Collection and Preparation

The first step in our journey involves collecting data from the arXiv API and storing it in a structured format, either as JSON or TXT files. This ensures easy accessibility and manipulation of the data for subsequent processing. The fetch_papers function takes the output format as input, either “json” or “txt”. We limit the acquisition to the first 70 papers whose titles match the query “llama”.

import json
import logging
import urllib.request
import xml.etree.ElementTree as ET


def fetch_papers(fmt):
    """Fetch paper titles and summaries from the arXiv API and save them to data.json or data.txt."""
    url = 'http://export.arxiv.org/api/query?search_query=ti:llama&start=0&max_results=70'
    response = urllib.request.urlopen(url)
    data = response.read().decode('utf-8')
    root = ET.fromstring(data)
    papers_list = []

    # Iterate over the Atom feed entries and keep each paper's title and summary
    for entry in root.findall('{http://www.w3.org/2005/Atom}entry'):
        title = entry.find('{http://www.w3.org/2005/Atom}title').text
        summary = entry.find('{http://www.w3.org/2005/Atom}summary').text
        paper_info = f"Title: {title} Summary: {summary}"
        papers_list.append(paper_info)

    # Save the results in the requested format; fall back to an existing file if nothing was fetched
    if fmt == 'json':
        if papers_list:
            logging.info('Writing data to data.json')
            with open('data.json', 'w', encoding='utf-8') as f:
                json.dump({'content': papers_list}, f)
        else:
            logging.info('Using existing data.json file')
    else:
        if papers_list:
            logging.info('Writing data to data.txt')
            with open('data.txt', 'w', encoding='utf-8') as f:
                for item in papers_list:
                    f.write(item)
        else:
            logging.info('Using existing data.txt file')
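For completeness, here is a minimal way this function might be called; the choice of 'json' and the logging setup are illustrative assumptions rather than part of the original script:

import logging

logging.basicConfig(level=logging.INFO)

fmt = 'json'       # or 'txt'
fetch_papers(fmt)  # writes data.json (or data.txt) in the working directory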

Text Embeddings with Langchain

LangChain, a powerful tool, comes into play for text processing. To load the data, you can use one of LangChain’s many built-in DocumentLoaders. A Document is an object that holds text together with its metadata. Here we will use LangChain’s TextLoader for text files and JSONLoader for JSON files.

Because the Document, in its original state, is too long to fit into the LLM’s context window, we need to chunk it into smaller pieces. LangChain comes with many built-in text splitters for this purpose. For this simple example, we can use the RecursiveCharacterTextSplitter with a chunk_size of about 1500 and a chunk_overlap of 200 to preserve text continuity between the chunks.

To enable semantic search across the text chunks, we need to generate a vector embedding for each chunk and store the chunks together with their embeddings. To generate the vector embeddings, we can use the OpenAI embedding model and keep the results in a Chroma vector database.


from langchain.document_loaders import JSONLoader, TextLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

# Load data from file
if fmt == 'json':
    loader = JSONLoader(
        file_path="./data.json",
        jq_schema='.content[]',
        text_content=True)
    docs = loader.load()
else:
    loader = TextLoader("./data.txt")
    docs = loader.load()

# Split the documents into overlapping chunks and embed them into a Chroma vector store
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200, separators=["\n\n", "\n", " ", ""])
splits = text_splitter.split_documents(docs)

vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())

GPT-3.5-Turbo: The Workhorse

The heart of our project is the GPT-3.5-Turbo model, a state-of-the-art language model by OpenAI. In this project, the model is not retrained or fine-tuned; instead, the collected data is supplied to it as context at query time, enabling it to generate more contextually relevant responses. This grounding aligns the model’s answers with the specific domain of Llama-2, enhancing its performance for our targeted question-answering system.

RAG implementation with Langchain

Step 1: Retrieve

Once the vector database is created, you can use it as the retriever, which fetches the additional context based on the semantic similarity between the user query and the embedded chunks.

retriever = vectorstore.as_retriever()
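As a side note, the retriever can be tuned; for instance, you can control how many chunks are returned per query. The value of 4 below is only an illustrative assumption, not a setting from the original project:

# Optional: return the 4 most similar chunks per query (k=4 is an arbitrary example)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})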

Step 2: Augment

Next, to augment the prompt with the retrieved context, we need a prompt template. The template can be customized, but here we will pull a default RAG prompt from the LangChain hub.

from langchain import hub

prompt = hub.pull("rlm/rag-prompt")  # Using the default RAG prompt
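If you prefer to write your own template instead of pulling one from the hub, a minimal sketch could look like the following. The wording is purely an assumption for illustration and is not the content of rlm/rag-prompt:

from langchain.prompts import ChatPromptTemplate

# Hypothetical custom template; the retrieved chunks are passed in as {context}
prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n"
    "Context: {context}\n"
    "Question: {question}"
)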

Step 3: Generate

Finally, we construct the Retrieval-Augmented Generation (RAG) pipeline, chaining the retriever, the prompt template, and the language model (LLM). Once the RAG chain is established, you can invoke it.

    
from langchain.chat_models import ChatOpenAI
from langchain.schema import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)


def format_docs(docs):
    """Join the retrieved document chunks into a single context string."""
    return "\n".join(doc.page_content for doc in docs)


# Create the RAG chain: retrieve and format context, fill the prompt, query the LLM, parse the output
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Prompt the user for input, and break when the input is an empty string
while True:
    val = input("Enter your prompt: ")
    if val == '':
        break
    print(rag_chain.invoke(val))

Examples

To interact with the system after setting up the environment and activating the virtual environment, run the Python script. The script prompts users to input questions, and GPT-3.5-Turbo generates responses grounded in the retrieved context. Below are some interaction examples:

Enter your prompt: What can you find out about the model structure of Llama-2?
Llama-2 is a collection of pretrained and fine-tuned large language models ranging from 7 billion to 70 billion parameters. The fine-tuned Llama-2 models, called Llama 2-Chat, are optimized for dialogue use cases and outperform open-source chat models on most benchmarks. The model structure of Llama-2 is described in detail in the provided context.

Enter your prompt: For which tasks has Llama-2 already been used successfully? What are promising areas of application for Llama-2?
Llama-2 has been successfully used for dialogue use cases, as the fine-tuned Llama 2-Chat models outperform open-source chat models on most benchmarks. Llama-2 has also been successfully used for multitask analysis of financial news, including tasks such as analyzing a text from financial market perspectives, highlighting main points, summarizing a text, and extracting named entities with sentiments. Promising areas of application for Llama-2 include language modeling in underrepresented languages like Tamil, where it has shown significant performance improvements in text generation.

Enter your prompt: Name at least 5 domain-specific LLMs that have been created by fine-tuning Llama-2.
Lawyer LLaMA, Llama 2-Chat, fine-tuned Llama 2 GPT model for financial news analysis, LLaMAntino family of Italian LLMs.

Notes and Future Enhancements

The project accounts for potential issues with API access, defaulting to locally stored files when necessary. Additionally, there is room for improvement through tuning parameters such as chunk size and temperature to enhance overall performance.
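All of these knobs appear in the code above, so experimenting only requires changing a few values. The alternatives below are illustrative assumptions rather than tuned recommendations:

# Illustrative variations of the tunable parameters used earlier
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)  # smaller chunks
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.3)  # slightly less deterministic answers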

In conclusion, this retrieval-augmented generation project showcases the immense potential of LLMs for building powerful, domain-specific assistants. By combining data collection, text embeddings, and retrieval-augmented prompting, we unlock a world of possibilities for question-answering systems. As we continue to refine and explore these technologies, the integration of LLMs into various projects becomes an exciting prospect for future innovations in natural language processing.


Christian Grech

Christian Grech is a Software Engineer / Data Scientist working on the development of atom-sized quantum sensors in diamonds.