Unlock Faster LLM Serving with vLLM: A Step-by-Step Guide

Christian Grech
4 min read · Feb 25, 2024


Introduction

Large Language Models (LLMs) have revolutionized artificial intelligence, offering promising applications across various sectors. However, deploying these models for real-world use poses challenges, particularly in terms of speed and resource allocation. Enter vLLM, an innovative solution designed to enhance LLM serving while ensuring optimal utilization of computational resources. Developed at UC Berkeley, vLLM has already proven effective in production environments like Chatbot Arena and Vicuna Demo. At its core is PagedAttention, a mechanism that partitions the Key-Value (KV) cache into smaller, dynamically allocated blocks called ‘pages,’ minimizing memory waste and improving GPU utilization.

In this blog post, you’ll learn how to leverage vLLM for faster LLM serving using Python code. Full documentation and further examples are available in the official vLLM documentation.

In this article we will run four prompts through Mistral-7B with and without vLLM, and compare the execution time in both cases.

vLLM promises improved speed and efficiency when serving LLMs through the use of PagedAttention

Installing Dependencies

Begin by installing the required packages in your environment or Google Colab notebook. In my case I am using a V100 GPU for my runtime:

!pip install vllm accelerate transformers torch
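
Optionally, you can confirm which GPU your runtime was assigned before loading the model:

# Optional: show the GPU assigned to this runtime (a V100 in my case)
!nvidia-smi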

Text Generation Example

For this tutorial, let’s work with the Mistral-7B-Instruct-v0.2 model provided by Mistral AI. First, let’s import the necessary libraries and initialize a text-generation pipeline. This pipeline will serve as the baseline we compare against once we apply the vLLM framework.

from pathlib import Path
import time

import torch
from huggingface_hub import notebook_login
from transformers import PreTrainedTokenizer, pipeline
from tqdm import tqdm

notebook_login()

text_pipeline = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",
    torch_dtype=torch.float16,
    device_map="auto"
)

When prompted for a token, provide your Hugging Face access token, which you can create for free in the settings section of your Hugging Face account. Next, prepare a list of questions for the model to answer:

texts = [
    "What are the pros and cons of ChatGPT vs Open Source LLMs?",
    "Write an email to a client to offer a subscription for a paper supply for 1 year.",
    "I have 10,000 USD for investment. How should one invest it during a period of high inflation and high mortgage rates?",
    "Write a function in python that calculates the square of a sum of two numbers"
]

Now, define a helper function create_prompt that applies the tokenizer’s chat template to format each question correctly for the model. Note that Mistral’s chat template only accepts alternating user/assistant roles, so the system-style instruction is folded into the user message:

def create_prompt(text: str, tokenizer: PreTrainedTokenizer) -> str:
    # Mistral's template has no system role, so prepend the instruction
    # to the user's question inside a single user message.
    messages = [
        {
            "role": "user",
            "content": (
                "You are a friendly chatbot that always replies as a "
                "superhuman intelligence AI.\n\n" + text
            )
        }
    ]
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

Generate prompts using the above function:

prompts = [create_prompt(text, text_pipeline.tokenizer) for text in texts]
for prompt in prompts:
    print(prompt)
    print("\n\n")

Save the prepared prompts to be able to load at a later stage:

folder_path = Path("prompts")
folder_path.mkdir(parents=True, exist_ok=True)
for i, prompt in enumerate(prompts, start=1):
    file_path = folder_path / f"prompt_{i}.txt"
    file_path.write_text(prompt)

The prompts are now saved as text files, which we can load later when running inference with the original pipeline as well as with vLLM.
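
If you restart the runtime later, the saved prompts can be reloaded with a one-liner like this (assuming the prompts folder created above):

# Reload the saved prompts from disk in their original order
prompts = [p.read_text() for p in sorted(Path("prompts").glob("prompt_*.txt"))]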

Introducing vLLM

Before proceeding further, it is worth understanding what vLLM offers. According to its developers at UC Berkeley, vLLM promises:

* Fundamental improvements in LLM deployment for diverse industries
* Up to 24x higher throughput than Hugging Face Transformers without modifying the underlying model architecture
* Based on PagedAttention, an advanced attention management technique. By intelligently partitioning Key-Value caches into smaller, dynamic blocks called ‘pages,’ PagedAttention minimizes memory waste and optimizes GPU utilization.
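
To build intuition for the paging idea, here is a deliberately simplified sketch of block-based KV-cache bookkeeping. This is a toy illustration of the concept, not vLLM’s actual implementation; the block size and pool size are made-up values:

# Toy illustration: each sequence gets KV-cache space in fixed-size blocks
# ("pages") only as it grows, instead of one large contiguous buffer.
BLOCK_SIZE = 16                      # tokens per block (made-up value)
free_blocks = list(range(1024))      # pool of physical block ids
block_tables = {}                    # sequence id -> list of block ids
seq_lengths = {}                     # sequence id -> number of cached tokens

def append_token(seq_id: int) -> None:
    length = seq_lengths.get(seq_id, 0)
    if length % BLOCK_SIZE == 0:     # last block is full (or none allocated yet)
        block_tables.setdefault(seq_id, []).append(free_blocks.pop())
    seq_lengths[seq_id] = length + 1

def free_sequence(seq_id: int) -> None:
    # When a request finishes, its blocks return straight to the pool.
    free_blocks.extend(block_tables.pop(seq_id, []))
    seq_lengths.pop(seq_id, None)

# Two sequences of different lengths share the same physical pool:
for _ in range(40):
    append_token(seq_id=0)           # 40 tokens -> 3 blocks
for _ in range(5):
    append_token(seq_id=1)           # 5 tokens  -> 1 block
print(block_tables)                  # per-sequence "page tables"

Because blocks are allocated on demand, at most one partially filled block per sequence is wasted, instead of reserving the full context length up front.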

When tested against Hugging Face Transformers and Hugging Face Text Generation Inference, vLLM demonstrated superior performance:

* Single-output completion: 14x to 24x higher throughput than Hugging Face Transformers; 2.2x to 2.5x higher throughput than Hugging Face Text Generation Inference
* Three parallel output completions: 8.5x to 15x higher throughput than Hugging Face Transformers; 3.3x to 3.5x higher throughput than Hugging Face Text Generation Inference

Using the original LLM for Inference

Let’s run the pipeline for the original LLM and save the responses for the four prompts. The execution duration is measured using the time module:

start = time.time()
hf_responses = []
for prompt in prompts:
    # The text-generation pipeline returns a list of dicts with "generated_text"
    output = text_pipeline(prompt, max_length=256, do_sample=True, temperature=0.1)
    hf_responses.append(output[0]["generated_text"])
duration_hf = time.time() - start

In this case, it took 195 seconds to generate responses for all four prompts.

Using vLLM for Inference

Let’s integrate vLLM into our current workflow and analyze the differences. Start by loading vLLM:

from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2", dtype=torch.float16, max_model_len=2700, enforce_eager=True)

I decreased the max_model_len parameter to 2700 because the Colab GPU cannot allocate enough KV-cache blocks for the model’s full context length when initializing the model. Next, sampling parameters matching the earlier run are defined, the responses are generated, and the execution time is measured:

start = time.time()
sampling_params = SamplingParams(temperature=0.1, max_tokens=256)
vllm_responses = []
for prompt in prompts:
    output = llm.generate([prompt], sampling_params)[0]
    vllm_responses.append(output.outputs[0].text)
duration_vllm = time.time() - start

The duration in this case was 23 seconds, an impressive 88% decrease compared to the original implementation.
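
To summarize the comparison in the notebook, you can print both timings and the resulting speed-up:

print(f"Transformers pipeline: {duration_hf:.1f} s")
print(f"vLLM:                  {duration_vllm:.1f} s")
print(f"Speed-up:              {duration_hf / duration_vllm:.1f}x")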

Conclusion

By integrating vLLM into your LLM serving infrastructure, you can expect notable performance gains, enabling quicker processing and lower resource consumption. Adopting vLLM also improves scalability and reduces cost, especially when serving large language models in compute-intensive workloads. Follow this guide step by step to improve your LLM serving capabilities!


Written by Christian Grech

Christian Grech is a Software Engineer / Data Scientist working on the development of atom-sized quantum sensors in diamonds.
