Reducing RAG Pipeline Latency for Real-Time Voice Conversations
Published on November 1, 2024

TL;DR

This article explores methods to reduce latency in Retrieval Augmented Generation (RAG) systems, particularly for real-time voice interactions in customer service and support applications.


Introduction

With the adoption of Retrieval Augmented Generation (RAG), many organizations find it easier to access the generative capabilities of Large Language Models (LLMs) without resorting to expensive or complex modes of training, such as pre-training, fine-tuning, RLHF, or adapters. RAG enables organizations to build knowledge bases containing vast amounts of information and retrieve only the relevant parts, giving LLMs the context they need to produce comprehensive answers without baking that information into the model itself.

What is RAG?

In RAG, a model retrieves relevant documents or information from a large database and then generates a response using that retrieved information. RAG typically uses a two-part system: a retriever and a generator. The retriever searches for relevant documents or passages based on a query, often using methods like sparse, dense, and hybrid retrieval, with semantic (vector-based) search being commonly used for fast and contextually accurate results. The generator (often a model like GPT) synthesizes the retrieved information to generate the final response. This enables the model to provide more accurate and factually correct answers.
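To make the two-part structure concrete, here is a minimal sketch of the retrieve-then-generate flow. It is illustrative only: embed() and generate() are stand-ins for a real embedding model and LLM call, and the two-document corpus is invented.

```python
# Minimal retrieve-then-generate sketch. embed() and generate() are
# illustrative stand-ins for a real embedding model and LLM, not a
# specific vendor API.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: pseudo-embedding derived from the text hash. In practice,
    # call an embedding model (e.g. sentence-transformers or a hosted API).
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

def generate(prompt: str) -> str:
    # Placeholder for the LLM call.
    return f"[LLM answer grounded in]\n{prompt}"

corpus = [
    "Refunds are processed within 5 business days.",
    "Support is available 24/7 via chat and phone.",
]
doc_vectors = np.stack([embed(d) for d in corpus])

def answer(query: str, k: int = 1) -> str:
    scores = doc_vectors @ embed(query)          # cosine similarity (unit vectors)
    top = [corpus[i] for i in np.argsort(scores)[::-1][:k]]
    context = "\n".join(top)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate(prompt)

print(answer("How long do refunds take?"))
```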

Two of the most popular use cases for RAG are:

  1. Customer Service: Exposing a business's internal and external information in a conversational format through virtual agents, assistants, or chatbots, and automating FAQ and Q&A modules, resulting in a high containment rate.

  2. Enterprise Search: Making private organizational information searchable and providing more precise answers by connecting to data sources such as Google Drive, Confluence, JIRA, Zendesk, websites, static files, etc.

Voice and Acceptable Conversational Latency

When using RAG for customer service, information retrieval or conversations with LLMs typically occur on voice or digital channels (such as the web, WhatsApp, SMS, etc.). Latency is a critical factor in real-time communications, as it can significantly impact the quality and effectiveness of the conversation. In real-time communication, latency refers to the delay between when a sound is produced by the speaker and when it is heard by the listener.

The voice channel has a low tolerance for latency, and the ITU-T (International Telecommunication Union's Telecommunication Standardization Sector) recommends a one-way latency of 100ms for interactive tasks and 150ms for conversational use cases involving humans on both ends. Digital channels generally have higher tolerance levels compared to voice channels. This article specifically analyzes voice channel latency when using RAG for human-to-virtual assistant/agent/chatbot interactions.

Achieving a one-way latency of 150ms is almost impossible with a RAG-like architecture where several components are involved in voice processing. However, near real-time experiences can be achieved with the correct optimization of voice processing logic and AI models. The end-users of virtual assistants/agents/chatbots are more delay-tolerant compared to human-to-human conversations, which are highly sensitive to latency.

End-to-End View of a Voice Call

End-to-end view of a voice call

A voice call pipeline with RAG can be divided into various components. It typically begins with a conversation initiated on a web application or phone by the end-user, with network latency accumulating until the voice reaches the service. Once the voice reaches the service, the first step involves converting it to text using a Speech to Text (STT) service. The transcribed text is then used for information retrieval, and the retrieved context is passed to an LLM to generate a response. This generated response is then sent to a Text to Speech (TTS) service, which converts the text back into voice. Finally, the voice is transmitted back to the web application or phone via the network to be played back to the end-user.

| Component/Service | Function of the Component/Service |
| --- | --- |
| Device/Application/Browser | Captures and encodes audio |
| Uplink Network | Sends the audio over the network |
| Speech to Text (STT) | Converts speech (audio) to text |
| Information Retrieval | Organizes and searches the knowledge base |
| LLM Processing | Generates a response based on context and query |
| Text to Speech (TTS) | Converts text to speech (audio) |
| Downlink Network | Receives the audio over the network |
| Application/Browser/Phone | Receives and decodes the audio |
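Before optimizing, it helps to measure where each conversational turn actually spends its time. The sketch below wraps each stage of the pipeline in a timer; the stage functions and their delays are invented stand-ins so the example runs on its own. Swap in your real STT, retrieval, LLM, and TTS calls to get actual numbers.

```python
# Per-stage latency instrumentation for the voice pipeline (sketch).
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def timed(stage: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = (time.perf_counter() - start) * 1000  # milliseconds

# Stubs with simulated delays; replace with real service calls.
def stt(audio: bytes) -> str:
    time.sleep(0.40)
    return "where is my refund"

def retrieve(text: str) -> str:
    time.sleep(0.15)
    return "Refunds are processed within 5 business days."

def llm(text: str, context: str) -> str:
    time.sleep(0.80)
    return "Your refund will arrive within 5 business days."

def tts(reply: str) -> bytes:
    time.sleep(0.30)
    return b"audio"

def handle_turn(audio: bytes) -> bytes:
    with timed("stt"):
        text = stt(audio)
    with timed("retrieval"):
        context = retrieve(text)
    with timed("llm"):
        reply = llm(text, context)
    with timed("tts"):
        speech = tts(reply)
    print({stage: f"{ms:.0f} ms" for stage, ms in timings.items()})
    return speech

handle_turn(b"...pcm audio...")
```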

Device/Application/Browser

A conversation from the end-user usually starts with a device/application/browser connected to the Internet. The end-user device/application/browser plays a crucial role in latency as it captures the audio and performs the encoding. It also handles the response by receiving the audio and decoding it.

You typically have three options for end devices:

  1. PSTN Phone: If end-users dial physical numbers instead of using the calling function within the application or web browser, the audio needs to be sent to a PSTN gateway before being sent to the Speech to Text or ASR service for transcription.

  2. Mobile Applications: Mobile apps can stream audio over WebSockets or WebRTC. However, WebSockets, which are built on top of TCP, may suffer from performance issues in high-latency networks.

  3. Web Browser: Web browsers are now equipped with WebRTC for real-time communication, allowing low-latency media streaming built on top of UDP.

Recommendation: Low latency is best achieved if the application is based on WebRTC.

The latency conditions of uplink and downlink networks are varied and unpredictable for end-users, making it challenging to offer specific recommendations here.

Speech to Text Service

This step is crucial as the service converts speech to text for further processing. It is important to achieve high accuracy with low latency, as the accuracy of this step affects the overall efficiency of the response pipeline. Various providers, such as Speechmatics, Deepgram, Google ASR, and AWS Transcribe, offer Speech to Text services. Two modes are available:

  1. Pre-recorded (batch) Processing: A recorded audio buffer or file is sent to the Speech to Text provider in chunks. This approach typically involves waiting for silence before processing, which adds latency.

  2. Real-time Streaming: End-user utterances are sent to the STT service as they are received, allowing for faster transcription but potentially introducing accuracy issues. Efficient Voice Activity Detection (VAD) is essential in real-time streaming.

Recommendation: Low latency is better achieved with real-time streaming than with batch processing in Speech to Text services.
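As a rough illustration of the streaming approach, the sketch below sends audio chunks to a streaming STT endpoint over a WebSocket, gated by a simple energy-based VAD so that silence is never transmitted. The endpoint URL, message format, and threshold are assumptions; each provider (Deepgram, Speechmatics, and others) defines its own streaming protocol and SDK.

```python
# Sketch: stream 16-bit PCM audio chunks to a streaming STT endpoint over a
# WebSocket, gated by a simple energy-based VAD. The URL and message format
# are hypothetical, not a specific provider's API.
import asyncio
import math
import struct
import websockets

STT_URL = "wss://stt.example.com/v1/stream"  # hypothetical endpoint
VAD_RMS_THRESHOLD = 500.0                    # tune per microphone/codec

def rms(chunk: bytes) -> float:
    # Root-mean-square energy of little-endian 16-bit PCM samples
    # (assumes chunks are 2-byte aligned).
    samples = struct.unpack(f"<{len(chunk) // 2}h", chunk)
    return math.sqrt(sum(s * s for s in samples) / max(len(samples), 1))

async def stream_audio(chunks) -> None:
    # chunks: an async iterator of raw PCM audio chunks (e.g. from a mic).
    async with websockets.connect(STT_URL) as ws:

        async def send() -> None:
            async for chunk in chunks:
                if rms(chunk) > VAD_RMS_THRESHOLD:  # crude VAD gate
                    await ws.send(chunk)            # forward speech immediately

        async def receive() -> None:
            async for message in ws:                # partial transcripts arrive
                print("partial transcript:", message)

        await asyncio.gather(send(), receive())
```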

Information Retrieval 

The search methods used in RAG are crucial for retrieving relevant, high-quality documents or passages that the generative model can use to produce informative responses. The choice of retrieval method (sparse, dense, or hybrid) depends on factors like the accuracy of the answers, latency, and computational constraints. Hybrid retrieval methods, which blend sparse and dense techniques, are gaining popularity for their ability to combine precision with recall, making them highly effective in RAG systems.

Vector search, also known as semantic search, typically uses embeddings (vector representations) of text and then searches over those embeddings to find the most similar results. Vector search latency can be reduced by using shorter (but still contextually accurate) embedding vectors and quicker rescoring algorithms to filter out irrelevant context. Implementing caching and local vector databases can also improve search latency.

Recommendation: For real-time conversations, semantic search is generally a better choice due to its speed and efficiency. It allows for near-instantaneous retrieval, which is essential for maintaining the flow of live interactions.
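The sketch below illustrates two of the latency levers mentioned above: truncating embedding vectors to fewer dimensions and caching repeated queries. It assumes an embedding model whose vectors tolerate truncation (as Matryoshka-style models do); embed() and the documents are illustrative stand-ins.

```python
# Sketch: two latency levers for vector search: shorter embedding vectors
# and a query-result cache. Assumes the embedding model tolerates
# truncation (Matryoshka-style models do).
from functools import lru_cache
import numpy as np

SHORT_DIM = 256  # search over 256 dims instead of the model's full width

def embed(text: str) -> np.ndarray:
    # Placeholder for a real embedding model returning a 768-dim vector.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(768)

def truncate(v: np.ndarray, dim: int = SHORT_DIM) -> np.ndarray:
    v = v[:dim]
    return v / np.linalg.norm(v)            # re-normalize after truncation

documents = [
    "Refunds are processed within 5 business days.",
    "Support is available 24/7 via chat and phone.",
]
index = np.stack([truncate(embed(d)) for d in documents])  # built at ingestion

@lru_cache(maxsize=4096)                    # repeated questions skip the search
def search(query: str, k: int = 1) -> tuple[str, ...]:
    scores = index @ truncate(embed(query))  # dot product on unit vectors
    return tuple(documents[i] for i in np.argsort(scores)[::-1][:k])

print(search("How long do refunds take?"))
```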

LLM Processing 

The generation part of the RAG system uses an LLM to produce coherent and contextually relevant responses based on the retrieved information and user query. Time to First Token (TTFT) is an important performance metric. It refers to the time it takes for a model to generate the first token of output after receiving a prompt. This metric is crucial in interactive applications such as chatbots, virtual assistants, and real-time content generation.

In batch processing models like Google Chat Bison, the first token effectively arrives only once the entire response has been generated, which causes delays since the Text to Speech (TTS) service must wait for the full response before processing it. In contrast, streaming models like Gemini 1.5 Pro return tokens as soon as they are generated, so TTFT reflects only the wait for the first token. This means the TTS service can begin processing and delivering parts of the response as they become available, significantly enhancing the user experience by reducing perceived latency.

The future of RAG may see advancements where the retriever component is minimized or eliminated entirely, achieved through more sophisticated models with larger context windows, such as the Google Vertex Gemini family of models. Several other factors also affect LLM response generation, and optimizations like reducing prompt size or using context caching can reduce latency. In some cases, you might opt for a Small Language Model (SLM) instead of an LLM, trading some accuracy for faster response times. Providers like FireworksAI focus on optimizing models specifically for latency.

Recommendation: LLM output in streaming mode can significantly reduce latency, allowing TTS to play back the generated response sooner. Also, consider faster models like Google Gemini Flash 8B, which offer lower latencies.
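To see what streaming buys you, the sketch below measures TTFT on a token stream and collects tokens as they arrive; in a real pipeline each token (or sentence) would be forwarded to TTS immediately. stream_completion() is a hypothetical stand-in for any provider's streaming API.

```python
# Sketch: measure Time to First Token (TTFT) on a streaming LLM call.
# stream_completion() stands in for a provider's streaming iterator
# (server-sent events, gRPC streaming, etc.).
import time

def stream_completion(prompt: str):
    # Placeholder: yields tokens with a simulated generation delay.
    for token in ["Your ", "refund ", "is ", "on ", "its ", "way."]:
        time.sleep(0.05)
        yield token

def respond(prompt: str) -> str:
    start = time.perf_counter()
    first_token_at = None
    pieces = []
    for token in stream_completion(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()
            print(f"TTFT: {(first_token_at - start) * 1000:.0f} ms")
        pieces.append(token)   # in a real pipeline: feed TTS here, not a list
    return "".join(pieces)

print(respond("Where is my refund?"))
```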

Text to Speech Service

This step is crucial, as the service converts the text into speech to be played back or streamed. Providers like Amazon Polly, Google TTS, and Eleven Labs offer Text to Speech services, typically in two modes:

  1. Pre-recorded (batch) Processing: Converts large amounts of text into speech in one asynchronous operation. This is useful for applications where real-time processing is not critical, such as generating audio for e-books, podcasts, or pre-recorded announcements.

  2. Real-time Streaming: Essential for applications requiring immediate speech output, such as virtual assistants, interactive voice response (IVR) systems, and real-time communication tools.

Recommendation: Real-time streaming is preferable for achieving low latency compared to batch processing.
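One common pattern for wiring a streaming LLM to a streaming TTS voice is sentence chunking: buffer the token stream until a sentence boundary, then synthesize that sentence immediately rather than waiting for the full reply. A minimal sketch, with synthesize() as a hypothetical stand-in for a provider's streaming TTS call:

```python
# Sketch: buffer an LLM token stream into sentence-sized chunks and hand
# each finished sentence to TTS right away, instead of waiting for the
# full reply. synthesize() is a hypothetical streaming-TTS call.
SENTENCE_ENDS = (".", "!", "?")

def synthesize(sentence: str) -> None:
    # Placeholder for a streaming TTS request (Amazon Polly, Google TTS,
    # Eleven Labs, and others each expose their own streaming interface).
    print("TTS ->", sentence)

def speak_streaming(token_stream) -> None:
    buffer = ""
    for token in token_stream:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_ENDS):
            synthesize(buffer.strip())   # start audio for this sentence now
            buffer = ""
    if buffer.strip():
        synthesize(buffer.strip())       # flush any trailing text

speak_streaming(iter(["Your refund is on its way. ", "Anything ", "else?"]))
```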

Conclusion 

Latency in real-time voice applications using RAG is influenced by many factors, and various optimization techniques can help control it. The table below summarizes the expected maximum two-way latency in different phases of voice communication. With optimizations, you can expect significant improvements in latency. The table excludes the latencies introduced by the device and uplink/downlink networks.

| Module | STT | Semantic Search | LLM | TTS | Total (Max) |
| --- | --- | --- | --- | --- | --- |
| Time to first audio (before optimization) | < 1 sec | 300 ms (1) | 1.4 sec (2) | < 1 sec | < 3.7 sec |
| Expected time to first audio (after optimization) | < 500 ms | 150-200 ms | < 1 sec (3) | < 500 ms | < 2.15 sec |
  1. Semantic Search - Vector database is based on MongoDB

  2. LLM - without streaming 

  3. LLM - with streaming 

By implementing these optimization strategies, you can drastically reduce latency in voice applications using the RAG pipeline, ensuring smoother and more efficient real-time conversations. 

Get In Touch

Check out our Vonage AI Studio for building low latency voice conversations or get in touch through Vonage Community Slack or message us on X.

Binoy Chemmagate, Manager at Vonage

Binoy Chemmagate is a product lead for Vonage AI services with over 10 years in the ICT industry, specialising in generative AI APIs and low-code conversational AI platforms. Based in London, he enjoys mentoring future product managers in his free time.
