Intro
AI Studio offers a wide variety of integrated Text-to-Speech (TTS) languages and voice styles, enhanced by Speech Synthesis Markup Language (SSML), for creating human-like utterances. There are, however, many more TTS options out there that you may want to use instead. AI Studio gives you the flexibility to connect with any third party that offers accessible REST API endpoints. In this post, we'll demonstrate how to connect AI Studio with Deepgram.
An engaging, humanlike TTS experience is critical for voice agents because it fosters a natural and relatable interaction, making users feel understood and valued. This level of engagement enhances overall customer satisfaction and loyalty, as it reduces frustration and improves communication efficiency.
This blog post will cover the use of statically generated speech audio files as part of a Speak or Collect Input node. An upcoming blog post will cover dynamically generated speech audio files for complete customization of the Agent <> Human interaction.
Project Overview
In this blog post, we’ll discuss how to use Vonage AI Studio with a third-party speech synthesis provider, illustrated with a toy Electronic Health Record (EHR) application. We will demonstrate an inbound call from a patient to a physician’s office. The Studio application is built to collect information about the user via the Calling Line ID (CLID), or phone number, of the inbound caller. The Studio agent will use a webhook to collect information about the patient, such as the patient’s first and last name, the patient identification number, and whether or not there are any scheduled appointments. The user will be greeted using the Deepgram Aura TTS model.
Prerequisites
Vonage AI Studio Account
DT API Account
To complete this tutorial, you will need a DT API account. If you don’t have one already, you can sign up today and start building with free credit. Once you have an account, you can find your API Key and API Secret at the top of the DT API Dashboard.
How to Create a Voice Agent
Step 1: Create Your Agent
From the AI Studio Dashboard, select Create Agent. Then since this is a voice use case, select “Telephony”.
Step 2: Configure Your Agent
Agent configuration is a required first step and more information can be found here. In our case, we are providing some basic agent constructs including localization, assignment of the agent to a specific API key or subaccount, and the language for the agent.
Please note that the Voice/Telephony agent does require that you choose a voice. However, this configuration construct is not pertinent because we will be using third-party (Deepgram) voices. Your account will not be charged for any Vonage TTS usage in this case, as long as you are using the approaches mentioned here.
Step 3: Choose Your Template
Choose the “Caller Identification” template. We will be using the inbound Calling Line ID (CLID) of the caller to provide a customized greeting. This template gives you a great starting point.
Step 4: Choose Inbound Call
Choose the Inbound Call option. Our voice agent will respond to an incoming call, but this does not constrain the agent to inbound calling only. For example, if you want to send a follow-up email after the engagement with the agent is complete, you can accomplish this as part of the Inbound Call flow. Learn more about AI Studio Conversation Events.
Voilà, you are ready to start building your Voice agent with Deepgram’s customized voices. Let’s go!
How to Integrate Text-To-Speech Audio Files with AI Studio
It’s important to understand that in a telephony agent, two nodes can be used for customized TTS: the Speak node and the Collect Input node (you can read more about each at the links). We will focus on the Collect Input node for this blog post. In my agent, as described, the caller passes through several nodes before reaching the Collect Input node. See below:
The caller interacts with the following nodes:
The Start node.
The patient_webhook node is used to collect information about the caller from the backend database in the EHR (a minimal sketch of such an endpoint appears after this list).
A conditional node called Existing_patient is used to determine what flow to send the user into next. In this case, since the webhook node matches the inbound caller to an existing patient (via the CALLER_PHONE_NUMBER predefined parameter), the caller is pushed into a flow that provides self-service automation for appointment creation, update, and cancellation, as well as self-service prescription refill requests.
The Existing_patient Collect Input node is used to play the predefined greeting that was sourced from Deepgram.
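For context, here is a minimal sketch of what the backend behind the patient_webhook node might look like. This is a hypothetical Flask handler; the route, field names, and lookup logic are assumptions for illustration only, not the actual EHR integration.
from flask import Flask, request, jsonify

app = Flask(__name__)

# Stand-in for the EHR database, keyed by the caller's phone number (CLID)
PATIENTS = {
    "14155550100": {
        "first_name": "Jamie",
        "last_name": "Rivera",
        "patient_id": "P-10042",
        "has_scheduled_appointments": True,
    }
}

@app.route("/patient-lookup", methods=["POST"])
def patient_lookup():
    # AI Studio can pass the CALLER_PHONE_NUMBER predefined parameter in the webhook body
    caller = request.json.get("CALLER_PHONE_NUMBER", "")
    patient = PATIENTS.get(caller)
    if patient is None:
        return jsonify({"existing_patient": False})
    return jsonify({"existing_patient": True, **patient})

if __name__ == "__main__":
    app.run(port=5000)
The Existing_patient conditional node can then branch on a response field such as existing_patient.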
Using this approach, conversational designers can create speech audio files with any provider’s API endpoint and use the returned file in a Collect Input node.
How to Create Custom Text-To-Speech Audio Files with Deepgram
You can programmatically generate synthesized audio using a TTS provider’s API (e.g., Deepgram) or via their user interface. The audio is returned to your application as a binary stream, which you can save to a file with an .mp3 extension. Below is a sample application that creates audio files through Deepgram’s Aura endpoint.
Learn how to get started with Deepgram's Aura Text-to-Speech API.
import requests
import os
from os.path import join, dirname
from dotenv import load_dotenv
from speech_patterns import SPEECH_PATTERNS

dotenv_path = join(dirname(__file__), ".env")
load_dotenv(dotenv_path)

deepgram_password = os.environ.get('deepgram_password')
deepgram_url = "https://api.deepgram.com/v1/speak?model=aura-asteria-en"

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Token {deepgram_password}"
}

def deepgram_tts():
    '''
    Sends text strings to Deepgram for speech synthesis and saves the returned MP3 files.
    '''
    print("Starting the process")
    for key, text in SPEECH_PATTERNS.items():
        print(f"Processing key: {key}")
        payload = {"text": text}
        response = requests.post(deepgram_url, headers=headers, json=payload)
        print(response.headers)
        if response.status_code == 200:
            filename = f"{key}.mp3"
            # Make sure to use `response.content` for binary content!
            with open(filename, 'wb') as file:
                file.write(response.content)
            print(f"File saved successfully as {filename}.")
        else:
            print(f"Error: {response.status_code} - {response.text}")

if __name__ == "__main__":
    deepgram_tts()
Note: in the above code, the import from speech_patterns import SPEECH_PATTERNS is a reference to an associated file that has all of my desired speech patterns that will be sent to Deepgram, in this format:
SPEECH_PATTERNS = {
    "system_greeting": "Hello, and welcome to Stonebridge Dermatology and Aesthetics",
    "patient_query": "Hello it's nice to have you back with us...What would you like to accomplish today?",
    "repeat_query": "I did not get that...Would you mind stating your need again, or simply use your phone to enter the appropriate digit?",
    "type_of_appointment": "Great! What type of appointment would you like to schedule? For a physician, please either say physician or doctor, or what you are concerned about. For example, you could say skin cancer. For aesthetician services, please either say aesthetician or consultant, or the name of a treatment you have in mind.",
    "request_for_date": "Understood, Do you have a preferred date in mind?",
    "appointment_time": "Now please tell me an appointment time that works for you.",
    "appointment_coordination": "Great, thank you! Please wait for a few moments while I check the system to see if that is available.",
    "appointment_confirmation": "Perfect. I have your new appointment time. I will send you a follow-up email with the information. We'll be glad to see you at your appointment. Now, is there anything else I can help you with?"
}
Once you have run the application above, you will end up with corresponding .mp3 files that you can upload to AI Studio.
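If you would like a quick sanity check before uploading, a small sketch like the following lists the generated files and their sizes (assuming the files were written to the working directory, as in the script above):
from pathlib import Path

# List the generated .mp3 files as a quick sanity check before uploading
for mp3 in sorted(Path(".").glob("*.mp3")):
    size_kb = mp3.stat().st_size / 1024
    print(f"{mp3.name}: {size_kb:.1f} KB")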
How to Add Audio Files to AI Studio
Step 1: Save the User’s Input to a Parameter
Navigate to the Collect Input node. Since I have built a self-service agent for patients of doctors and specialists to create and change appointments, I will title this node appropriately: “(Existing Patient) Set appointment date and time.” This node will gather the user information needed to set the appointment. Specifically, I am collecting the user’s preferred date and time for an appointment and assigning it to a parameter called APPT_DATE. This parameter will be used further downstream in the flow.
Step 2: Upload the Audio Files
Select the “Audio” radio button. Once you do this, the view changes to a voice prompt menu.
Under the Prompt section, choose “Recording”. Now, you will add the file from your local storage to the AI Studio integrated file storage, or use an existing audio file that has already been stored on AI Studio. In my case, I already have several files that I’ve uploaded, but you can use the +Add Recording button to add a new recording.
Once you have completed that step, the modal shows the recording you have selected to be played for the user, as shown below. Additionally, a handy feature is that you can see the actual transcription of the file (provided by Vonage AI). This UI visualization helps ensure that you map the audio files appropriately to your flow nodes.
Step 3: Configure Other Nodes
You can then use the other features of this node as you normally would. For example:
Set the number of retries (the number of times the recording is played)
Set a retry prompt (for example, if you’d like to have the agent enunciate “I’m sorry, I did not hear that” as the secondary prompt).
Caller’s Response Input: Allows the designation of speech response, DTMF response, or both.
Additionally, there are some natural language refinements you might want to add so that the Agent is “trained” on your customized use-case context keywords, etc.
Be sure to hit the “Save & Exit” button, and voila! You now have customized TTS to further personalize your Agent.
Conclusion
Today, many TTS providers incorporate LLMs into their speech synthesis pipelines, allowing for more natural and expressive speech. Previously, TTS was primarily driven by methods like concatenative synthesis, which stitched together pre-recorded speech snippets, and parametric synthesis, which used statistical models such as Hidden Markov Models (HMMs). These earlier techniques often resulted in more robotic and less nuanced voices.
LLMs enhance speech synthesis by leveraging vast amounts of data and advanced neural network architectures to understand and generate human-like speech patterns. They analyze text for context, sentiment, and natural speech rhythms, allowing for more accurate intonation, stress, and emotion in the generated speech. This results in voices that are not only clearer and more pleasant to listen to but also capable of conveying subtle emotions and natural conversational dynamics, significantly improving the overall user experience.
In this blog post, we explored how to use third-party Text-to-Speech (TTS) providers in conjunction with AI Studio to fully customize the user experience. We always welcome community involvement. Please feel free to join us on GitHub and the Vonage Community Slack.
Tim is a Healthcare Customer Solutions Architect and a passionate AI/ML enthusiast, particularly in the realm of Natural Language Processing/Understanding and Knowledge Graphs. Outside of work, Tim enjoys global travel and AI research, and is a competitive ballroom dancer.