Building a Real Time .NET Transcription Service
Published on May 18, 2021

Building speech-to-text transcription services has never been easier. In this demo you will build an extremely simple, yet powerful, real-time transcription service in ASP.NET using the Vonage .NET SDK and Microsoft Azure's Speech SDK.

Prerequisites

  • Visual Studio 2019 version 16.3 or higher

  • An Azure Account

  • Optional: Ngrok for test deployment

Vonage API Account

To complete this tutorial, you will need a Vonage API account. If you don’t have one already, you can sign up today and start building with free credit. Once you have an account, you can find your API Key and API Secret at the top of the Vonage API Dashboard.

This tutorial also uses a virtual phone number. To purchase one, go to Numbers > Buy Numbers and search for one that meets your needs.

Create Azure Cognitive Services Resource

  • Go to your Azure Dashboard

  • Open the Hamburger menu and select "Create Resource"

  • Search for Speech and Create a Speech Resource:

(Image: Speech Resource page)

  • Fill out the create form. For demonstrative purposes, you can use the following inputs

    • Name: TranscriptionTest

    • Subscription: pay-as-you-go

    • Location: East US

    • Pricing Tier: F0

    • Resource Group: Transcription

This will take some time to spin up. Once it's deployed you are going to navigate to the Quick Start Section to gather your access keys. You are looking for the Key1 value, so search for the highlighted section:

(Image: Quick Start Section)

Save this key value offline somewhere for the moment.

Building the Real Time Transcription Service

Project Setup

Open Visual Studio and create a new ASP.NET Core Web Application project. Name it, for example, "TranscriptionBlogPostDemo".

Now, you'll create an MVC web application for this demonstration in ASP.NET Core 3.0.

(Image: Web Application Type Selection)

With this created, add the following NuGet packages to the project:

  • Nexmo.Csharp.Client

  • Newtonsoft.Json

  • Microsoft.CognitiveServices.Speech

Transcription Engine

With these packages installed, create a new class called TranscriptionEngine.

First, define some constants for the benefit of both the speech SDK and for managing the WebSocket buffer.

const int SAMPLES_PER_SECOND = 8000;
const int BITS_PER_SAMPLE = 16;
const int NUMBER_OF_CHANNELS = 1;
const int BUFFER_SIZE = 160 * 2;
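
The BUFFER_SIZE value of 160 * 2 follows from the audio format: 160 samples is 20 ms of audio at 8000 samples per second, and each 16-bit sample occupies 2 bytes. A minimal sketch of that arithmetic is shown here; the AudioFrameMath helper and its parameter names are illustrative assumptions and are not part of either SDK.

```csharp
public static class AudioFrameMath
{
    // Bytes needed to hold one frame of uncompressed PCM audio.
    // Hypothetical helper, for illustration only.
    public static int BytesPerFrame(int samplesPerSecond, int bitsPerSample, int channels, int frameMilliseconds)
    {
        int samplesPerFrame = samplesPerSecond * frameMilliseconds / 1000; // 8000 * 20 / 1000 = 160
        int bytesPerSample = bitsPerSample / 8;                            // 16 / 8 = 2
        return samplesPerFrame * bytesPerSample * channels;
    }
}
```

Calling AudioFrameMath.BytesPerFrame(8000, 16, 1, 20) gives 320, which matches BUFFER_SIZE. If you later switch the stream to 16 kHz audio, the same 20 ms frame would need 640 bytes.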

Next, add the following fields to the class:

  • _config - this will hold the subscription/regional info of the speech analyzer. The region in the demo is eastus - derived from the region you configured your speech service for. For a mapping of region to input string see the Microsoft Azure Speech Service Supported Regions Documentation

  • _inputStream - this is going to be a push Stream that will serve as the buffer that will be streamed over to the Azure Speech-to-text service

  • _audioInput - this will be the input for the speech recognizer

  • _recognizer - this will be the recognizer that will perform the speech recognition task

These can be defined like so:

SpeechConfig _config = SpeechConfig.FromSubscription("your_subscription_key", "your_azure_region"); // e.g. eastus
PushAudioInputStream _inputStream = AudioInputStream.CreatePushStream(AudioStreamFormat.GetWaveFormatPCM(SAMPLES_PER_SECOND, BITS_PER_SAMPLE, NUMBER_OF_CHANNELS));
AudioConfig _audioInput;
SpeechRecognizer _recognizer;
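
Rather than pasting the subscription key into source, you may prefer to read it from the environment. The following is a sketch under assumptions: the SpeechSettings class and the variable names AZURE_SPEECH_KEY and AZURE_SPEECH_REGION are hypothetical and can be replaced with whatever fits your deployment.

```csharp
using System;

public static class SpeechSettings
{
    // Hypothetical environment variable names, chosen for this example.
    public const string KeyVariable = "AZURE_SPEECH_KEY";
    public const string RegionVariable = "AZURE_SPEECH_REGION";

    // Returns the variable's value, or throws a descriptive error if it is missing.
    public static string Require(string variableName)
    {
        string value = Environment.GetEnvironmentVariable(variableName);
        if (string.IsNullOrWhiteSpace(value))
        {
            throw new InvalidOperationException($"Environment variable '{variableName}' is not set.");
        }
        return value;
    }
}
```

The _config field could then be initialized with SpeechConfig.FromSubscription(SpeechSettings.Require(SpeechSettings.KeyVariable), SpeechSettings.Require(SpeechSettings.RegionVariable)).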

Given that several of the fields are IDisposable, have this class implement IDisposable and dispose of all the disposable fields on its way down:

public void Dispose()
{
    // _recognizer stays null until StartSpeechTranscriptionEngine has run.
    _recognizer?.Dispose();
    _audioInput?.Dispose();
    _inputStream.Dispose();
}

Then add a constructor that will initialize the _audioInput with the push input stream defined above:

public TranscriptionEngine()
{
    _audioInput = AudioConfig.FromStreamInput(_inputStream);
}

Next, add the method that will listen for speech recognition events from the recognizer

private void RecognizerRecognized(object sender, SpeechRecognitionEventArgs e)
{
    Trace.WriteLine("Recognized: " + e.Result.Text);
}

From here, you can add methods to start and stop the speech recognizer.

The start method accepts a language string, sets the language of the SpeechConfig, initializes the recognizer with the config and audio input source, registers the RecognizerRecognized event you created earlier, and starts a Continuous Recognition.

Your stop method will unregister the RecognizerRecognized event and stop the recognizer.

NOTE: The StopContinuousRecognitionAsync can take upwards of 20 seconds as there isn't a mechanism to cancel the currently running input stream as of this writing. This demo explicitly mitigates this issue by not reusing the recognizer between calls, and not blocking the shutdown of the socket for this to complete.

public async Task StartSpeechTranscriptionEngine(string language)
{
    _config.SpeechRecognitionLanguage = language;
    _recognizer = new SpeechRecognizer(_config, _audioInput);
    _recognizer.Recognized += RecognizerRecognized;
    await _recognizer.StartContinuousRecognitionAsync();
}

private async Task StopTranscriptionEngine()
{
    if(_recognizer != null)
    {
        _recognizer.Recognized -= RecognizerRecognized;
        await _recognizer.StopContinuousRecognitionAsync();
    }
}
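
If you want to cap how long a caller waits on the slow stop call described in the note above, one option is to race it against a timer. This is a sketch only: the TaskTimeouts helper is hypothetical, and it does not cancel the underlying task, which keeps running in the background.

```csharp
using System;
using System.Threading.Tasks;

public static class TaskTimeouts
{
    // Returns true if 'task' finished within 'timeout', false otherwise.
    // The task is not cancelled when the timeout wins the race.
    public static async Task<bool> CompletesWithin(Task task, TimeSpan timeout)
    {
        Task finished = await Task.WhenAny(task, Task.Delay(timeout));
        if (finished != task)
        {
            return false;
        }

        await task; // Surface any exception thrown by the completed task.
        return true;
    }
}
```

For example, await TaskTimeouts.CompletesWithin(_recognizer.StopContinuousRecognitionAsync(), TimeSpan.FromSeconds(2)) would wait at most two seconds. When it returns false the recognizer is still stopping, so take care not to dispose it until the stop has finished.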

The final task for this class is to receive audio on the websocket you will be setting up and push it through to the PushAudioInputStream you created earlier. This method is awaited once the websocket is established and runs until the websocket is closed.

public async Task ReceiveAudioOnWebSocket(HttpContext context, WebSocket webSocket)
{
    var buffer = new byte[BUFFER_SIZE];

    try
    {
        var language = "en-US";
        await StartSpeechTranscriptionEngine(language);
        WebSocketReceiveResult result = await webSocket.ReceiveAsync(new ArraySegment<byte>(buffer), CancellationToken.None);
        while (!result.CloseStatus.HasValue)
        {
            // Push only binary audio frames to the recognizer; the initial text frame carries call metadata.
            if (result.MessageType == WebSocketMessageType.Binary)
            {
                _inputStream.Write(buffer, result.Count);
            }

            result = await webSocket.ReceiveAsync(new ArraySegment<byte>(buffer), CancellationToken.None);
        }
        await webSocket.CloseAsync(result.CloseStatus.Value, result.CloseStatusDescription, CancellationToken.None);
    }
    catch (Exception e)
    {
        Trace.WriteLine(e.ToString());
    }
    finally
    {
        await StopTranscriptionEngine();
    }
}

NOTE: The first message you receive on the websocket contains metadata for the call. Should you wish, you may extract this data from the first ReceiveAsync. For the sake of the demo this is not done, and only the audio is passed on to the recognizer.
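
If you do want to read that metadata, a sketch along these lines may help. It assumes the first message is a UTF-8 JSON object with a "content-type" property, similar to {"event":"websocket:connected","content-type":"audio/l16;rate=8000"}; check the Voice API documentation for the exact payload. The CallMetadata helper is hypothetical.

```csharp
using System.Text;
using System.Text.Json;

public static class CallMetadata
{
    // Returns the "content-type" value from the initial text frame,
    // or null if the payload is not a JSON object with that property.
    public static string ReadContentType(byte[] buffer, int count)
    {
        string json = Encoding.UTF8.GetString(buffer, 0, count);
        try
        {
            using (JsonDocument document = JsonDocument.Parse(json))
            {
                JsonElement root = document.RootElement;
                if (root.ValueKind == JsonValueKind.Object &&
                    root.TryGetProperty("content-type", out JsonElement value) &&
                    value.ValueKind == JsonValueKind.String)
                {
                    return value.GetString();
                }
            }
        }
        catch (JsonException)
        {
            // The frame was not valid JSON; treat it as having no metadata.
        }
        return null;
    }
}
```

You would call CallMetadata.ReadContentType(buffer, result.Count) right after the first ReceiveAsync. If the metadata does not fit in a single 320-byte read, you would first need to accumulate reads until result.EndOfMessage is true.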

Setting up WebSockets in Your App

Open Startup.cs.

In the Configure method, enable websockets on the server and add a piece of middleware that accepts inbound websocket connections on the /ws path and hands each one to a TranscriptionEngine, which receives the audio over that socket.

var webSocketOptions = new WebSocketOptions()
{
    KeepAliveInterval = TimeSpan.FromSeconds(120),
    ReceiveBufferSize = 320
};

app.UseWebSockets(webSocketOptions);

app.Use(async (context, next) =>
{
    if (context.Request.Path == "/ws")
    {
        if (context.WebSockets.IsWebSocketRequest)
        {
            WebSocket webSocket = await context.WebSockets.AcceptWebSocketAsync();
            using (var engine = new TranscriptionEngine())
            {
                await engine.ReceiveAudioOnWebSocket(context, webSocket);
            }
        }
        else
        {
            context.Response.StatusCode = 400;
        }
    }
    else
    {
        await next();
    }
});

Setting up the Voice Controller

The last piece of code to implement is a Voice Controller. Add a new controller under the Controllers folder and name it VoiceController. Add a constant string for the BASE_URL of your service.

const string BASE_URL = "BASE_URL";

Your Voice Controller will have two HTTP endpoints: a POST endpoint for the event webhook and a GET endpoint for the answer webhook. The GET endpoint constructs an NCCO with a single connect action, which instructs the Voice API to open a WebSocket to your server and push the audio stream over that socket. Set the content type to 16-bit linear PCM at 8 kHz. See below:

[HttpPost]
public HttpStatusCode Events()
{
    return HttpStatusCode.OK;
}

[HttpGet]
public string Answer()
{
    var webSocketAction = new ConnectAction()
    {
        Endpoint = new[]
        {
            new WebsocketEndpoint()
            {
                Uri = $"wss://{BASE_URL}/ws",
                ContentType = "audio/l16;rate=8000"
            }
        }
    };

    var ncco = new Ncco(webSocketAction);
    return ncco.ToString();
}
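
To check what the Answer endpoint returns, it can help to know the JSON shape of a connect NCCO. The sketch below builds an equivalent payload by hand with System.Text.Json rather than the SDK, so the property order and any extra fields the SDK emits may differ; the NccoSketch class is hypothetical.

```csharp
using System.Collections.Generic;
using System.Text.Json;

public static class NccoSketch
{
    // Builds a one-action NCCO that connects the call to a websocket endpoint.
    public static string BuildConnectNcco(string baseUrl)
    {
        var endpoint = new Dictionary<string, object>
        {
            ["type"] = "websocket",
            ["uri"] = $"wss://{baseUrl}/ws",
            ["content-type"] = "audio/l16;rate=8000"
        };

        var action = new Dictionary<string, object>
        {
            ["action"] = "connect",
            ["endpoint"] = new[] { endpoint }
        };

        // An NCCO is a JSON array of actions.
        return JsonSerializer.Serialize(new[] { action });
    }
}
```

Comparing this against the response from a GET request to /voice/answer is a quick way to confirm the websocket URI and content type before placing a call.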

Getting Up and Running

Setting up IIS Express

Open the properties dialog for your project and, under Debug, take note of the port number. For the demo, disable SSL, which will make setting up ngrok easier.

Setting up the Ngrok Tunnel

For the Voice API to forward the Event/Answer webhooks, you need to expose the site to the internet. For testing purposes, you can use ngrok to expose your IIS Express port. Open your command line and run this command, replacing PORT_NUMBER with your IIS Express instance's port number.

ngrok http --host-header="localhost:PORT_NUMBER" http://localhost:PORT_NUMBER

This command produces an output like this:

(Image: ngrok config)

Setting up the Vonage Voice Application

The next step is going to be setting up a Vonage Voice Application.

  • Navigate to your Vonage Dashboard

  • In the left-hand pane open Voice, and click on 'Create an application'

  • Name the application, e.g. 'TranscriptionTest'

  • Under Capabilities enable Voice

  • For Event URL add base_url_of_ngrok_tunnel/voice/events

  • For Answer URL add base_url_of_ngrok_tunnel/voice/answer

  • For Fallback answer URL add base_url_of_ngrok_tunnel/voice/answer

Final Touches

Now that you have the ngrok URL, change the BASE_URL in the VoiceController file to that URL (excluding the 'http://').
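
If you would rather not trim the scheme by hand, the host portion can be derived from the forwarding URL. This is a small sketch using System.Uri; the NgrokUrl helper is hypothetical.

```csharp
using System;

public static class NgrokUrl
{
    // Strips the scheme and any path, leaving host (and port, if non-default).
    public static string ToBaseUrl(string forwardingUrl)
    {
        return new Uri(forwardingUrl).Authority;
    }
}
```

The result is the value to place in BASE_URL, so that $"wss://{BASE_URL}/ws" forms a valid websocket URI.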

With this, you are up and running. Call the Vonage Number linked to your application, and the App will transcribe your speech to the debug console.

Steve Lorello, Vonage Alumni

Former .NET Developer Advocate @Vonage, full-stack polyglottic Software Engineer, AI/ML Grad Student, avid runner, and an even more avid traveler.
