Video Connector Pipecat Integration
The Vonage Video Connector transport for Pipecat enables you to build applications that seamlessly participate in Vonage Video API sessions. This transport allows you to receive audio and video from session participants and send processed audio and video back to the session in real-time.
Pipecat is a framework for building voice and multimodal conversational AI applications. The Vonage Video Connector transport bridges Pipecat's media processing pipeline with Vonage Video API sessions, enabling a wide range of use cases:
- Real-time voice and video AI assistants
- Live transcription and translation services
- Call recording and analysis
- Audio and video effects processing
- Automated moderation and content filtering
- Custom media processing and manipulation
The transport handles audio and video format conversion, session management, and WebRTC connectivity, allowing you to focus on building your application logic.
This page includes the following sections:
- Private beta
- Requirements
- Basic setup
- Transport configuration
- Audio and video handling
- Stream subscription
- Session management
- Pipeline integration
- Best practices
Private Beta
The Vonage Video Connector Pipecat integration is currently in private beta. Contact us to request early access.
Requirements
To use this transport, you need the Vonage Video Connector Python library, which runs on Linux AMD64 and ARM64 platforms.
Basic Setup
Authentication and session parameters
To use the Vonage Video Connector transport, you need:
- Application ID - Your Vonage Video API application identifier
- Session ID - The ID of the Video API session you want to join
- Token - A valid participant token for the session
These parameters can be obtained from your Vonage Video API dashboard or generated using the Vonage Video API server SDKs.
Initialize the transport
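A minimal initialization sketch is shown below. The class names VonageVideoTransport and VonageVideoParams, the import path, and the exact constructor arguments are illustrative; check the library's reference documentation for the actual API.

```python
# Illustrative import path and class names; the actual package may differ.
from vonage_video_connector import VonageVideoParams, VonageVideoTransport

transport = VonageVideoTransport(
    application_id="YOUR_APPLICATION_ID",  # from your Vonage dashboard
    session_id="YOUR_SESSION_ID",          # the Video API session to join
    token="YOUR_PARTICIPANT_TOKEN",        # generated with a server SDK
    params=VonageVideoParams(
        audio_in_enabled=True,
        audio_out_enabled=True,
    ),
)
```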
Transport Configuration
Basic audio and video parameters
Configure the transport for your specific audio and video requirements:
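For example, the sketch below follows Pipecat's TransportParams naming conventions; the exact parameter names available on the Vonage transport may differ.

```python
params = VonageVideoParams(
    audio_in_enabled=True,
    audio_in_sample_rate=16000,   # 16 kHz suits most speech recognition services
    audio_out_enabled=True,
    audio_out_sample_rate=24000,  # higher rate improves text-to-speech playback
    video_in_enabled=True,
    video_out_enabled=True,
)
```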
Voice Activity Detection (VAD)
We recommend using this transport with Voice Activity Detection to optimize audio processing:
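For example, Pipecat ships a Silero-based analyzer (installed with the pipecat-ai[silero] extra; the import path below may vary across Pipecat versions):

```python
from pipecat.audio.vad.silero import SileroVADAnalyzer

params = VonageVideoParams(
    audio_in_enabled=True,
    audio_out_enabled=True,
    vad_analyzer=SileroVADAnalyzer(),  # flags speech segments in incoming audio
)
```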
VAD helps reduce unnecessary processing by detecting when speech is present in the audio stream.
Buffer clearing on interruptions
The clear_buffers_on_interruption parameter determines whether media buffers are automatically cleared when an interruption frame is received in the pipeline. A configuration sketch follows the lists below.
When to enable (True, default):
- Conversational AI applications where you want to stop playback immediately when the user interrupts
- Interactive voice assistants that need to respond quickly to user input
- Applications where outdated audio/video should be discarded to maintain real-time interaction
- Scenarios where minimizing latency is more important than completing media playback
When to disable (False):
- Recording or streaming applications where you want to preserve all media
- Applications that need to complete playback of important information even if interrupted
- Batch processing scenarios where media should be processed sequentially without interruption
- Use cases where you're implementing custom interruption handling logic
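As a sketch, assuming the parameter is set on the transport's params object (VonageVideoParams is the same illustrative name used throughout this page):

```python
# Conversational assistant: drop queued media as soon as the user interrupts.
interactive_params = VonageVideoParams(
    audio_out_enabled=True,
    clear_buffers_on_interruption=True,  # the default
)

# Recording or streaming: keep buffered media even when an interruption occurs.
recording_params = VonageVideoParams(
    audio_out_enabled=True,
    clear_buffers_on_interruption=False,
)
```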
Audio and Video Handling
Audio and video input processing
The transport automatically converts incoming audio and video from the Vonage Video session to Pipecat's internal media formats:
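In a pipeline, the converted media arrives through the transport's input processor:

```python
pipeline = Pipeline([
    transport.input(),  # emits Pipecat audio/video frames from the session
    # ... your AI processing pipeline
])
```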
Audio and video output generation
Send audio and video back to the Vonage Video session:
```python
# Audio and video output is sent automatically through the pipeline
pipeline = Pipeline([
    # ... your AI processing pipeline
    transport.output(),  # Sends audio and video to the Vonage session
])
```
Stream Subscription
When the transport subscribes to streams from session participants, it generates Pipecat frames that your pipeline can process. The behavior differs between audio and video:
Video streams: The transport generates individual video frames for each subscribed stream, identified by the stream ID. This allows you to process video from different participants separately in your pipeline.
Audio streams: The transport currently generates audio frames with all subscribed audio streams mixed together. All participant audio is combined into a single audio stream that your pipeline receives.
By default, the transport automatically subscribes to streams based on the audio_in_auto_subscribe and video_in_auto_subscribe parameters. You can also manually control which streams to subscribe to for more fine-grained control.
Manual stream subscription
If you need more control over which streams to subscribe to, you can disable auto-subscription and manually subscribe to specific streams:
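The sketch below disables auto-subscription and subscribes from an event handler. The event name on_stream_received and the method subscribe_to_stream are illustrative; check the library reference for the actual names.

```python
transport = VonageVideoTransport(
    application_id="YOUR_APPLICATION_ID",
    session_id="YOUR_SESSION_ID",
    token="YOUR_PARTICIPANT_TOKEN",
    params=VonageVideoParams(
        audio_in_enabled=True,
        video_in_enabled=True,
        audio_in_auto_subscribe=False,  # decide per stream instead
        video_in_auto_subscribe=False,
    ),
)

# Illustrative event and method names:
@transport.event_handler("on_stream_received")
async def on_stream_received(transport, stream):
    # Apply your own selection logic, e.g. based on stream or participant metadata
    if stream.name == "presenter":
        await transport.subscribe_to_stream(stream.id)
```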
When to use automatic subscription (audio_in_auto_subscribe=True, video_in_auto_subscribe=True, default):
- Simple applications where you want to receive all streams from all participants
- Voice or video assistants that need to interact with everyone in the session
- Recording or monitoring applications that should capture all participants
- Use cases where minimizing code complexity is more important than selective subscription
- Applications where all participants should be treated equally
When to use manual subscription (audio_in_auto_subscribe=False, video_in_auto_subscribe=False):
- Applications that need to selectively subscribe based on participant metadata or session logic
- Scenarios where you want to optimize bandwidth by subscribing only to specific streams
- Use cases requiring custom subscription settings per participant (different quality levels)
- Applications that need to validate or authenticate participants before subscribing
- Complex multi-party scenarios where you want fine-grained control over which streams to receive
Controlling video quality with simulcast
When subscribing to video streams, you can control the video quality you receive using the preferred_resolution and preferred_framerate parameters. These parameters are particularly useful when the publisher is sending simulcast streams (multiple quality layers).
Simulcast streams contain multiple spatial and temporal layers:
- Spatial layers: Different resolutions (e.g., 1280x720, 640x480, 320x240)
- Temporal layers: Different framerates (e.g., 30fps, 15fps, 7.5fps)
By specifying your preferred resolution and framerate, you can optimize bandwidth usage and processing requirements:
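For example (the value types shown, a resolution tuple and an integer framerate, are assumptions, and subscribe_to_stream is the same illustrative method used above):

```python
# Global preferences applied to auto-subscribed video streams:
params = VonageVideoParams(
    video_in_enabled=True,
    video_in_preferred_resolution=(640, 480),  # closest spatial layer is selected
    video_in_preferred_framerate=15,           # closest temporal layer is selected
)

# Per-stream preferences when subscribing manually:
await transport.subscribe_to_stream(
    stream.id,
    preferred_resolution=(640, 480),
    preferred_framerate=15,
)
```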
Important notes:
- If the publisher doesn't support simulcast or the requested layer isn't available, the server will provide the closest available quality
- Lower resolutions and framerates reduce bandwidth consumption and processing overhead
- These settings only affect video subscription; they don't control the publisher's output quality
- You can also set global preferences using video_in_preferred_resolution and video_in_preferred_framerate in the transport parameters for auto-subscribed streams
Session Management
Session lifecycle events
Handle session join and leave events:
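For example (the event names on_joined and on_left are illustrative; consult the transport reference for the actual names):

```python
from loguru import logger

@transport.event_handler("on_joined")
async def on_joined(transport):
    logger.info("Joined the Vonage Video session")

@transport.event_handler("on_left")
async def on_left(transport):
    logger.info("Left the Vonage Video session")
```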
Participant events
Monitor participants joining and leaving the session:
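A sketch with illustrative event names (on_participant_joined and on_participant_left):

```python
@transport.event_handler("on_participant_joined")
async def on_participant_joined(transport, participant):
    logger.info(f"Participant joined: {participant}")

@transport.event_handler("on_participant_left")
async def on_participant_left(transport, participant):
    logger.info(f"Participant left: {participant}")
```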
Client connection events
Monitor individual stream subscriber connections:
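The names on_client_connected and on_client_disconnected follow Pipecat's usual transport event naming, but verify them against the library reference:

```python
@transport.event_handler("on_client_connected")
async def on_client_connected(transport, client):
    logger.info(f"Subscriber connected: {client}")

@transport.event_handler("on_client_disconnected")
async def on_client_disconnected(transport, client):
    logger.info(f"Subscriber disconnected: {client}")
```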
Pipeline Integration
Complete pipeline example
Here's how to integrate the Vonage Video Connector transport with a complete AI pipeline:
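The sketch below wires the transport into a speech-to-speech loop. The Vonage class names remain illustrative, the STT/LLM/TTS services are interchangeable examples, and Pipecat service import paths vary across versions.

```python
import asyncio

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext

# Example services; any Pipecat-compatible STT/LLM/TTS services work here,
# and their import paths differ across Pipecat versions.
from pipecat.services.cartesia import CartesiaTTSService
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.services.openai import OpenAILLMService

# Illustrative names for the Vonage transport classes:
from vonage_video_connector import VonageVideoParams, VonageVideoTransport


async def main():
    transport = VonageVideoTransport(
        application_id="YOUR_APPLICATION_ID",
        session_id="YOUR_SESSION_ID",
        token="YOUR_PARTICIPANT_TOKEN",
        params=VonageVideoParams(
            audio_in_enabled=True,
            audio_out_enabled=True,
            vad_analyzer=SileroVADAnalyzer(),
        ),
    )

    stt = DeepgramSTTService(api_key="YOUR_DEEPGRAM_KEY")
    llm = OpenAILLMService(api_key="YOUR_OPENAI_KEY", model="gpt-4o")
    tts = CartesiaTTSService(api_key="YOUR_CARTESIA_KEY", voice_id="YOUR_VOICE_ID")

    context = OpenAILLMContext(
        messages=[{"role": "system", "content": "You are a helpful assistant."}]
    )
    context_aggregator = llm.create_context_aggregator(context)

    pipeline = Pipeline([
        transport.input(),               # audio from session participants
        stt,                             # speech-to-text
        context_aggregator.user(),       # add user speech to the LLM context
        llm,                             # generate a response
        tts,                             # text-to-speech
        transport.output(),              # audio back into the Vonage session
        context_aggregator.assistant(),  # record the assistant's reply
    ])

    runner = PipelineRunner()
    await runner.run(PipelineTask(pipeline))


if __name__ == "__main__":
    asyncio.run(main())
```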
Best Practices
Performance optimization
Choose appropriate sample rates:
- Use 16kHz for speech recognition and most AI services
- Use 24kHz or higher for better text-to-speech quality
- Avoid unnecessary high sample rates that increase processing load
Optimize pipeline processing:
- Keep AI processing pipelines efficient to minimize latency
- Use appropriate frame sizes and buffer management
- Consider using VAD to reduce unnecessary processing
Debugging and monitoring
Enable logging:
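Pipecat logs through loguru, so enabling DEBUG output is typically enough to see frame flow and transport events during development:

```python
import sys

from loguru import logger

logger.remove()                        # drop the default handler
logger.add(sys.stderr, level="DEBUG")  # verbose output for development
```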