Feature: Bidirectional realtime voice (WebRTC + OpenAI Realtime API) #1

@MervinPraison

Overview

PraisonAI's praisonai realtime command currently launches a Chainlit-based voice UI that uses cl.on_audio_chunk + OpenAI's Realtime API (RealtimeClient) for bidirectional speech-to-speech. To complete the Chainlit → PraisonAIUI migration (MervinPraison/PraisonAI#1443), aiui needs a first-class Realtime voice feature with WebRTC-style bidirectional audio.

Current aiui state

  • praisonaiui/features/tts.py — one-way TTS only (OpenAI TTS + browser Web Speech API).
  • No microphone input, no realtime audio streaming, no WebRTC.

Requested feature

A new feature analogous to tts.py — protocol-driven, lazy-loaded — that supports:

  1. Bidirectional audio: mic capture in the browser → streamed to backend → forwarded to OpenAI Realtime API (gpt-4o-realtime-preview) → audio chunks back to browser → playback.
  2. WebSocket or WebRTC transport (WebRTC preferred; OpenAI now supports direct ephemeral-token WebRTC sessions).
  3. Protocol: RealtimeProtocol ABC with create_session(), send_audio(), receive_audio(), matching the pattern of TTSProtocol.
  4. Backends: OpenAIRealtimeManager (primary), room for Anthropic/ElevenLabs later.
  5. Message integration: realtime transcripts surface as normal RunEvent.RUN_CONTENT so existing Chat/Agents/Dashboard pages show the transcript in realtime.
  6. Tool-call support: OpenAI Realtime can call tools; events must bridge to RunEventType.TOOL_CALL_STARTED/COMPLETED.
  7. Dashboard page: realtime sidebar page with mic button, waveform, transcript panel.
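To make item 3 concrete, here is a minimal sketch of what the protocol could look like. Only the three method names come from this issue. The signatures, the session_id parameter, and the EchoRealtimeManager test double (including its close_session helper) are assumptions for illustration, not existing aiui code.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import AsyncIterator, Dict


class RealtimeProtocol(ABC):
    """Hypothetical contract: method names are from the issue, signatures are assumed."""

    @abstractmethod
    async def create_session(self, session_id: str) -> None:
        """Open a realtime session identified by session_id."""

    @abstractmethod
    async def send_audio(self, session_id: str, chunk: bytes) -> None:
        """Forward one chunk of microphone audio to the backend."""

    @abstractmethod
    def receive_audio(self, session_id: str) -> AsyncIterator[bytes]:
        """Yield audio chunks coming back from the backend."""


class EchoRealtimeManager(RealtimeProtocol):
    """Illustrative in-memory backend that echoes audio back (no network calls)."""

    def __init__(self) -> None:
        self._queues: Dict[str, asyncio.Queue] = {}

    async def create_session(self, session_id: str) -> None:
        self._queues[session_id] = asyncio.Queue()

    async def send_audio(self, session_id: str, chunk: bytes) -> None:
        await self._queues[session_id].put(chunk)

    async def receive_audio(self, session_id: str) -> AsyncIterator[bytes]:
        queue = self._queues[session_id]
        while True:
            chunk = await queue.get()
            if chunk is None:  # sentinel pushed by close_session
                break
            yield chunk

    async def close_session(self, session_id: str) -> None:
        # Not part of the requested protocol; added here so the demo can terminate.
        await self._queues[session_id].put(None)


async def demo() -> list:
    manager = EchoRealtimeManager()
    await manager.create_session("s1")
    await manager.send_audio("s1", b"\x00\x01")
    await manager.send_audio("s1", b"\x02\x03")
    await manager.close_session("s1")
    return [chunk async for chunk in manager.receive_audio("s1")]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

An echo backend like this could also serve as the fixture for the protocol-conformance and session-lifecycle tests, since it exercises the same three calls that OpenAIRealtimeManager would have to implement.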

Why this blocks PraisonAI

Without this, praisonai realtime has no aiui replacement and we cannot drop chainlit from praisonai[realtime] / praisonai[all]. See MervinPraison/PraisonAI#1443 Phase 1.

Acceptance criteria

  • aiui.set_realtime(OpenAIRealtimeManager()) wires realtime voice
  • @aiui.realtime decorator (or similar) to customise per-session behaviour
  • Built-in realtime dashboard page renders mic + playback UI
  • Works with ephemeral OpenAI tokens (no API key in browser)
  • Integrates with BaseProvider so Core SDK agents can be the realtime brain
  • Unit tests for protocol conformance + session lifecycle
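The ephemeral-token criterion could be met by having the backend mint a short-lived secret and hand only that to the browser. The sketch below uses the standard library only. The endpoint URL and the client_secret.value response field are written from memory of OpenAI's Realtime documentation and should be verified against the current docs. The function names are hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict

# Assumption: endpoint as recalled from OpenAI's Realtime docs; verify before use.
OPENAI_REALTIME_SESSIONS_URL = "https://api.openai.com/v1/realtime/sessions"


def build_session_request(
    api_key: str, model: str = "gpt-4o-realtime-preview"
) -> urllib.request.Request:
    """Build the server-side request; the real API key never leaves the backend."""
    body = json.dumps({"model": model}).encode("utf-8")
    return urllib.request.Request(
        OPENAI_REALTIME_SESSIONS_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_ephemeral_token(payload: Dict[str, Any]) -> str:
    """Pull the short-lived secret out of the response (assumed field names)."""
    try:
        return payload["client_secret"]["value"]
    except (KeyError, TypeError) as exc:
        raise ValueError("response did not contain client_secret.value") from exc


def mint_ephemeral_token(api_key: str, model: str = "gpt-4o-realtime-preview") -> str:
    """Perform the network call and return the token to pass to the browser."""
    request = build_session_request(api_key, model)
    with urllib.request.urlopen(request, timeout=10) as response:
        return extract_ephemeral_token(json.load(response))
```

Splitting request construction and response parsing from the network call keeps both halves unit-testable without contacting OpenAI.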
