Brings Voices to AgentScope #1194
Replies: 2 comments 1 reply
The layering described here is one of the more compelling parts of the post because it acknowledges that realtime voice agents are as much a systems problem as a model problem. The part that seems most important for production is not just latency, but interrupt semantics, backpressure, and how state survives partial failures during long sessions.
Excellent roadmap — the three-stage framing (TTS → Omni → Realtime) is a useful mental model for communicating production trade-offs to teams evaluating voice agents. The point raised about "Production-Grade Deployment" is worth expanding, especially for teams targeting telephony (phone calls over SIP/PSTN) as a deployment channel rather than browser-based WebSocket audio, which brings additional production challenges of its own.
On the "Memory and Continuity" challenge — one practical pattern for telephony is to use a separate media-processing layer (SIP stack / CPaaS) to handle RTP, codecs, and VAD, and expose a clean text-only interface to the agent. This separates the realtime audio concerns from the agent's context-window management. VoIPBin is an open-source CPaaS built around this model: RTP/STT/TTS are handled server-side, while the agent logic works in pure text over REST/WebSocket.

The ChatRoom multi-agent concept is genuinely interesting for conference-call use cases — multiple agents participating in a call, each maintaining its own context, is a hard problem that most current frameworks do not address well.

(Disclosure: I work on VoIPBin, but the telephony notes above apply regardless of which stack you use.)
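The separation the comment describes can be sketched as follows. Every name here is a hypothetical stand-in, not the VoIPBin API: the point is only that the media layer owns audio (RTP, VAD, STT) while the agent sees nothing but text.

```python
# Illustrative sketch: a media layer that converts telephony audio to text,
# and an agent that works purely on text. All class names are hypothetical.

class MediaLayer:
    """Stands in for the CPaaS side: audio in, transcript out."""
    def transcribe(self, rtp_frames) -> str:
        # A real system would run jitter buffering, VAD, and STT here;
        # we fake a finished transcript for illustration.
        return "caller: what are your opening hours?"

class TextOnlyAgent:
    """The agent never touches audio; it sees only clean text turns."""
    def reply(self, transcript: str) -> str:
        if "opening hours" in transcript:
            return "We are open 9am to 5pm, Monday to Friday."
        return "Could you repeat that?"

media = MediaLayer()
agent = TextOnlyAgent()
text_in = media.transcribe(rtp_frames=[])   # media layer owns codecs/VAD
text_out = agent.reply(text_in)             # agent logic is pure text
print(text_out)
```

The design benefit is that the agent's context window contains only transcripts, so realtime audio failures (packet loss, codec renegotiation) never corrupt agent state.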
By Bingchen Qian and Dawei Gao, Jan 29, 2026
Voice interaction represents a significant shift in how we conceptualize and build agent systems. In this post, we’ll share our vision for voice agents within AgentScope, detailing our design philosophy, current progress, and the roadmap toward truly interactive, realtime AI.
The Voice Agent
Voice agents are a fantastic direction. Compared with our current text-based ReAct agent, they offer a far richer user interaction, bringing tone and attitude into the conversation. And thanks to rapid progress in voice model APIs, we can now actually build them.
Our Vision
At the core of AgentScope is a fundamental belief: an agent should be an independent entity that exists autonomously from the environment or specific applications, rather than a task-bound script. This philosophy is the North Star for our voice agents.
To realize this, we are doubling down on three key areas, even where current technology faces significant hurdles.
Roadmap
Rome Wasn't Built in a Day
Designing a voice agent is a study in managing trade-offs. While the long-term goal is seamless, human-like interaction, the reality is that different technical architectures offer vastly different levels of production-readiness.
We have structured the evolution of voice in AgentScope into three distinct stages: Text-to-Speech (TTS), Omni models, and Realtime.
Each stage in this progression represents a fundamental shift in how the agent handles the information loop—moving from traditional asynchronous synthesis toward the eventual goal of bidirectional, realtime streaming.
Stage 1. Text-to-Speech
While "omni" models represent the future, Text-to-Speech (TTS) remains the most robust and production-ready foundation for voice agnets today. However, we still face the following bottlenecks in production:
For latency, we integrate realtime TTS models (such as DashScope's realtime TTS API). By feeding the LLM's text stream directly into the TTS engine as it is generated, we drastically reduce the "Time to First Sound". This ensures that the agent begins speaking almost as soon as it begins thinking, maintaining the cadence of a natural conversation.
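The stream-fed pattern can be sketched as follows, assuming a hypothetical realtime TTS session object with a `send_text()` method (vendor APIs differ in the details; this is not the DashScope interface):

```python
# Sketch: forward each LLM text delta to the TTS session as it arrives,
# instead of buffering the full response. "Time to First Sound" then tracks
# the first delta, not the whole generation.

import time

class FakeRealtimeTTS:
    """Hypothetical stand-in for a streaming TTS session."""
    def __init__(self):
        self.first_sound_at = None
        self.audio_chunks = []

    def send_text(self, delta: str):
        # A real session starts synthesizing audio as soon as text arrives.
        if self.first_sound_at is None:
            self.first_sound_at = time.monotonic()
        self.audio_chunks.append(f"<audio:{delta}>")

def llm_stream():
    # Stand-in for an LLM token/delta stream.
    yield from ["Hello", " there", ", how", " can I help?"]

tts = FakeRealtimeTTS()
start = time.monotonic()
for delta in llm_stream():
    tts.send_text(delta)            # forward immediately: no buffering
ttfs = tts.first_sound_at - start   # ~ latency of the first delta only
print(len(tts.audio_chunks), ttfs >= 0)
```

The key property is that `first_sound_at` is set after the first delta, so the user hears audio while the rest of the response is still being generated.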
For context-filtering, we tested OpenAI, Gemini, and DashScope TTS APIs. Some offer basic filtering—markdown code blocks get caught, for instance. But it's inconsistent. What we need is configurable pre-processing before text hits the TTS layer. Filter code blocks. Strip metadata. Handle structured outputs intelligently.
We're building that filtering layer now (issue).
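As an illustration of what such a pre-processing layer might do, here is a minimal rule-based filter. The rules and function names are hypothetical, not the actual AgentScope implementation:

```python
# Sketch of configurable TTS pre-processing: strip markdown constructs
# that should never be read aloud. Rules run in order; fenced code blocks
# must be removed before inline-code unwrapping.

import re

FILTERS = [
    (re.compile(r"```.*?```", re.S), " "),          # drop fenced code blocks
    (re.compile(r"`([^`]+)`"), r"\1"),              # inline code -> plain text
    (re.compile(r"^#{1,6}\s*", re.M), ""),          # strip heading markers
    (re.compile(r"\[([^\]]+)\]\([^)]+\)"), r"\1"),  # links -> link text only
]

def tts_preprocess(text: str) -> str:
    for pattern, repl in FILTERS:
        text = pattern.sub(repl, text)
    # Collapse leftover whitespace so the TTS cadence stays natural.
    return re.sub(r"\s+", " ", text).strip()

sample = ("# Result\nRun `pip install agentscope`:\n"
          "```bash\npip install agentscope\n```\n"
          "See [docs](https://example.com).")
print(tts_preprocess(sample))
# -> Result Run pip install agentscope: See docs.
```

A production version would make the rule list configurable per agent, which is exactly the gap the filtering layer above is meant to close.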
Try it yourself:
Go further: Building a multi-agent werewolf game with TTS capabilities (Source Code).
werewolf_voice_agent.mp4
Stage 2. Omni Models
Compared to TTS, omni models (such as gpt-audio, Qwen3-omni) offer end-to-end multimodal understanding. They don't just "read" text; they perceive and generate audio natively, capturing the emotional prosody and subtle cues that are often lost in translation.
We have integrated these premier Omni models into the existing ReAct agent ecosystem. For developers, this means that moving from a text-based agent to a voice-native agent is as simple as a configuration change. Our implementation ensures that the Omni-powered ReActAgent retains its full suite of advanced capabilities: realtime interruption, high-level reasoning (planning, RAG, and tool use), and short/long-term memory.
Just initialize an Omni agent with the following snippet:
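The original snippet is not reproduced here, so the following is a hypothetical sketch of the claim that voice-enablement is just a configuration change: same agent class, different model config. The class and config keys are illustrative, not the real AgentScope API.

```python
# Illustrative only: swapping a text model config for an omni model config
# turns the same ReAct-style agent voice-native, with no code changes.

class ReActAgentSketch:
    """Hypothetical stand-in for a ReAct agent parameterized by model config."""
    def __init__(self, name: str, model_config: dict):
        self.name = name
        self.model_config = model_config

    @property
    def voice_native(self) -> bool:
        # An omni model that emits audio makes the same agent voice-native.
        return "audio" in self.model_config.get("modalities", [])

text_agent = ReActAgentSketch(
    "assistant", {"model": "qwen-max", "modalities": ["text"]})
omni_agent = ReActAgentSketch(
    "assistant", {"model": "qwen3-omni", "modalities": ["text", "audio"]})
print(text_agent.voice_native, omni_agent.voice_native)  # False True
```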
Stage 3. Realtime
Realtime Agent | Multi-Agent Realtime Conversation
To build a truly production-ready voice agent, we must move beyond the traditional turn-based ReAct loop. In AgentScope, the leap to Realtime is about architecting a system capable of bidirectional, continuous, and multi-agent communication.
Our goal for the Realtime stage is clear: to combine the fluidity of streaming audio with the core strengths of AgentScope—multi-agent collaboration, tool-use, and robust deployment.
The Realtime Architecture: A Three-Layer Abstraction
We have dismantled the monolithic agent structure in favor of a layered, event-driven architecture. This ensures that the agent remains responsive even while performing complex reasoning or tool-calling.
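The queue-based hand-off between these layers can be condensed into an asyncio sketch. The class and event names mirror the design, but the implementation below is illustrative, not the AgentScope source:

```python
# Condensed sketch of the layered flow: server -> agent -> model and back,
# decoupled by queues so each layer stays responsive independently.

import asyncio

async def model_layer(inbox: asyncio.Queue, response_queue: asyncio.Queue):
    # RealtimeModelBase: turn an incoming block into a model event.
    # (A real implementation parses provider-specific frames.)
    block = await inbox.get()
    await response_queue.put({"type": "ResponseAudioDelta",
                              "data": block["data"]})

async def agent_layer(frontend_queue: asyncio.Queue,
                      response_queue: asyncio.Queue):
    # RealtimeAgentBase: wrap the model event with agent metadata
    # before it goes back to the server layer.
    event = await response_queue.get()
    await frontend_queue.put({"type": "AgentResponseAudioDelta",
                              "agent": "realtime_agent",
                              "data": event["data"]})

async def main():
    frontend_queue = asyncio.Queue()
    inbox = asyncio.Queue()
    response_queue = asyncio.Queue()
    await inbox.put({"type": "AudioBlock", "data": b"pcm..."})
    await asyncio.gather(model_layer(inbox, response_queue),
                         agent_layer(frontend_queue, response_queue))
    return await frontend_queue.get()

print(asyncio.run(main()))
```

Because each layer only reads from and writes to queues, a slow tool call in the agent layer does not block audio frames moving through the model layer.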
```mermaid
sequenceDiagram
    participant U as User (Browser)
    participant S as FastAPI Server
    participant A as RealtimeAgentBase
    participant M as RealtimeModelBase
    participant API as Cloud AI Provider
    Note over U, API: Connection Phase
    U->>S: WebSocket Connection
    S->>A: start(frontend_queue)
    A->>M: connect(response_queue)
    M->>API: Establish WebSocket & Send Session Config
    Note over U, API: Interaction Phase (Input)
    U->>S: Send Input (JSON)
    S->>A: handle_input(event)
    A->>A: _forward_loop: convert event to Block
    A->>M: send(AudioBlock / TextBlock)
    M->>API: Send API-specific Frame
    Note over U, API: Interaction Phase (Streaming Response)
    API-->>M: Raw API Message (Stream)
    M->>M: parse_api_message(msg)
    M-->>A: ModelEvents.ResponseAudioDeltaEvent
    A->>A: _model_response_loop: wrap with Agent Metadata
    A-->>S: ServerEvents.AgentResponseAudioDeltaEvent
    S-->>U: Send JSON via WebSocket
    Note over U, API: Closing Phase
    U->>S: WebSocket Disconnect
    S->>A: stop()
    A->>M: disconnect()
    M->>API: Close Connection
```

Scaling to Multi-Agent: The ChatRoom Concept
While a single-agent chatbot is the baseline, AgentScope’s philosophy has always centered on Multi-Agent Systems (MAS). To bring this to the voice domain, we introduced the ChatRoom—an interaction layer that acts as a central hub for broadcasting and routing messages between multiple voice agents.
The ChatRoom handles the complexities of a multi-way dialogue. Currently, this architecture enables sophisticated scenarios like multi-agent debates or collaborative problem-solving. While current APIs still struggle with speaker diarization (distinguishing who is speaking in a single audio stream), the ChatRoom framework is already built to scale as those underlying capabilities mature.

Go further: Multi-Agent realtime debates in English & Chinese
🤖 AI Threat v.s. AI Optimism (en)
multi_agent_realtime_voice.mp4
👥 人性本恶 v.s. 人性本善 (zh)
multi_agent_realtime_voice_zh.mp4
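The broadcast-and-route role of the ChatRoom in the debates above can be sketched minimally. This is an illustration of the concept, not the AgentScope ChatRoom API:

```python
# Sketch: a hub that fans each utterance out to every other participant,
# while each agent keeps its own private context.

class VoiceAgentStub:
    """Hypothetical participant; stores what it 'hears' as text turns."""
    def __init__(self, name: str):
        self.name = name
        self.context = []          # per-agent conversation memory

    def hear(self, speaker: str, text: str):
        self.context.append((speaker, text))

class ChatRoomSketch:
    def __init__(self):
        self.agents = []

    def join(self, agent: VoiceAgentStub):
        self.agents.append(agent)

    def broadcast(self, speaker: str, text: str):
        # Route the utterance to everyone except the speaker.
        for a in self.agents:
            if a.name != speaker:
                a.hear(speaker, text)

room = ChatRoomSketch()
alice, bob = VoiceAgentStub("alice"), VoiceAgentStub("bob")
room.join(alice)
room.join(bob)
room.broadcast("alice", "I think AI risk is overstated.")
print(len(alice.context), len(bob.context))  # 0 1
```

In a real deployment the hard part is upstream of this hub: attributing a shared audio stream to the right `speaker`, which is exactly the diarization limitation noted above.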
The Unsolved Frontiers: Memory and Continuity
Despite these architectural leaps, realtime voice agents still face significant production hurdles that we are actively working to solve.
The Path Forward
At AgentScope, we view the development of realtime Agents not as a static feature, but as a rapidly evolving frontier. The landscape of voice-native AI is shifting beneath our feet—from new breakthroughs in audio-token quantization to more robust speaker diarization models—and we are committed to staying at the forefront of these changes.
However, building the future of "Expressive Agency" is not a journey we intend to take alone.
The challenges we have outlined—from managing asynchronous state in realtime streams to architecting multi-agent "ChatRooms"—require diverse perspectives and collective experimentation. We are deeply invested in the open-source evolution of this field, and we invite the community to join us.
How to get involved:
We believe that the next few years will fundamentally redefine how humans and agents coexist. We are excited to see what you build with AgentScope, and we look forward to navigating this fluid, realtime future together.