Brings Voices to AgentScope #1194
Replies: 2 comments 1 reply
The layering described here is one of the more compelling parts of the post because it acknowledges that realtime voice agents are as much a systems problem as a model problem. The part that seems most important for production is not just latency, but interrupt semantics, backpressure, and how state survives partial failures during long sessions.
Excellent roadmap — the three-stage framing (TTS → Omni → Realtime) is a useful mental model for communicating production trade-offs to teams evaluating voice agents. The point raised about "Production-Grade Deployment" is worth expanding, especially for teams targeting telephony (phone calls over SIP/PSTN) as a deployment channel rather than browser-based WebSocket audio, which brings additional production challenges of its own.
On the "Memory and Continuity" challenge — one practical pattern for telephony is to use a separate media-processing layer (SIP stack / CPaaS) to handle RTP, codecs, and VAD, and expose a clean text-only interface to the agent. This separates the realtime audio concerns from the agent's context-window management. VoIPBin is an open-source CPaaS built around this model: RTP/STT/TTS are handled server-side, while the agent logic works in pure text over REST/WebSocket.

The ChatRoom multi-agent concept is genuinely interesting for conference-call use cases — multiple agents participating in a call, each maintaining its own context, is a hard problem that most current frameworks do not address well.

(Disclosure: I work on VoIPBin, but the telephony notes above apply regardless of which stack you use.)
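The separation the comment describes can be sketched as follows. Every name here is a hypothetical stand-in, not the VoIPBin API: the point is only that the media layer owns audio (RTP, VAD, STT) while the agent sees nothing but text.

```python
# Illustrative sketch: a media layer that converts telephony audio to text,
# and an agent that works purely on text. All class names are hypothetical.

class MediaLayer:
    """Stands in for the CPaaS side: audio in, transcript out."""
    def transcribe(self, rtp_frames) -> str:
        # A real system would run jitter buffering, VAD, and STT here;
        # we fake a finished transcript for illustration.
        return "caller: what are your opening hours?"

class TextOnlyAgent:
    """The agent never touches audio; it sees only clean text turns."""
    def reply(self, transcript: str) -> str:
        if "opening hours" in transcript:
            return "We are open 9am to 5pm, Monday to Friday."
        return "Could you repeat that?"

media = MediaLayer()
agent = TextOnlyAgent()
text_in = media.transcribe(rtp_frames=[])   # media layer owns codecs/VAD
text_out = agent.reply(text_in)             # agent logic is pure text
print(text_out)
```

The design benefit is that the agent's context window contains only transcripts, so realtime audio failures (packet loss, codec renegotiation) never corrupt agent state.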
By Bingchen Qian and Dawei Gao, Jan 29, 2026
Voice interaction represents a significant shift in how we conceptualize and build agent systems. In this post, we’ll share our vision for voice agents within AgentScope, detailing our design philosophy, current progress, and the roadmap toward truly interactive, realtime AI.
The Voice Agent
Voice agents are a fantastic direction. Compared with our current text-based ReAct agent, they offer a far richer user interaction, bringing tone and attitude into the conversation. And thanks to rapid progress in voice model APIs, we can now actually build them.
Our Vision
At the core of AgentScope is a fundamental belief: an agent should be an independent entity that exists autonomously from the environment or specific applications, rather than a task-bound script. This philosophy is the North Star for our voice agents.
To realize this, we are doubling down on three key areas, even where current technology faces significant hurdles.
Roadmap
Rome Wasn't Built in a Day
Designing a voice agent is a study in managing trade-offs. While the long-term goal is seamless, human-like interaction, the reality is that different technical architectures offer vastly different levels of production-readiness.
We have structured the evolution of voice in AgentScope into three distinct stages: Text-to-Speech (TTS), Omni models, and Realtime.
Each stage in this progression represents a fundamental shift in how the agent handles the information loop—moving from traditional asynchronous synthesis toward the eventual goal of bidirectional, realtime streaming.
Stage 1. Text-to-Speech
While "omni" models represent the future, Text-to-Speech (TTS) remains the most robust and production-ready foundation for voice agnets today. However, we still face the following bottlenecks in production:
For latency, we integrate realtime TTS models (such as DashScope's realtime TTS API). By feeding the LLM's text stream directly into the TTS engine as it is generated, we drastically reduce the "Time to First Sound". This ensures that the agent begins speaking almost as soon as it begins thinking, maintaining the cadence of a natural conversation.
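The stream-fed pattern can be sketched as follows, assuming a hypothetical realtime TTS session object with a `send_text()` method (vendor APIs differ in the details; this is not the DashScope interface):

```python
# Sketch: forward each LLM text delta to the TTS session as it arrives,
# instead of buffering the full response. "Time to First Sound" then tracks
# the first delta, not the whole generation.

import time

class FakeRealtimeTTS:
    """Hypothetical stand-in for a streaming TTS session."""
    def __init__(self):
        self.first_sound_at = None
        self.audio_chunks = []

    def send_text(self, delta: str):
        # A real session starts synthesizing audio as soon as text arrives.
        if self.first_sound_at is None:
            self.first_sound_at = time.monotonic()
        self.audio_chunks.append(f"<audio:{delta}>")

def llm_stream():
    # Stand-in for an LLM token/delta stream.
    yield from ["Hello", " there", ", how", " can I help?"]

tts = FakeRealtimeTTS()
start = time.monotonic()
for delta in llm_stream():
    tts.send_text(delta)            # forward immediately: no buffering
ttfs = tts.first_sound_at - start   # ~ latency of the first delta only
print(len(tts.audio_chunks), ttfs >= 0)
```

The key property is that `first_sound_at` is set after the first delta, so the user hears audio while the rest of the response is still being generated.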
For context-filtering, we tested OpenAI, Gemini, and DashScope TTS APIs. Some offer basic filtering—markdown code blocks get caught, for instance. But it's inconsistent. What we need is configurable pre-processing before text hits the TTS layer. Filter code blocks. Strip metadata. Handle structured outputs intelligently.
We're building that filtering layer now (issue).
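As an illustration of what such a pre-processing layer might do, here is a minimal rule-based filter. The rules and function names are hypothetical, not the actual AgentScope implementation:

```python
# Sketch of configurable TTS pre-processing: strip markdown constructs
# that should never be read aloud. Rules run in order; fenced code blocks
# must be removed before inline-code unwrapping.

import re

FILTERS = [
    (re.compile(r"```.*?```", re.S), " "),          # drop fenced code blocks
    (re.compile(r"`([^`]+)`"), r"\1"),              # inline code -> plain text
    (re.compile(r"^#{1,6}\s*", re.M), ""),          # strip heading markers
    (re.compile(r"\[([^\]]+)\]\([^)]+\)"), r"\1"),  # links -> link text only
]

def tts_preprocess(text: str) -> str:
    for pattern, repl in FILTERS:
        text = pattern.sub(repl, text)
    # Collapse leftover whitespace so the TTS cadence stays natural.
    return re.sub(r"\s+", " ", text).strip()

sample = ("# Result\nRun `pip install agentscope`:\n"
          "```bash\npip install agentscope\n```\n"
          "See [docs](https://example.com).")
print(tts_preprocess(sample))
# -> Result Run pip install agentscope: See docs.
```

A production version would make the rule list configurable per agent, which is exactly the gap the filtering layer above is meant to close.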
Try it yourself:
Go further: Building a multi-agent werewolf game with TTS capabilities (Source Code).
werewolf_voice_agent.mp4
Stage 2. Omni Models
Compared to TTS, omni models (such as gpt-audio, Qwen3-omni) offer end-to-end multimodal understanding. They don't just "read" text; they perceive and generate audio natively, capturing the emotional prosody and subtle cues that are often lost in translation.
We have integrated these premier Omni models into the existing ReAct agent ecosystem. For developers, this means that moving from a text-based agent to a voice-native agent is as simple as a configuration change. Our implementation ensures that the Omni-powered ReActAgent retains its full suite of advanced capabilities: realtime interruption, high-level reasoning (planning, RAG, and tool use), and short/long-term memory.
Just initialize an Omni agent with the following snippet:
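The original snippet is not reproduced here, so the following is a hypothetical sketch of the claim that voice-enablement is just a configuration change: same agent class, different model config. The class and config keys are illustrative, not the real AgentScope API.

```python
# Illustrative only: swapping a text model config for an omni model config
# turns the same ReAct-style agent voice-native, with no code changes.

class ReActAgentSketch:
    """Hypothetical stand-in for a ReAct agent parameterized by model config."""
    def __init__(self, name: str, model_config: dict):
        self.name = name
        self.model_config = model_config

    @property
    def voice_native(self) -> bool:
        # An omni model that emits audio makes the same agent voice-native.
        return "audio" in self.model_config.get("modalities", [])

text_agent = ReActAgentSketch(
    "assistant", {"model": "qwen-max", "modalities": ["text"]})
omni_agent = ReActAgentSketch(
    "assistant", {"model": "qwen3-omni", "modalities": ["text", "audio"]})
print(text_agent.voice_native, omni_agent.voice_native)  # False True
```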
Stage 3. Realtime
Realtime Agent | Multi-Agent Realtime Conversation
To build a truly production-ready voice agent, we must move beyond the traditional turn-based ReAct loop. In AgentScope, the leap to Realtime is about architecting a system capable of bidirectional, continuous, and multi-agent communication.
Our goal for the Realtime stage is clear: to combine the fluidity of streaming audio with the core strengths of AgentScope—multi-agent collaboration, tool-use, and robust deployment.
The Realtime Architecture: A Three-Layer Abstraction
We have dismantled the monolithic agent structure in favor of a layered, event-driven architecture. This ensures that the agent remains responsive even while performing complex reasoning or tool-calling.
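The queue-based hand-off between these layers can be condensed into an asyncio sketch. The class and event names mirror the design, but the implementation below is illustrative, not the AgentScope source:

```python
# Condensed sketch of the layered flow: server -> agent -> model and back,
# decoupled by queues so each layer stays responsive independently.

import asyncio

async def model_layer(inbox: asyncio.Queue, response_queue: asyncio.Queue):
    # RealtimeModelBase: turn an incoming block into a model event.
    # (A real implementation parses provider-specific frames.)
    block = await inbox.get()
    await response_queue.put({"type": "ResponseAudioDelta",
                              "data": block["data"]})

async def agent_layer(frontend_queue: asyncio.Queue,
                      response_queue: asyncio.Queue):
    # RealtimeAgentBase: wrap the model event with agent metadata
    # before it goes back to the server layer.
    event = await response_queue.get()
    await frontend_queue.put({"type": "AgentResponseAudioDelta",
                              "agent": "realtime_agent",
                              "data": event["data"]})

async def main():
    frontend_queue = asyncio.Queue()
    inbox = asyncio.Queue()
    response_queue = asyncio.Queue()
    await inbox.put({"type": "AudioBlock", "data": b"pcm..."})
    await asyncio.gather(model_layer(inbox, response_queue),
                         agent_layer(frontend_queue, response_queue))
    return await frontend_queue.get()

print(asyncio.run(main()))
```

Because each layer only reads from and writes to queues, a slow tool call in the agent layer does not block audio frames moving through the model layer.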
```mermaid
sequenceDiagram
    participant U as User (Browser)
    participant S as FastAPI Server
    participant A as RealtimeAgentBase
    participant M as RealtimeModelBase
    participant API as Cloud AI Provider
    Note over U, API: Connection Phase
    U->>S: WebSocket Connection
    S->>A: start(frontend_queue)
    A->>M: connect(response_queue)
    M->>API: Establish WebSocket & Send Session Config
    Note over U, API: Interaction Phase (Input)
    U->>S: Send Input (JSON)
    S->>A: handle_input(event)
    A->>A: _forward_loop: convert event to Block
    A->>M: send(AudioBlock / TextBlock)
    M->>API: Send API-specific Frame
    Note over U, API: Interaction Phase (Streaming Response)
    API-->>M: Raw API Message (Stream)
    M->>M: parse_api_message(msg)
    M-->>A: ModelEvents.ResponseAudioDeltaEvent
    A->>A: _model_response_loop: wrap with Agent Metadata
    A-->>S: ServerEvents.AgentResponseAudioDeltaEvent
    S-->>U: Send JSON via WebSocket
    Note over U, API: Closing Phase
    U->>S: WebSocket Disconnect
    S->>A: stop()
    A->>M: disconnect()
    M->>API: Close Connection
```

Scaling to Multi-Agent: The ChatRoom Concept
While a single-agent chatbot is the baseline, AgentScope’s philosophy has always centered on Multi-Agent Systems (MAS). To bring this to the voice domain, we introduced the ChatRoom—an interaction layer that acts as a central hub for broadcasting and routing messages between multiple voice agents.
The ChatRoom handles the complexities of a multi-way dialogue. Currently, this architecture enables sophisticated scenarios like multi-agent debates or collaborative problem-solving. While current APIs still struggle with speaker diarization (distinguishing who is speaking in a single audio stream), the ChatRoom framework is already built to scale as those underlying capabilities mature.

Go further: Multi-Agent realtime debates in English & Chinese
🤖 AI Threat v.s. AI Optimism (en)
multi_agent_realtime_voice.mp4
👥 人性本恶 v.s. 人性本善 (zh)
multi_agent_realtime_voice_zh.mp4
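The broadcast-and-route role of the ChatRoom in the debates above can be sketched minimally. This is an illustration of the concept, not the AgentScope ChatRoom API:

```python
# Sketch: a hub that fans each utterance out to every other participant,
# while each agent keeps its own private context.

class VoiceAgentStub:
    """Hypothetical participant; stores what it 'hears' as text turns."""
    def __init__(self, name: str):
        self.name = name
        self.context = []          # per-agent conversation memory

    def hear(self, speaker: str, text: str):
        self.context.append((speaker, text))

class ChatRoomSketch:
    def __init__(self):
        self.agents = []

    def join(self, agent: VoiceAgentStub):
        self.agents.append(agent)

    def broadcast(self, speaker: str, text: str):
        # Route the utterance to everyone except the speaker.
        for a in self.agents:
            if a.name != speaker:
                a.hear(speaker, text)

room = ChatRoomSketch()
alice, bob = VoiceAgentStub("alice"), VoiceAgentStub("bob")
room.join(alice)
room.join(bob)
room.broadcast("alice", "I think AI risk is overstated.")
print(len(alice.context), len(bob.context))  # 0 1
```

In a real deployment the hard part is upstream of this hub: attributing a shared audio stream to the right `speaker`, which is exactly the diarization limitation noted above.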
The Unsolved Frontiers: Memory and Continuity
Despite these architectural leaps, realtime voice agents still face significant production hurdles that we are actively working to solve.
The Path Forward
At AgentScope, we view the development of realtime Agents not as a static feature, but as a rapidly evolving frontier. The landscape of voice-native AI is shifting beneath our feet—from new breakthroughs in audio-token quantization to more robust speaker diarization models—and we are committed to staying at the forefront of these changes.
However, building the future of "Expressive Agency" is not a journey we intend to take alone.
The challenges we have outlined—from managing asynchronous state in realtime streams to architecting multi-agent "ChatRooms"—require diverse perspectives and collective experimentation. We are deeply invested in the open-source evolution of this field, and we invite the community to join us.
How to get involved:
We believe that the next few years will fundamentally redefine how humans and agents coexist. We are excited to see what you build with AgentScope, and we look forward to navigating this fluid, realtime future together.