AgentSteerTTS

A Multi-Agent Closed-Loop Steering Framework for Intent-Consistent Expressive Text-to-Speech

Bin Kang [1,2,3], Shaoguo Wen [3], Yang Fan [2], Shunlong Wu [4], Junjie Wang [2], Yulin Li [2], Junzhi Zhao [5], Junle Wang [3], Zhuotao Tian [2,*]
[1] University of Chinese Academy of Sciences, Beijing, China
[2] Shenzhen Loop Area Institute, Shenzhen, China
[3] Tencent Turinglab, Shenzhen, China
[4] Tsinghua University, Beijing, China
[5] Southwest Jiaotong University, Chengdu, China
[*] Corresponding author
ICML 2026 Project Page

Abstract

While current TTS models can be highly expressive, precise control over fine-grained composite instructions remains difficult due to the structural mismatch between discrete textual intents and continuous acoustic realizations. Inspired by human cognitive decoupling, we propose AgentSteerTTS, a multi-agent, closed-loop framework for intent-faithful emotional steering.

We first introduce an adversarial disentanglement agent that mitigates speaker-emotion leakage by learning separable identity and emotion-prosody subspaces with leakage-suppressing regularization. We then introduce a Dual-Stream Anchoring Controller, in which a Retrieval Agent grounds abstract intent in a large-scale, purpose-built library of fine-grained acoustic prototypes, while a Synthesis Agent transforms these cues into continuous control vectors through gated fusion. Finally, we implement a fast-slow feedback mechanism in which a Fast Control Agent performs rapid latent gradient correction for intensity calibration, and a Supervisor Agent provides high-level perceptual critique to rectify semantic-acoustic discrepancies.
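To make the retrieve-then-fuse idea concrete, here is a minimal sketch in plain Python. It is not the released implementation: the cosine-similarity lookup, the per-dimension sigmoid gate, the function names (`retrieve`, `gated_fusion`), and the 3-dimensional toy vectors are all assumptions chosen for illustration. In a trained system the gate logits would be predicted by a network; here they are passed in by hand.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(intent, library, k=1):
    """Return the names of the k prototypes most similar to the intent vector."""
    ranked = sorted(library, key=lambda name: cosine(intent, library[name]), reverse=True)
    return ranked[:k]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(intent, prototype, gate_logits):
    """Blend per dimension: g * prototype + (1 - g) * intent, with g = sigmoid(logit)."""
    gates = [sigmoid(z) for z in gate_logits]
    return [g * p + (1.0 - g) * i for g, p, i in zip(gates, prototype, intent)]

# Hypothetical toy prototype library (names and values are made up).
library = {
    "happy_arrogant": [0.9, 0.1, 0.4],
    "sad_hopeful": [0.1, 0.9, 0.3],
}
intent = [1.0, 0.0, 0.5]  # assumed embedding of a textual instruction

best = retrieve(intent, library, k=1)[0]
# Zero logits give a gate of 0.5, i.e. an even blend of intent and prototype.
control = gated_fusion(intent, library[best], gate_logits=[0.0, 0.0, 0.0])
print(best, control)
```

The gate lets each dimension of the control vector lean toward the retrieved acoustic prototype or toward the raw intent embedding, which is one common way to combine a discrete cue with a continuous one.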

Method Overview

Figure 1: AgentSteerTTS Framework. Our method consists of four key components: (1) an Adversarial Disentanglement Module (ADM) that reduces speaker-emotion leakage by factorizing identity and emotion-prosody representations; (2) a Retrieval Agent that retrieves high-expressivity prototypes from a large-scale acoustic library with perceptual pruning; (3) a Synthesis Agent that transforms discrete intents into continuous control vectors via a gated fusion mechanism; and (4) a Fast-Slow Feedback mechanism for hierarchical correction of semantic-acoustic alignment.
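The fast-slow feedback component can be pictured as two nested loops: a cheap inner loop that nudges a latent vector by gradient descent, and a slower outer check that decides whether the result is acceptable. The sketch below is a toy under stated assumptions, not the paper's algorithm: the intensity predictor is a hypothetical linear map `w · z`, the loss is a squared error, and `threshold_supervisor` is a numeric stand-in for the perceptual critique a real Supervisor Agent would provide.

```python
def predicted_intensity(z, w):
    """Hypothetical differentiable intensity predictor: a plain dot product."""
    return sum(wi * zi for wi, zi in zip(w, z))

def fast_correct(z, w, target, lr=0.1, steps=50):
    """Fast loop: gradient descent on (w . z - target)^2 with respect to z."""
    z = list(z)  # copy so the caller's latent is not mutated
    for _ in range(steps):
        err = predicted_intensity(z, w) - target
        # Gradient of err^2 with respect to z is 2 * err * w.
        z = [zi - lr * 2.0 * err * wi for zi, wi in zip(z, w)]
    return z

def threshold_supervisor(intensity, target, tol=0.01):
    """Stand-in critic: accept when intensity is within tol of the target."""
    return abs(intensity - target) <= tol, target

def closed_loop(z, w, target, supervisor, max_rounds=3):
    """Slow loop: alternate fast correction with a supervisor check."""
    for _ in range(max_rounds):
        z = fast_correct(z, w, target)
        accepted, target = supervisor(predicted_intensity(z, w), target)
        if accepted:
            break
    return z

w = [0.5, 0.5]    # assumed predictor weights
z0 = [0.0, 0.0]   # assumed initial latent
z_final = closed_loop(z0, w, target=0.6, supervisor=threshold_supervisor)
print(predicted_intensity(z_final, w))
```

The supervisor returns a possibly revised target, so a richer critic could change what the fast loop aims for on the next round rather than only accepting or rejecting.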

Research Motivation

Our pilot study reveals that existing TTS models exhibit systematic biases when handling composite emotional instructions, failing to faithfully express multi-dimensional emotional intents.

Composite Emotion Control Analysis

Figure 2: Semantic-Acoustic Misalignment in Existing Methods. Given composite instructions (e.g., "Happy but slightly Arrogant"), state-of-the-art models deviate significantly from the target emotion profile:
  • Target Suppression: expression of the target emotion drops by 25%-45%
  • Energy Leakage: +0.08 average leakage into irrelevant dimensions
  • Joint Satisfaction: only about 30% for composite instructions
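The page does not give formulas for these three quantities, so the sketch below shows one plausible way to compute them from a target emotion profile and a measured one. The definitions (mean relative shortfall, mean off-target score, an 80% threshold for joint satisfaction), the function names, and every number in the example are assumptions for illustration, not the paper's evaluation protocol or its data.

```python
def target_suppression(target, measured):
    """Mean relative shortfall on the dimensions the instruction asks for."""
    drops = [
        max(0.0, (t - measured.get(k, 0.0)) / t)
        for k, t in target.items()
        if t > 0
    ]
    return sum(drops) / len(drops)

def leakage(target, measured):
    """Mean score on dimensions the instruction did not ask for."""
    off = [v for k, v in measured.items() if target.get(k, 0.0) == 0.0]
    return sum(off) / len(off) if off else 0.0

def jointly_satisfied(target, measured, tol=0.8):
    """True if every requested dimension reaches tol times its target level."""
    return all(measured.get(k, 0.0) >= tol * t for k, t in target.items() if t > 0)

# Made-up profile for "Happy but slightly Arrogant" (toy values only).
target = {"happy": 0.8, "arrogant": 0.4}
measured = {"happy": 0.6, "arrogant": 0.2, "sad": 0.1, "angry": 0.2}

print(target_suppression(target, measured))
print(leakage(target, measured))
print(jointly_satisfied(target, measured))
```

Under these assumed definitions, a joint satisfaction rate would be the fraction of test instructions for which `jointly_satisfied` returns true.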

Audio Demonstrations

We present side-by-side audio samples that make the comparison immediate: same content and speaker condition, different emotional controllability.

1. Composite control

Fine-grained mixed instructions test whether a model can preserve both dominant and secondary emotional cues.

2. Single emotions

The updated ICML sample set compares neutral speech and five basic emotions across IndexTTS2, CosyVoice2, and our method.

3. Intensity steering

Low, medium, and high intensity samples demonstrate controllable acoustic strength under fixed semantics.

Part 1: Composite Emotion Control

Complex instructions that combine multiple emotional attributes; handling them is our core contribution.

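Reference Audio: Speaker Reference
Audio columns: IndexTTS2, CosyVoice2, AgentSteerTTS (Ours)

Composite Instruction | Text Content | English translation
Happy but slightly Arrogant | 我早就知道会是这个结果,这种事对我来说太简单了! | "I knew all along it would turn out this way; this kind of thing is just too easy for me!"
Sad but with a hint of Hope | 虽然现在很难过,但我相信明天会更好的。 | "I'm very sad right now, but I believe tomorrow will be better."
Angry but Restrained | 我很生气,但我会控制好自己的情绪,你给我记住。 | "I'm very angry, but I will keep my emotions under control. Just you remember this."
Surprised and slightly Fearful | 那个声音是什么?突然吓了我一跳! | "What was that sound? It startled me all of a sudden!"
Disgusted with underlying Anger | 这种行为真是太让人恶心了,我无法容忍! | "This kind of behavior is truly disgusting; I can't tolerate it!"
Happy and Excited | 天哪,我中奖了!太不可思议了,太开心了! | "Oh my god, I won the prize! It's unbelievable, I'm so happy!"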

Part 2: Single Emotion Benchmark Set

Each row uses the same transcribed text prompt across IndexTTS2, CosyVoice2, and AgentSteerTTS, making the emotional delivery differences easier to compare.
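Shared speaker reference
科技的发展日新月异,人工智能正在改变我们的生活方式。从智能手机到自动驾驶,每一项创新都让世界变得更加便捷和高效。
(English: "Technology is advancing by the day, and artificial intelligence is changing the way we live. From smartphones to self-driving cars, every innovation makes the world more convenient and efficient.")

Sad (Negative)
她默默地站在窗前,看着窗外的细雨,回想起那些再也回不去的美好时光,眼眶不禁湿润了。
(English: "She stood silently at the window, watching the drizzle outside and recalling the beautiful days that can never return, and her eyes welled up with tears.")
Audio: IndexTTS2, CosyVoice2, AgentSteerTTS

Angry (High arousal)
你怎么能这样做?我已经反复强调过很多次了,你根本就没有把我的话当回事。
(English: "How could you do this? I have stressed it over and over, and you simply never took my words seriously.")
Audio: IndexTTS2, CosyVoice2, AgentSteerTTS

Fearful (Tense)
不要过来,求求你不要过来,我听到门外有奇怪的脚步声,越来越近了。
(English: "Don't come any closer, I'm begging you, don't come closer. I hear strange footsteps outside the door, and they're getting nearer.")
Audio: IndexTTS2, CosyVoice2, AgentSteerTTS

Surprised (Reactive)
什么?你说的是真的吗?我简直不敢相信自己的耳朵,这也太意外了吧。
(English: "What? Is that really true? I can hardly believe my ears; this is such a surprise.")
Audio: IndexTTS2, CosyVoice2, AgentSteerTTS

Neutral (Baseline)
今天的天气预报显示,明天白天多云转晴,气温在15-25度之间,适合户外活动。
(English: "Today's weather forecast says tomorrow will turn from cloudy to sunny during the day, with temperatures between 15 and 25 degrees, suitable for outdoor activities.")
Audio: IndexTTS2, CosyVoice2, AgentSteerTTS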

Part 3: Emotion Intensity Control

Fine-grained control over emotion intensity levels. Same text, different intensity.

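Emotion | Text | Low (30%) | Medium (60%) | High (100%)
😊 Happy | 这个消息真是太好了。 ("This news is really wonderful.") | Smiling | Happy | Ecstatic
😠 Angry | 你这样做是不对的。 ("What you are doing is wrong.") | Displeased | Angry | Furious
😢 Sad | 这件事让我很难接受。 ("This is very hard for me to accept.") | Dejected | Sad | Grief-stricken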
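One simple way to think about the three intensity levels is as a scalar that moves a control vector from a neutral setting toward a full-strength emotion setting. The sketch below uses linear interpolation with the 30%, 60%, and 100% levels from the table. The interpolation rule, the function name `scale_intensity`, and the 2-dimensional toy vectors are assumptions; the page does not state how AgentSteerTTS maps an intensity level to its control signal.

```python
def scale_intensity(neutral, emotion, alpha):
    """Interpolate from a neutral control vector toward an emotion vector.

    alpha = 0 returns the neutral vector, alpha = 1 the full emotion vector.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return [n + alpha * (e - n) for n, e in zip(neutral, emotion)]

# Intensity levels taken from the table; the vectors are made-up toy values.
levels = {"low": 0.3, "medium": 0.6, "high": 1.0}
neutral = [0.0, 0.0]
happy = [1.0, 2.0]

for name, alpha in levels.items():
    print(name, scale_intensity(neutral, happy, alpha))
```

A linear map is only a starting point: perceived intensity is rarely linear in any control parameter, which is the kind of mismatch a calibration step is meant to correct.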

Visualization Results

Attribute energy allocation analysis shows how AgentSteerTTS concentrates energy on the target dimensions while reducing leakage.

"Angry but Restrained"

Figure 3a: AgentSteerTTS precisely controls the intensity balance between "Angry" and "Restrained", achieving nuanced expression that baseline models cannot replicate.

"Sad but Hopeful"

Figure 3b: For the challenging "Sad but Hopeful" combination, AgentSteerTTS successfully maintains both conflicting emotions.