AgentSteerTTS

A Multi-Agent Closed-Loop Steering Framework for Intent-Consistent Expressive Text-to-Speech

Bin Kang¹,²,³, Shaoguo Wen³, Yang Fan², Shunlong Wu⁴, Junjie Wang², Yulin Li², Junzhi Zhao⁵, Junle Wang³, Zhuotao Tian²,*

¹University of Chinese Academy of Sciences, Beijing, China
²Shenzhen Loop Area Institute, Shenzhen, China
³Tencent Turinglab, Shenzhen, China
⁴Tsinghua University, Beijing, China
⁵Southwest Jiaotong University, Chengdu, China
*Corresponding author

ICML 2026 Project Page

Abstract

Current TTS models can be highly expressive, but they still struggle to follow fine-grained composite instructions precisely, because discrete textual intents are structurally mismatched with continuous acoustic realizations. Inspired by human cognitive decoupling, we propose AgentSteerTTS, a multi-agent, closed-loop framework for intent-faithful emotional steering.

First, an adversarial disentanglement agent mitigates speaker-emotion leakage by learning separable identity and emotion-prosody subspaces with leakage-suppressing regularization. Next, a Dual-Stream Anchoring Controller pairs two agents: a Retrieval Agent grounds abstract intent in a large-scale library of fine-grained acoustic prototypes, and a Synthesis Agent transforms these cues into continuous control vectors through gated fusion. Finally, a fast-slow feedback mechanism closes the loop: a Fast Control Agent performs rapid latent gradient correction for intensity calibration, while a Supervisor Agent provides high-level perceptual critique to rectify semantic-acoustic discrepancies.
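To make the gated fusion step concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function name `gated_fusion`, the per-dimension sigmoid gate, and the linear gate parameters `gate_weights` and `gate_bias` are all hypothetical, and the page does not specify the actual formulation.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_fusion(intent_vec, prototype_vec, gate_weights, gate_bias):
    """Blend an intent embedding with a retrieved acoustic prototype.

    Assumed formulation (not taken from the paper): for each dimension d,
    a gate g_d in (0, 1) is computed from the concatenated inputs, and the
    control value is g_d * prototype_d + (1 - g_d) * intent_d.
    """
    if len(intent_vec) != len(prototype_vec):
        raise ValueError("intent and prototype must have the same dimension")
    concat = list(intent_vec) + list(prototype_vec)
    control = []
    for d, (i_d, p_d) in enumerate(zip(intent_vec, prototype_vec)):
        logit = sum(w * x for w, x in zip(gate_weights[d], concat)) + gate_bias[d]
        g = sigmoid(logit)
        control.append(g * p_d + (1.0 - g) * i_d)
    return control


# Toy usage: zero gate parameters give g = 0.5, i.e. an even blend.
control = gated_fusion(
    intent_vec=[1.0, 0.0],
    prototype_vec=[0.0, 1.0],
    gate_weights=[[0.0] * 4, [0.0] * 4],
    gate_bias=[0.0, 0.0],
)
```

In this sketch a large positive gate bias pushes the control vector toward the retrieved prototype, and a large negative bias pushes it toward the raw intent embedding.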

Method Overview

Figure 1: AgentSteerTTS Framework. Our method consists of four key components: (1) an Adversarial Disentanglement Module (ADM) that reduces speaker-emotion leakage by factorizing identity and emotion-prosody representations; (2) a Retrieval Agent that retrieves high-expressivity prototypes from a large-scale acoustic library with perceptual pruning; (3) a Synthesis Agent that transforms discrete intents into continuous control vectors via a gated fusion mechanism; (4) a Fast-Slow Feedback mechanism for hierarchical correction of semantic-acoustic alignment.
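The fast-slow feedback idea can be sketched as two nested loops. The code below is a toy under loud assumptions: the intensity estimator is a made-up sigmoid of a linear probe, the fast loop is plain gradient descent on a squared error, and the supervisor is reduced to a tolerance check. All names (`estimate_intensity`, `fast_correct`, `closed_loop`) are hypothetical; the real agents presumably operate on TTS latents and perceptual critiques.

```python
import math


def estimate_intensity(latent, probe):
    """Toy differentiable intensity estimator (assumption): sigmoid(probe . latent)."""
    logit = sum(p * z for p, z in zip(probe, latent))
    return 1.0 / (1.0 + math.exp(-logit))


def fast_correct(latent, probe, target, step_size=1.0, steps=20):
    """Fast loop: gradient steps on 0.5 * (intensity - target)^2 w.r.t. the latent."""
    z = list(latent)
    for _ in range(steps):
        s = estimate_intensity(z, probe)
        # Chain rule: d/dz 0.5 * (s - target)^2 = (s - target) * s * (1 - s) * probe
        coeff = (s - target) * s * (1.0 - s)
        z = [z_d - step_size * coeff * p_d for z_d, p_d in zip(z, probe)]
    return z


def closed_loop(latent, probe, target, tolerance=0.02, max_rounds=10):
    """Slow loop: a stand-in supervisor accepts the result or requests another fast round."""
    z = list(latent)
    error = abs(estimate_intensity(z, probe) - target)
    rounds_used = 0
    for round_idx in range(1, max_rounds + 1):
        z = fast_correct(z, probe, target)
        error = abs(estimate_intensity(z, probe) - target)
        rounds_used = round_idx
        if error <= tolerance:  # supervisor critique reduced to a threshold check
            break
    return z, rounds_used, error


# Toy usage: steer a neutral latent toward a target intensity of 0.8.
latent, rounds_used, error = closed_loop([0.0, 0.0], probe=[1.0, 0.5], target=0.8)
```

The split mirrors the description above: the inner loop is cheap and runs many times, while the outer check runs rarely and decides whether the correction is good enough.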

Research Motivation

Our pilot study reveals that existing TTS models exhibit systematic biases when handling composite emotional instructions, failing to faithfully express multi-dimensional emotional intents.

Composite Emotion Control Analysis

Figure 2: Semantic-Acoustic Misalignment in Existing Methods. When given composite instructions (e.g., "Happy but slightly Arrogant"), state-of-the-art models show significant deviation from target emotion profiles:

Target Suppression: Target emotion expression drops by 25% to 45%
Energy Leakage: +0.08 average leakage to irrelevant dimensions
Joint Satisfaction: Only ~30% for composite instructions
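The three quantities above can be illustrated with a small sketch. The definitions here are assumptions chosen for illustration, since the page does not give formulas: suppression is the mean relative shortfall on target dimensions, leakage is the mean score on dimensions whose target is zero, and joint satisfaction is the fraction of samples in which every target dimension reaches a hypothetical `ratio` of its target. The emotion profiles are made-up toy values, not results from the paper.

```python
def target_suppression(target, measured):
    """Mean relative shortfall on target dimensions (assumed definition)."""
    dims = [k for k, t in target.items() if t > 0]
    shortfalls = [max(0.0, (target[k] - measured.get(k, 0.0)) / target[k]) for k in dims]
    return sum(shortfalls) / len(shortfalls)


def energy_leakage(target, measured):
    """Mean measured score on dimensions the instruction did not ask for (assumed definition)."""
    off_target = [k for k in measured if target.get(k, 0.0) == 0]
    if not off_target:
        return 0.0
    return sum(measured[k] for k in off_target) / len(off_target)


def joint_satisfaction(pairs, ratio=0.8):
    """Fraction of (target, measured) pairs where every target dimension reaches ratio * target."""
    satisfied = 0
    for target, measured in pairs:
        if all(measured.get(k, 0.0) >= ratio * t for k, t in target.items() if t > 0):
            satisfied += 1
    return satisfied / len(pairs)


# Toy profiles for "Happy but slightly Arrogant" (illustrative numbers only).
target = {"happy": 0.8, "arrogant": 0.4, "sad": 0.0, "fearful": 0.0}
weak = {"happy": 0.6, "arrogant": 0.2, "sad": 0.1, "fearful": 0.1}
faithful = {"happy": 0.8, "arrogant": 0.4, "sad": 0.0, "fearful": 0.0}

suppression = target_suppression(target, weak)
leakage = energy_leakage(target, weak)
joint = joint_satisfaction([(target, weak), (target, faithful)])
```

Under these assumed definitions, a model can score well on one attribute and still fail joint satisfaction, which is why composite instructions are reported separately.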

Audio Demonstrations

We present side-by-side audio samples that make the comparison immediate: same content and speaker condition, different emotional controllability.

1. Composite control

Fine-grained mixed instructions test whether a model can preserve both dominant and secondary emotional cues.

2. Single emotions

The new ICML asset bundle compares neutral and five basic emotions across IndexTTS2, CosyVoice2, and ours.

3. Intensity steering

Low, medium, and high intensity samples demonstrate controllable acoustic strength under fixed semantics.

Part 1: Composite Emotion Control

Complex instructions that combine multiple emotional attributes. This is our core contribution.

Reference Audio	Composite Instruction	Text Content
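Speaker Reference	Happy but slightly Arrogant	我早就知道会是这个结果，这种事对我来说太简单了！ (I knew all along this would be the result; this kind of thing is just too easy for me!)
	Sad but with a hint of Hope	虽然现在很难过，但我相信明天会更好的。 (I am very sad right now, but I believe tomorrow will be better.)
	Angry but Restrained	我很生气，但我会控制好自己的情绪，你给我记住。 (I am very angry, but I will keep my emotions under control. You remember that.)
	Surprised and slightly Fearful	那个声音是什么？突然吓了我一跳！ (What was that sound? It gave me a sudden fright!)
	Disgusted with underlying Anger	这种行为真是太让人恶心了，我无法容忍！ (This kind of behavior is truly disgusting; I cannot tolerate it!)
	Happy and Excited	天哪，我中奖了！太不可思议了，太开心了！ (Oh my god, I won the prize! It is unbelievable, I am so happy!)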

Part 2: Single Emotion Benchmark Set

Each row uses the same transcribed text prompt across IndexTTS2, CosyVoice2, and AgentSteerTTS, making the emotional delivery differences easier to compare.

Shared speaker reference
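科技的发展日新月异，人工智能正在改变我们的生活方式。从智能手机到自动驾驶，每一项创新都让世界变得更加便捷和高效。 (Technology is advancing by the day, and artificial intelligence is changing the way we live. From smartphones to autonomous driving, every innovation makes the world more convenient and efficient.)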

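Happy (positive)

太棒了，我终于收到了期待已久的录取通知书，这真是我人生中最开心的一天。 (Wonderful, I finally received the admission letter I had been waiting for so long; this is truly the happiest day of my life.)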

IndexTTS2

CosyVoice2

AgentSteerTTS

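Sad (negative)

她默默地站在窗前，看着窗外的细雨，回想起那些再也回不去的美好时光，眼眶不禁湿润了。 (She stood silently at the window, watching the drizzle outside and recalling the beautiful times that can never return, and her eyes welled up.)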

IndexTTS2

CosyVoice2

AgentSteerTTS

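Angry (high arousal)

你怎么能这样做？我已经反复强调过很多次了，你根本就没有把我的话当回事。 (How could you do this? I have stressed it over and over, and you simply never took my words seriously.)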

IndexTTS2

CosyVoice2

AgentSteerTTS

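Fearful (tense)

不要过来，求求你不要过来，我听到门外有奇怪的脚步声，越来越近了。 (Don't come any closer, please don't come any closer. I hear strange footsteps outside the door, getting nearer and nearer.)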

IndexTTS2

CosyVoice2

AgentSteerTTS

Surprised (reactive)

什么？你说的是真的吗？我简直不敢相信自己的耳朵，这也太意外了吧。 (What? Is what you said true? I can hardly believe my ears; this is so unexpected.)

IndexTTS2

CosyVoice2

AgentSteerTTS

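Neutral (baseline)

今天的天气预报显示，明天白天多云转晴，气温在15-25度之间，适合户外活动。 (Today's weather forecast shows that tomorrow will be cloudy turning sunny during the day, with temperatures between 15 and 25 degrees, suitable for outdoor activities.)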

IndexTTS2

CosyVoice2

AgentSteerTTS

Part 3: Emotion Intensity Control

Fine-grained control over emotion intensity levels. Same text, different intensity.

Emotion	Text	Low (30%)	Medium (60%)	High (100%)
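😊 Happy	这个消息真是太好了。 (This news is really great.)	微笑 (smiling)	开心 (happy)	狂喜 (ecstatic)
😠 Angry	你这样做是不对的。 (What you are doing is not right.)	不悦 (displeased)	生气 (angry)	暴怒 (furious)
😢 Sad	这件事让我很难接受。 (This is very hard for me to accept.)	失落 (dejected)	难过 (sad)	悲痛 (grief-stricken)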
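One simple way to picture intensity steering is as interpolation between a neutral control vector and a full-strength emotion vector. The sketch below assumes exactly that. The linear interpolation and the names `INTENSITY_LEVELS` and `steer_intensity` are illustrative assumptions rather than the method used by AgentSteerTTS; only the 30%, 60%, and 100% levels come from the table above.

```python
# Intensity levels from the table above, expressed as interpolation weights.
INTENSITY_LEVELS = {"low": 0.3, "medium": 0.6, "high": 1.0}


def steer_intensity(neutral_vec, emotion_vec, level):
    """Move from the neutral vector toward the emotion vector by the level's weight.

    Assumed linear rule: control = neutral + alpha * (emotion - neutral).
    """
    if level not in INTENSITY_LEVELS:
        raise KeyError(f"unknown intensity level: {level!r}")
    if len(neutral_vec) != len(emotion_vec):
        raise ValueError("vectors must have the same dimension")
    alpha = INTENSITY_LEVELS[level]
    return [n + alpha * (e - n) for n, e in zip(neutral_vec, emotion_vec)]


# Toy usage with made-up two-dimensional control vectors.
neutral = [0.0, 0.0]
happy = [1.0, 2.0]
low = steer_intensity(neutral, happy, "low")
high = steer_intensity(neutral, happy, "high")
```

With this rule the text and the emotion direction stay fixed, and only the distance travelled from neutral changes, which matches the "same text, different intensity" setup of the table.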

Visualization Results

The attribute energy allocation analysis shows how AgentSteerTTS concentrates energy on target dimensions while reducing leakage.

"Angry but Restrained"

Figure 3a: AgentSteerTTS precisely controls the intensity balance between "Angry" and "Restrained", achieving nuanced expression that baseline models cannot replicate.

"Sad but Hopeful"

Figure 3b: For the challenging "Sad but Hopeful" combination, AgentSteerTTS successfully maintains both conflicting emotions.