<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://vox-serve.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://vox-serve.github.io/" rel="alternate" type="text/html" /><updated>2026-02-09T07:02:31+00:00</updated><id>https://vox-serve.github.io/feed.xml</id><title type="html">VoxServe</title><subtitle>Blog</subtitle><entry><title type="html">Light-Speed Qwen3-TTS Serving at Scale with VoxServe</title><link href="https://vox-serve.github.io/2026/02/09/qwen3-tts-support.html" rel="alternate" type="text/html" title="Light-Speed Qwen3-TTS Serving at Scale with VoxServe" /><published>2026-02-09T00:00:00+00:00</published><updated>2026-02-09T00:00:00+00:00</updated><id>https://vox-serve.github.io/2026/02/09/qwen3-tts-support</id><content type="html" xml:base="https://vox-serve.github.io/2026/02/09/qwen3-tts-support.html"><![CDATA[<p><strong>TL;DR:</strong> VoxServe now fully supports the <strong><a href="https://huggingface.co/collections/Qwen/qwen3-tts">Qwen3-TTS</a></strong> model family (Base, CustomVoice, and VoiceDesign) with true end-to-end streaming. You get streaming text input and audio output, chunked audio decoding, continuous batching, and CUDA Graph optimizations for high performance at scale.</p>

<h2 id="highlights">Highlights</h2>

<p><strong>Ultra-low latency</strong>: VoxServe is built for real-time speech, delivering extremely low inference latency. In the demo below, a TTS request achieves a Time-To-First-Audio (TTFA) as low as <strong>40 ms</strong> on an NVIDIA H100 GPU.</p>

<div style="padding:60% 0 0 0;position:relative;"><iframe src="https://player.vimeo.com/video/1163095537?badge=0&amp;autopause=0&amp;player_id=0&amp;app_id=58479" frameborder="0" allow="autoplay; fullscreen; picture-in-picture; clipboard-write; encrypted-media; web-share" referrerpolicy="strict-origin-when-cross-origin" style="position:absolute;top:0;left:0;width:100%;height:100%;" title="voxserve-qwen3tts-demo1"></iframe></div>
<script src="https://player.vimeo.com/api/player.js"></script>

<p><strong>Real-time LLM chat integration</strong>: Qwen3-TTS accepts incremental text input, and VoxServe exposes that capability end to end, making it easy to build voice chatbots. The video below shows VoxServe connected to a local LLM, achieving low end-to-end response latency.</p>

<div style="padding:60% 0 0 0;position:relative;"><iframe src="https://player.vimeo.com/video/1163095770?badge=0&amp;autopause=0&amp;player_id=0&amp;app_id=58479" frameborder="0" allow="autoplay; fullscreen; picture-in-picture; clipboard-write; encrypted-media; web-share" referrerpolicy="strict-origin-when-cross-origin" style="position:absolute;top:0;left:0;width:100%;height:100%;" title="voxserve-qwen3tts-demo2"></iframe></div>
<script src="https://player.vimeo.com/api/player.js"></script>

<p><strong>High throughput</strong>: VoxServe is optimized not just for low latency, but also for high-throughput serving under load. The figure below compares streaming performance against <a href="https://github.com/vllm-project/vllm-omni">vLLM-Omni</a> (v0.14.0) across increasing request rates. The y-axis reports TTFA (time to first audio chunk). The annotation boxes report <em>streaming viability</em>, i.e., the fraction of audio chunks delivered in time to avoid playback gaps on the client. All experiments follow the setup in our <a href="https://arxiv.org/abs/2602.00269">paper</a>. The benchmark script is available <a href="https://github.com/vox-serve/vox-serve/blob/main/benchmark/goodput.py">here</a>.</p>

<p align="center">
<img src="/assets/figs/qwen3-serving-performance.png" alt="Serving performance for Qwen3-TTS." width="600" />
<br />
Serving performance.
</p>

<p>While vLLM-Omni supports online serving for Qwen3-TTS, it does not currently support streaming audio generation, which keeps TTFA high even at low request rates. VoxServe treats streaming generation as a first-class objective, enabling effective batching while keeping TTFA low even under heavy concurrency.</p>

<h2 id="usage">Usage</h2>

<p>Install VoxServe and serve a Qwen3-TTS checkpoint:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>vox-serve
vox-serve <span class="nt">--model</span> Qwen/Qwen3-TTS-12Hz-1.7B-Base <span class="nt">--port</span> 8000
</code></pre></div></div>

<p>Generate speech with a simple <code class="language-plaintext highlighter-rouge">curl</code> request:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="nt">-X</span> POST <span class="s2">"http://localhost:8000/generate"</span> <span class="se">\</span>
  <span class="nt">-F</span> <span class="s2">"text=Hello, this is a demonstration of Qwen3-TTS served by VoxServe."</span> <span class="se">\</span>
  <span class="nt">-F</span> <span class="s2">"streaming=true"</span> <span class="se">\</span>
  <span class="nt">-o</span> output.wav
</code></pre></div></div>
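
<p>The same request can also be issued from Python. The sketch below is a minimal, standard-library-only equivalent of the <code class="language-plaintext highlighter-rouge">curl</code> command above. It assumes the server started in the previous step is listening on <code class="language-plaintext highlighter-rouge">localhost:8000</code> and that <code class="language-plaintext highlighter-rouge">/generate</code> accepts the same multipart form fields (<code class="language-plaintext highlighter-rouge">text</code>, <code class="language-plaintext highlighter-rouge">streaming</code>). The helper names <code class="language-plaintext highlighter-rouge">encode_form</code> and <code class="language-plaintext highlighter-rouge">synthesize</code> are illustrative and not part of VoxServe.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import urllib.request
import uuid


def encode_form(fields, boundary=None):
    """Encode a dict of text fields as multipart/form-data (what curl -F sends)."""
    boundary = boundary or uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines.append("--" + boundary)
        lines.append('Content-Disposition: form-data; name="' + name + '"')
        lines.append("")
        lines.append(value)
    lines.append("--" + boundary + "--")
    lines.append("")
    body = "\r\n".join(lines).encode("utf-8")
    return body, "multipart/form-data; boundary=" + boundary


def synthesize(text, out_path, url="http://localhost:8000/generate"):
    """POST the text and write the returned audio bytes to out_path as they arrive."""
    body, content_type = encode_form({"text": text, "streaming": "true"})
    request = urllib.request.Request(url, data=body, method="POST")
    request.add_header("Content-Type", content_type)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        while True:
            chunk = response.read(4096)
            if not chunk:
                break
            out.write(chunk)
</code></pre></div></div>

<p>With a server running, <code class="language-plaintext highlighter-rouge">synthesize("Hello from VoxServe.", "output.wav")</code> should produce the same kind of file as the <code class="language-plaintext highlighter-rouge">curl</code> example.</p>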

<p>For detailed examples, see the Qwen3-TTS usage page:</p>

<ul>
  <li><a href="https://vox-serve.github.io/vox-serve/usage/qwen3-tts.html">https://vox-serve.github.io/vox-serve/usage/qwen3-tts.html</a></li>
</ul>

<p>We also provide an <a href="https://github.com/vox-serve/vox-serve/tree/main/examples/playground">interactive playground</a> for quick experimentation.</p>

<h2 id="technical-details">Technical Details</h2>

<p>Qwen3-TTS is a state-of-the-art text-to-speech model from Alibaba’s Qwen team. It delivers strong audio quality, but serving it well is non-trivial: the architecture is multi-stage, supports multiple modes (Base, CustomVoice, VoiceDesign), and requires careful input/output streaming for low-latency inference.</p>

<p>VoxServe is a high-efficiency serving system built specifically for speech models. It provides a stable execution abstraction that accommodates a wide range of modern speech architectures, while enabling system-level optimizations like continuous batching, cache management, and CUDA Graph execution. As with <a href="https://vox-serve.github.io/vox-serve/models.html">many other models</a> in our ecosystem, VoxServe supports the full Qwen3-TTS feature set with low latency in streaming scenarios.</p>

<p>Below, we outline how VoxServe maps cleanly onto Qwen3-TTS.</p>

<h3 id="model-architecture">Model Architecture</h3>

<p align="center">
<img src="/assets/figs/qwen3-tts.png" alt="Qwen3-TTS model architecture. Image taken from https://github.com/QwenLM/Qwen3-TTS" width="600" />
<br />
Model architecture of Qwen3-TTS.
</p>

<p>Qwen3-TTS is composed of four major components:</p>

<ol>
  <li><strong>Speech Encoder</strong>: optionally encodes reference speech for voice cloning (Base variant)</li>
  <li><strong>Qwen3 LM (Talker)</strong>: generates speech tokens for codebook 0</li>
  <li><strong>MTP Module (Codec Predictor)</strong>: generates speech tokens for codebooks 1–15</li>
  <li><strong>Streaming Codec Decoder</strong>: converts 16 codebooks into waveform audio</li>
</ol>

<p>Three components (talker, codec predictor, and codec decoder) operate autoregressively. That creates engineering challenges around request scheduling, cache management, and GPU utilization. The codec decoder also includes audio-specific operations (e.g., convolutions) that introduce additional state to manage for streaming. Finally, the three Qwen3-TTS variants require different input configurations, adding more surface area to the serving stack.</p>
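
<p>To make the extra streaming state concrete, here is a toy sketch (not VoxServe or Qwen3-TTS code) of a causal 1-D convolution that carries its last few inputs across chunks. The class name and weights are illustrative assumptions. The point is only that chunked decoding matches full-sequence decoding when this left context is cached per request.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class StreamingCausalConv1d:
    """Toy causal 1-D convolution that keeps its left context between calls."""

    def __init__(self, weights):
        self.weights = list(weights)
        # Left context: the last (kernel_size - 1) inputs, zero before the stream starts.
        self.cache = [0.0] * (len(self.weights) - 1)

    def forward(self, chunk):
        padded = self.cache + list(chunk)
        out = [
            sum(w * padded[t + i] for i, w in enumerate(self.weights))
            for t in range(len(chunk))
        ]
        # Keep the tail of the input as context for the next chunk.
        keep = len(self.weights) - 1
        self.cache = padded[len(padded) - keep:]
        return out


signal = [float(v) for v in range(10)]
weights = [0.5, 0.25, 0.25]

# Reference: process the whole signal in one call.
full = StreamingCausalConv1d(weights).forward(signal)

# Streaming: process the same signal in chunks of up to 4 samples.
streaming = StreamingCausalConv1d(weights)
chunked = []
for start in range(0, len(signal), 4):
    chunked.extend(streaming.forward(signal[start:start + 4]))

assert chunked == full
</code></pre></div></div>

<p>A real codec decoder has many such layers plus attention, so the per-request state is larger, but the bookkeeping follows the same pattern.</p>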

<p>Despite this complexity, the model fits naturally into VoxServe’s execution interface.</p>

<p align="center">
<img src="/assets/figs/system-overview.png" alt="System design of VoxServe." width="900" />
<br />
System design of VoxServe.
</p>

<p>VoxServe implements a shared execution pipeline for all supported models:</p>

<p><strong>Preprocess → LM Forward → Sampling (→ Depth Forward → Depth Sampling) → Postprocess</strong></p>

<ul>
  <li><strong><a href="https://github.com/vox-serve/vox-serve/blob/v0.1.0/vox_serve/model/qwen3_tts.py#L1373">Preprocess</a></strong>: Formats inputs and runs the speech encoder when needed. Qwen3-TTS inputs vary by variant: speaker IDs for CustomVoice, reference audio/text for Base voice cloning, and instruction-style prompts for VoiceDesign.</li>
  <li><strong><a href="https://github.com/vox-serve/vox-serve/blob/v0.1.0/vox_serve/model/qwen3_tts.py#L1805">LM Forward</a> &amp; <a href="https://github.com/vox-serve/vox-serve/blob/v0.1.0/vox_serve/model/qwen3_tts.py#L1863">Sampling</a></strong>: Runs the talker (Qwen3 LM). Each step consumes a single text token plus 16 audio tokens, the content of which varies depending on whether input streaming is enabled, and voice cloning can additionally inject audio feature vectors. VoxServe’s interface supports this cleanly via three buffers: <code class="language-plaintext highlighter-rouge">input_ids</code>, <code class="language-plaintext highlighter-rouge">input_masks</code>, and <code class="language-plaintext highlighter-rouge">input_features</code>. We did not need to change this interface to support the full functionality of the Qwen3-TTS model.</li>
  <li><strong><a href="https://github.com/vox-serve/vox-serve/blob/v0.1.0/vox_serve/model/qwen3_tts.py#L1964">Depth Forward</a> &amp; <a href="https://github.com/vox-serve/vox-serve/blob/v0.1.0/vox_serve/model/qwen3_tts.py#L1981">Sampling</a></strong>: Runs the codec predictor (MTP). VoxServe already supports this class of “depth” modules (e.g., in CSM-1B), so Qwen3-TTS plugs into an existing interface.</li>
  <li><strong><a href="https://github.com/vox-serve/vox-serve/blob/v0.1.0/vox_serve/model/qwen3_tts.py#L2006">Postprocess</a></strong>: Runs the codec decoder and emits waveform audio. For streaming, this stage requires cache management for both attention and convolutional layers. VoxServe already handled detokenizer caching for other models (e.g., CosyVoice 2); enabling it for Qwen3-TTS required just defining a new <a href="https://github.com/vox-serve/vox-serve/blob/v0.1.0/vox_serve/tokenizer/qwen3_codec.py#L34">cache class</a> and wiring the decoder to read/write cache state.</li>
</ul>
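
<p>As a rough mental model, one generation step for a model with a depth module can be sketched as below. This is a simplified illustration, not the actual VoxServe interface: the class and method names are hypothetical, the model calls are stand-in functions that return deterministic dummy tokens, and batching, caches, and CUDA Graphs are omitted.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>NUM_CODEBOOKS = 16  # codebook 0 from the talker, 1-15 from the codec predictor


class ToyPipeline:
    """Hypothetical stand-in that mirrors the stage order only."""

    def preprocess(self, text):
        # Real preprocessing would tokenize and optionally run the speech encoder.
        return {"text_ids": [ord(c) for c in text], "frames": []}

    def lm_forward(self, state, step):
        # Stand-in for talker logits: one score per candidate token.
        seed = state["text_ids"][step % len(state["text_ids"])]
        return [(seed + k) % 7 for k in range(8)]

    def sample(self, logits):
        # Greedy sampling: index of the largest score.
        return max(range(len(logits)), key=lambda k: logits[k])

    def depth_forward_and_sample(self, first_token):
        # Stand-in for the codec predictor filling codebooks 1-15.
        return [(first_token + k) % 8 for k in range(1, NUM_CODEBOOKS)]

    def postprocess(self, frame):
        # Stand-in for the codec decoder: one fake audio sample per frame.
        return sum(frame) / len(frame)

    def step(self, state, step):
        token0 = self.sample(self.lm_forward(state, step))
        frame = [token0] + self.depth_forward_and_sample(token0)
        state["frames"].append(frame)
        return self.postprocess(frame)


pipeline = ToyPipeline()
state = pipeline.preprocess("hi")
audio = [pipeline.step(state, i) for i in range(3)]
</code></pre></div></div>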

<p>The key point is that <strong>we support Qwen3-TTS’s model-specific details without materially changing the layers above</strong>, such as the scheduler and worker in the diagram. That design choice matters because it lets existing system optimizations – continuous batching, KV/detokenizer cache management, CUDA Graph execution, and scheduling policies – apply to Qwen3-TTS with minimal friction. This is especially important for speech models, where architectures vary significantly across families. For deeper detail on the interface and optimizations, see our <a href="https://arxiv.org/abs/2602.00269">paper</a>.</p>

<h3 id="input-streaming-implementation">Input Streaming Implementation</h3>

<p>VoxServe supports Qwen3-TTS’s incremental text input feature through a <a href="https://github.com/vox-serve/vox-serve/blob/v0.1.0/vox_serve/scheduler/input_streaming.py#L26">custom scheduler implementation</a>.</p>

<p>VoxServe’s scheduler is explicitly designed to expose performance optimization hooks. In our paper, we describe <a href="https://github.com/vox-serve/vox-serve/blob/v0.1.0/vox_serve/scheduler/online.py#L9">streaming-oriented scheduling policies</a> that prioritize low TTFA while preserving output streamability. These policies are implemented as request selection logic applied at each scheduler iteration.</p>

<p>Input streaming can be seamlessly supported in a similar way by swapping in a different scheduler implementation. The main challenge arises when audio generation outpaces incoming text; VoxServe addresses this by employing request selection rules that manage partially available input, ensuring the output stream remains smooth and uninterrupted.</p>
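
<p>One way to picture such a rule, as a sketch under simplifying assumptions rather than the scheduler’s actual logic: a request is eligible for the next LM step only if it still has unconsumed text, or if the client has closed its input stream so the remaining audio can be generated from the text already received. The <code class="language-plaintext highlighter-rouge">Request</code> fields and the <code class="language-plaintext highlighter-rouge">select_runnable</code> function are hypothetical names.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    received_tokens: int = 0    # text tokens delivered by the client so far
    consumed_tokens: int = 0    # text tokens already fed to the model
    input_closed: bool = False  # client signalled the end of its text


def select_runnable(requests):
    """Return requests that can take a decoding step without starving on text."""
    runnable = []
    for req in requests:
        has_pending_text = req.consumed_tokens != req.received_tokens
        if has_pending_text or req.input_closed:
            runnable.append(req)
    return runnable


requests = [
    Request("a", received_tokens=5, consumed_tokens=3),
    Request("b", received_tokens=4, consumed_tokens=4),
    Request("c", received_tokens=4, consumed_tokens=4, input_closed=True),
]
runnable_ids = [req.request_id for req in select_runnable(requests)]
</code></pre></div></div>

<p>In this toy run, request <code class="language-plaintext highlighter-rouge">b</code> is held back until more text arrives, while <code class="language-plaintext highlighter-rouge">a</code> and <code class="language-plaintext highlighter-rouge">c</code> proceed.</p>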

<h2 id="try-voxserve">Try VoxServe</h2>

<p>VoxServe is fully open source on GitHub: <a href="https://github.com/vox-serve/vox-serve">https://github.com/vox-serve/vox-serve</a></p>

<p>Give it a try and let us know what you think!</p>]]></content><author><name>[&quot;VoxServe Team&quot;]</name></author><summary type="html"><![CDATA[TL;DR: VoxServe now fully supports the Qwen3-TTS model family (Base, CustomVoice, and VoiceDesign) with true end-to-end streaming. You get streaming text input and audio output, chunked audio decoding, continuous batching, and CUDA Graph optimizations for high performance at scale.]]></summary></entry><entry><title type="html">Efficient Serving of SpeechLMs with VoxServe</title><link href="https://vox-serve.github.io/2025/09/29/introducing-vox-serve.html" rel="alternate" type="text/html" title="Efficient Serving of SpeechLMs with VoxServe" /><published>2025-09-29T00:00:00+00:00</published><updated>2025-09-29T00:00:00+00:00</updated><id>https://vox-serve.github.io/2025/09/29/introducing-vox-serve</id><content type="html" xml:base="https://vox-serve.github.io/2025/09/29/introducing-vox-serve.html"><![CDATA[<p>TL;DR: We present <strong>VoxServe</strong>, a high-throughput, low-latency serving system designed specifically for Speech Language Models (SpeechLMs). Unlike other LLM serving frameworks, VoxServe is built with speech as its primary focus, integrating functionalities such as audio detokenization and streaming generation into the core system. It offers a unified abstraction layer that supports a wide range of speech models through a single, consistent interface. In addition, VoxServe introduces a novel scheduling algorithm optimized for speech services with various scenarios.</p>

<p>Code is open-sourced here: <a href="https://github.com/vox-serve/vox-serve">https://github.com/vox-serve/vox-serve</a></p>

<hr />

<p>In recent years, <strong>Speech Language Models (SpeechLMs)</strong>, such as Text-to-Speech (TTS) and Speech-to-Speech (STS) models built on Language Model (LM) backbones, have gained significant traction. The release of powerful open-source models is opening up exciting opportunities for speech AI applications.</p>

<p>However, deploying these models in practice remains challenging:</p>

<ol>
  <li><strong>Lack of standardized abstractions</strong>. SpeechLMs vary widely in architecture, and there is no common framework to unify inference across them. This makes it difficult to switch between models.</li>
  <li><strong>Limited focus on efficiency</strong>. To our knowledge, no inference system exists that is designed specifically for SpeechLMs with an emphasis on low-latency, high-throughput deployment. As a result, serving these models can be slow and costly.</li>
</ol>

<p>In practice, each new speech model often ships with its own custom inference stack, which rarely prioritizes efficiency and makes switching between models cumbersome. Repurposing existing LLM serving systems also demands significant effort due to fundamental differences in architecture and inference algorithms.</p>

<p>VoxServe addresses these challenges by providing a unified interface that supports diverse SpeechLMs, with <strong>high performance as the core design goal</strong>.</p>

<h2 id="speechlm-background">SpeechLM Background</h2>

<p>Modern SpeechLMs typically consist of an <strong>LM backbone</strong> and an <strong>audio detokenizer</strong> model: the LM autoregressively generates discrete audio tokens, which the detokenizer then converts into continuous audio data.</p>

<p align="center">
<img src="/assets/figs/speech-lm-overview.png" alt="Overview of typical SpeechLMs." width="600" />
<br />
Overview of typical SpeechLMs.
</p>

<p>Serving these models efficiently poses unique challenges. At every inference step, two different models must run in tandem, while the resulting binary audio data needs to be streamed to the client. To enable streaming generation, both models must be scheduled at the right intervals.</p>

<p>Additionally, there are numerous model-specific complexities that complicate implementation, including multi-codebook modeling, depth transformers, audio input encoders, repetition penalties, and watermarking requirements. Audio detokenizers themselves vary widely in architecture, size, and latency characteristics, further increasing the difficulty.</p>

<p>For optimal serving performance, request scheduling must be carefully designed to account for both the LM backbone and the audio detokenizer.</p>

<h2 id="voxserve">VoxServe</h2>

<p>We address these challenges with VoxServe, a serving system designed from the ground up for SpeechLMs. VoxServe currently supports the following five models, with more on the way:</p>

<ul>
  <li><a href="https://huggingface.co/sesame/csm-1b">CSM</a></li>
  <li><a href="https://huggingface.co/canopylabs/orpheus-3b-0.1-ft">Orpheus</a></li>
  <li><a href="https://huggingface.co/Zyphra/Zonos-v0.1-transformer">Zonos</a></li>
  <li><a href="https://huggingface.co/zai-org/glm-4-voice-9b">GLM-Voice</a></li>
  <li><a href="https://huggingface.co/stepfun-ai/Step-Audio-2-mini">Step-Audio-2-Mini</a></li>
</ul>

<p>The VoxServe model class is designed to natively support multi-stream inference, a common requirement for speech workloads. Its abstraction is carefully engineered to strike a balance between flexibility and efficiency, making it compatible with batch inference, CUDA graphs, and streaming.</p>

<p>A typical inference pipeline in VoxServe includes:</p>

<ul>
  <li><strong>Preprocessing</strong>: Preparing inputs for the LM backbone, such as prompt formatting, encoder inference, and metadata or masking setup for sampling.</li>
  <li><strong>LM forward</strong>: Running the LM backbone to generate logits for the next tokens.</li>
  <li><strong>Sampling</strong>: Selecting the next tokens from logits, which may involve algorithms like repetition penalty, classifier-free guidance, or filtering based on token type (e.g., audio vs. text tokens).</li>
  <li><strong>(Optional) Depth forward</strong>: Executing the depth transformer for models that autoregressively generate tokens across multiple codebooks.</li>
  <li><strong>Postprocessing</strong>: Converting tokens into audio data using the detokenizer, with a unified interface across diverse architectures.</li>
</ul>

<p>By standardizing this workflow while remaining adaptable to model-specific variations, VoxServe simplifies deployment and ensures performance across a wide range of SpeechLMs.</p>

<h2 id="performance-optimizations">Performance Optimizations</h2>

<p>VoxServe goes beyond basic model inference, introducing optimizations specifically tailored to SpeechLMs in order to maximize performance.</p>

<h3 id="scheduling-algorithm">Scheduling Algorithm</h3>

<p>Since SpeechLMs comprise multiple components (the LM backbone and the audio detokenizer), the scheduling of requests between them has a direct impact on performance. Importantly, we note that speech applications differ in their performance requirements, so VoxServe implements specialized scheduling strategies for two distinct scenarios: <strong>online serving</strong> and <strong>offline serving</strong>.</p>

<p><strong>Online serving scenarios</strong>: For interactive applications like voice chatbots, where many requests arrive at random intervals, we define the following two metrics:</p>

<ul>
  <li><strong>Time-To-First-Audio (TTFA)</strong>: The latency from user input to the first audio chunk. Unlike Time-To-First-Token (TTFT) in LLMs, this requires generating multiple tokens and running the detokenizer (and sometimes the encoder) before producing the first chunk.</li>
  <li><strong>Streaming Viability</strong>: Once the first audio chunk is ready, subsequent audio must be generated faster than the playback speed so that the client does not experience gaps in playback.</li>
</ul>
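
<p>As an illustration, both metrics can be computed from per-chunk timestamps. The sketch below uses a simplified model that is an assumption on our part and not necessarily the exact definition used in the benchmark script: playback starts when the first chunk arrives, and a chunk counts as on time if it arrives before all earlier audio has finished playing. Function and argument names are hypothetical.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def ttfa(request_time, arrival_times):
    """Seconds from request submission to the first audio chunk."""
    return arrival_times[0] - request_time


def streaming_viability(arrival_times, chunk_durations):
    """Fraction of chunks that arrive before the client runs out of audio."""
    playback_deadline = arrival_times[0]  # playback starts with the first chunk
    on_time = 0
    for arrival, duration in zip(arrival_times, chunk_durations):
        if playback_deadline >= arrival:
            on_time += 1
        # Simplification: deadlines are not shifted after a late chunk.
        playback_deadline += duration
    return on_time / len(arrival_times)


# Illustrative timestamps in seconds, not measured data.
arrivals = [0.04, 0.10, 0.50, 0.55]
durations = [0.2, 0.2, 0.2, 0.2]
first_audio_latency = ttfa(0.0, arrivals)
viability = streaming_viability(arrivals, durations)
</code></pre></div></div>

<p>Here the third chunk arrives after the first two have finished playing, so three of the four chunks are on time.</p>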

<p>A notable difference from text generation is that speed improvements beyond playback rate have diminishing returns (except for the first chunk), i.e., as long as the generation satisfies the real-time requirements, there is no benefit in generating faster than that. This opens the door to a scheduling strategy that prioritizes requests only when they are critical (either because the first audio chunk has not yet been produced or because streaming viability is at risk).</p>

<p>VoxServe classifies requests as critical or non-critical based on their current progress. Critical requests are prioritized, while non-critical ones can be delayed slightly to improve overall hardware utilization without hurting latency or streaming quality.</p>

<p>Intuitively, you can delay the inference of some part of the model for better hardware utilization, as long as it affects neither TTFA nor streaming viability.</p>

<p align="center">
<img src="/assets/figs/online-scheduling.png" alt="Examples of online scheduling optimizations." width="800" />
<br />
Examples of online scheduling optimizations.
</p>
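
<p>The classification can be sketched as follows. This is an illustrative simplification with hypothetical names and an arbitrary slack threshold, not the policy implemented in VoxServe: a request is critical until its first chunk is sent, and afterwards only when the audio buffered on the client is about to run out.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from dataclasses import dataclass


@dataclass
class StreamState:
    request_id: str
    first_chunk_sent: bool
    buffered_audio_s: float  # audio already delivered but not yet played


def is_critical(state, min_slack_s=0.5):
    """Critical if TTFA is still pending or the playback buffer is nearly empty."""
    if not state.first_chunk_sent:
        return True
    return min_slack_s > state.buffered_audio_s


def schedule(states, batch_size):
    """Fill the batch with critical requests first, then the least-buffered others."""
    ordered = sorted(states, key=lambda s: (not is_critical(s), s.buffered_audio_s))
    return [s.request_id for s in ordered[:batch_size]]


states = [
    StreamState("new", first_chunk_sent=False, buffered_audio_s=0.0),
    StreamState("comfortable", first_chunk_sent=True, buffered_audio_s=3.0),
    StreamState("starving", first_chunk_sent=True, buffered_audio_s=0.1),
]
picked = schedule(states, batch_size=2)
</code></pre></div></div>

<p>With room for two requests, the toy scheduler picks <code class="language-plaintext highlighter-rouge">new</code> and <code class="language-plaintext highlighter-rouge">starving</code> and lets <code class="language-plaintext highlighter-rouge">comfortable</code> wait, since it has three seconds of audio in reserve.</p>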

<p><strong>Offline serving scenarios</strong>:
On the other hand, for workloads such as audiobook or podcast generation, the priority shifts from latency to throughput.</p>

<p>For offline serving, the performance metric is end-to-end throughput, which we measure by the Real-Time Factor (RTF), i.e., the total length of generated audio divided by the time it takes to generate it.</p>

<p>The scheduling strategy is simpler: maximize throughput by keeping hardware fully utilized, typically by running large batches at each stage (LM backbone and detokenizer).</p>

<h3 id="asynchronous-execution">Asynchronous Execution</h3>

<p>To minimize overhead from complex scheduling and metadata processing, VoxServe adopts an asynchronous execution pipeline. Both the LM backbone and the audio detokenizer run asynchronously with respect to their schedulers, leveraging a delayed stop-decision mechanism (as proposed in <a href="https://arxiv.org/abs/2408.12757">NanoFlow</a>).</p>

<p align="center">
<img src="/assets/figs/async-pipeline.png" alt="Pipeline of asynchronous execution." width="800" />
<br />
Pipeline of asynchronous execution.
</p>

<h2 id="evaluation">Evaluation</h2>

<p>We evaluate the performance of VoxServe with the <a href="https://huggingface.co/canopylabs/orpheus-3b-0.1-ft">Orpheus-3B</a> model on a single H100 GPU. As a baseline, we compare against the <a href="https://github.com/canopyai/Orpheus-TTS">official implementation</a> provided by the model developers, which uses vLLM for LM backbone inference. All evaluations use greedy sampling with no repetition penalty.</p>

<p>For online serving, we measure TTFA and streaming viability rate (the fraction of audio chunks meeting real-time playback requirements) under varying request arrival rates. Request arrivals are modeled as a Poisson process, and each request generates 1024 tokens.</p>
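
<p>For reference, Poisson arrivals at a given rate can be simulated by drawing exponentially distributed gaps between requests. The snippet is a generic sketch with hypothetical names, not an excerpt from the benchmark code.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import random


def poisson_arrival_times(rate_per_s, num_requests, seed=0):
    """Arrival timestamps (seconds) of a Poisson process with the given rate."""
    rng = random.Random(seed)
    now = 0.0
    times = []
    for _ in range(num_requests):
        now += rng.expovariate(rate_per_s)  # mean gap is 1 / rate_per_s
        times.append(now)
    return times


arrivals = poisson_arrival_times(rate_per_s=4.0, num_requests=1000)
mean_gap = arrivals[-1] / len(arrivals)  # close to 0.25 s for this rate
</code></pre></div></div>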

<p>The baseline system shows long TTFA and poor streaming viability, even under light loads. In contrast, VoxServe maintains low TTFA and meets real-time playback requirements thanks to its optimized detokenizer implementation, batching strategies, and use of CUDA graphs. With the optimized scheduling algorithm enabled, TTFA stays low at even higher request rates.</p>

<p align="center">
<img src="/assets/figs/perf-online.png" alt="Performance for online serving scenario." width="800" />
<br />
Performance for online serving scenario.
</p>

<p>VoxServe achieves better throughput for the offline serving scenario as well. In an experiment processing 100 requests of equal length (1024 tokens), VoxServe significantly outperforms the baseline, owing to its coordinated scheduling of the LM backbone and audio detokenizer. With optimized scheduling enabled, throughput improves by an additional ~15%.</p>

<p align="center">
<img src="/assets/figs/perf-offline.png" alt="Performance for offline serving scenario." width="600" />
<br />
Performance for offline serving scenario.
</p>

<h2 id="whats-next">What’s Next?</h2>

<p>We are actively working on supporting more models and further performance improvements.</p>

<p>If you’d like to try it out, you can install VoxServe with:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>vox-serve
</code></pre></div></div>

<p>Also, please check out the <a href="https://github.com/vox-serve/vox-serve">code</a> and feel free to post any requests or bug reports at our <a href="https://github.com/vox-serve/vox-serve/issues">GitHub Issues</a>.</p>]]></content><author><name>[&quot;[Keisuke Kamahori](https://kamahori.org), [Baris Kasikci](http://bariskasikci.org/) (University of Washington)&quot;]</name></author><summary type="html"><![CDATA[TL;DR: We present VoxServe, a high-throughput, low-latency serving system designed specifically for Speech Language Models (SpeechLMs). Unlike other LLM serving frameworks, VoxServe is built with speech as its primary focus, integrating functionalities such as audio detokenization and streaming generation into the core system. It offers a unified abstraction layer that supports a wide range of speech models through a single, consistent interface. In addition, VoxServe introduces a novel scheduling algorithm optimized for speech services with various scenarios.]]></summary></entry></feed>