Qwen3-TTS#

Qwen3-TTS is a state-of-the-art text-to-speech model from Alibaba’s Qwen team. VoxServe supports all three 1.7B variants (CustomVoice, Base, and VoiceDesign) with input/output streaming inference.

Model Variants#

1. Custom Voice Model#

Uses predefined speaker embeddings for consistent, high-quality voices.

python -m vox_serve.launch --model Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --port 8000

Then send a request:
import requests

response = requests.post("http://localhost:8000/generate", json={
    "text": "Hello, this is a test.",
    "speaker": "ryan",  # Predefined speaker
    "language": "english"
})

See the model config for the full list of supported speakers and languages.
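The examples above discard the server’s audio response. A minimal sketch for writing it to disk is below; it assumes the endpoint streams raw 16-bit mono PCM, and the sample rate used here (24 kHz) is a placeholder, so check the model config for the actual output format.

```python
import wave

def save_pcm_as_wav(chunks, path, sample_rate=24000):
    """Wrap an iterable of raw PCM byte chunks in a WAV container.

    Assumes 16-bit mono PCM; adjust the parameters to match the
    server's actual output format.
    """
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 16-bit samples
        wf.setframerate(sample_rate)
        for chunk in chunks:
            wf.writeframes(chunk)
```

With `requests`, this pairs with a streamed response, e.g. `save_pcm_as_wav(response.iter_content(chunk_size=4096), "out.wav")` after calling `requests.post(..., stream=True)`.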

2. Base Model#

Clone any voice using a reference audio sample and its transcript.

python -m vox_serve.launch --model Qwen/Qwen3-TTS-12Hz-1.7B-Base --port 8000

Then send a request:
response = requests.post("http://localhost:8000/generate", json={
    "text": "Hello, this is a cloned voice.",
    "audio_path": "/path/to/reference.wav",
    "ref_text": "Transcript of the reference audio.",
    "language": "english"
})

This mode uses in-context learning to adapt the model to the reference voice.
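Since cloning quality depends on the reference clip, it can help to sanity-check the file before sending a request. This sketch only inspects the WAV with the standard-library `wave` module; the server’s actual format requirements are not specified here.

```python
import wave

def inspect_reference(path):
    """Return basic properties of a reference WAV file for a quick sanity check."""
    with wave.open(path, "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return {
            "channels": wf.getnchannels(),
            "sample_rate": rate,
            "duration_sec": frames / rate,
        }
```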

3. Voice Design Mode#

Generate voices based on natural language descriptions.

python -m vox_serve.launch --model Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign --port 8000

Then send a request:
response = requests.post("http://localhost:8000/generate", json={
    "text": "Hello, this is a designed voice.",
    "instruct": "A warm, friendly female voice with a slight British accent.",
    "language": "english"
})
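The three variants differ only in the extra request fields they expect. As a rough sketch, a helper like the hypothetical `make_payload` below (not part of VoxServe; field names taken from the examples above) can build the right payload for each mode:

```python
def make_payload(text, *, speaker=None, audio_path=None, ref_text=None,
                 instruct=None, language="english"):
    """Build a /generate payload for one of the three Qwen3-TTS variants."""
    payload = {"text": text, "language": language}
    if speaker is not None:            # CustomVoice: predefined speaker
        payload["speaker"] = speaker
    elif audio_path is not None:       # Base: voice cloning from reference audio
        if ref_text is None:
            raise ValueError("voice cloning needs both audio_path and ref_text")
        payload.update(audio_path=audio_path, ref_text=ref_text)
    elif instruct is not None:         # VoiceDesign: natural-language description
        payload["instruct"] = instruct
    else:
        raise ValueError("provide one of: speaker, audio_path (+ ref_text), instruct")
    return payload
```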

Input Streaming#

We also support an input text streaming mode, ideal for connecting to a text LLM to build a voice chatbot. To use it, start the server with the input-streaming scheduler:

python -m vox_serve.launch --model Qwen/Qwen3-TTS-12Hz-1.7B-Base --port 8000 --scheduler input_streaming

See the example client script in the VoxServe repository for a complete input-streaming client.
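A common pattern on the client side is to buffer an LLM’s incremental output and forward it to the TTS server sentence by sentence. The flush heuristic below is only a sketch; the actual wire protocol is defined by the example client script, not by this snippet.

```python
def sentence_chunks(token_stream):
    """Yield completed sentences from a stream of text fragments.

    Buffers incoming fragments and flushes each time a sentence
    terminator (. ! ?) appears, so audio generation can start before
    the LLM has finished its full reply.
    """
    buf = ""
    for fragment in token_stream:
        buf += fragment
        while True:
            # flush up to the earliest sentence terminator, if any
            cut = min((i for i in (buf.find(c) for c in ".!?") if i != -1),
                      default=-1)
            if cut == -1:
                break
            yield buf[:cut + 1].strip()
            buf = buf[cut + 1:]
    if buf.strip():
        yield buf.strip()   # flush any trailing partial sentence
```

Each yielded sentence can then be sent to the input-streaming server as it completes.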