Supported Models

VoxServe supports several text-to-speech (TTS) and speech-to-speech (STS) model families. The tables below list each currently supported family with its shorthand code, a representative Hugging Face model ID, and a short description.

Text-to-speech (TTS)

TTS model families

| Family code | Example Hugging Face model ID | Description |
| --- | --- | --- |
| chatterbox | ResembleAI/chatterbox | TTS model developed by Resemble AI. Uses a 0.5B LLM with flow matching and a HiFT vocoder. Supports audio input for voice cloning. |
| cosyvoice2 | FunAudioLLM/CosyVoice2-0.5B | TTS model developed by Alibaba. Uses a 0.5B LLM with flow matching and a HiFT vocoder. Supports audio input for voice cloning. |
| csm | sesame/csm-1b | TTS model developed by Sesame. Uses a 1B LLM and a depth-wise model with a Mimi detokenizer. |
| orpheus | canopylabs/orpheus-3b-0.1-ft | TTS model developed by Canopy Labs. Uses a 3B LLM with a SNAC detokenizer. |
| zonos | Zyphra/Zonos-v0.1-transformer | TTS model developed by Zyphra. Uses a 1B LLM with a DAC detokenizer. |

Speech-to-speech (STS)

STS model families

| Family code | Example Hugging Face model ID | Description |
| --- | --- | --- |
| glm | zai-org/glm-4-voice-9b | STS model developed by Z.ai. Uses a 9B LLM with flow matching and a HiFT vocoder. |
| step | stepfun-ai/Step-Audio-2-mini | STS model developed by StepFun. Uses an 8B LLM with flow matching and a HiFT vocoder. |

Notes

  • The model IDs above are representative examples. Within each family you can also use a local path or another compatible variant.

  • Some families accept audio input: the STS families, and the TTS families that support voice cloning (chatterbox and cosyvoice2). Refer to the model card for input requirements.
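
As a rough illustration of how the tables and notes above fit together, the following Python sketch encodes the family codes and example model IDs as a small lookup table. It is not part of VoxServe's API: `ModelFamily`, `FAMILIES`, `resolve_model`, and `accepts_audio_input` are hypothetical names introduced here, and the voice-cloning flags reflect only what the tables above state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelFamily:
    """One row of the tables above (hypothetical helper type)."""
    code: str              # shorthand family code
    example_id: str        # representative Hugging Face model ID
    kind: str              # "tts" or "sts"
    voice_cloning: bool = False  # True only where the table says so


# Data copied from the TTS and STS tables above.
FAMILIES = {
    f.code: f
    for f in (
        ModelFamily("chatterbox", "ResembleAI/chatterbox", "tts", voice_cloning=True),
        ModelFamily("cosyvoice2", "FunAudioLLM/CosyVoice2-0.5B", "tts", voice_cloning=True),
        ModelFamily("csm", "sesame/csm-1b", "tts"),
        ModelFamily("orpheus", "canopylabs/orpheus-3b-0.1-ft", "tts"),
        ModelFamily("zonos", "Zyphra/Zonos-v0.1-transformer", "tts"),
        ModelFamily("glm", "zai-org/glm-4-voice-9b", "sts"),
        ModelFamily("step", "stepfun-ai/Step-Audio-2-mini", "sts"),
    )
}


def resolve_model(family_code: str, model: Optional[str] = None) -> str:
    """Return `model` (a local path or other variant) if given,
    otherwise the family's representative Hugging Face ID."""
    if family_code not in FAMILIES:
        known = ", ".join(sorted(FAMILIES))
        raise ValueError(f"unknown family {family_code!r}; expected one of: {known}")
    return model or FAMILIES[family_code].example_id


def accepts_audio_input(family_code: str) -> bool:
    """STS families take audio input; so do TTS families listed with voice cloning."""
    family = FAMILIES[family_code]
    return family.kind == "sts" or family.voice_cloning
```

For example, `resolve_model("orpheus")` falls back to the representative ID from the table, while `resolve_model("csm", "/models/csm-local")` passes a local path through unchanged (the path here is a made-up placeholder).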