Supported Models
VoxServe supports multiple text-to-speech (TTS) and speech-to-speech (STS) model families. The tables below list each current family's shorthand code, a representative Hugging Face model ID, and a short description.
Text-to-speech (TTS)
| Family code | Example Hugging Face model ID | Description |
|---|---|---|
| | | TTS model developed by Resemble AI. Uses a 0.5B LLM with flow matching and a HiFT vocoder. Supports audio input for voice cloning. |
| | | TTS model developed by Alibaba. Uses a 0.5B LLM with flow matching and a HiFT vocoder. Supports audio input for voice cloning. |
| | | TTS model developed by Sesame. Uses a 1B LLM and a depth-wise model with a Mimi detokenizer. |
| | | TTS model developed by Canopy Labs. Uses a 3B LLM with a SNAC detokenizer. |
| | | TTS model developed by Zyphra. Uses a 1B LLM with a DAC detokenizer. |
Speech-to-speech (STS)
| Family code | Example Hugging Face model ID | Description |
|---|---|---|
| | | STS model developed by Z.ai. Uses a 9B LLM with flow matching and a HiFT vocoder. |
| | | STS model developed by StepFun. Uses an 8B LLM with flow matching and a HiFT vocoder. |
Notes
The model IDs above are representative examples. You can also use a local path or another compatible variant within the same family.
Some families accept audio input, either as the source speech for STS or as a reference clip for voice cloning in TTS. Refer to the model card for input requirements.
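
Because a model reference may be either a Hugging Face ID or a local path, it can help to check which one a given string is before launching. The following is a minimal sketch, not part of VoxServe: `classify_model_ref` is a hypothetical helper, and it assumes that any string naming an existing directory should be treated as a local checkpoint.

```python
from pathlib import Path


def classify_model_ref(ref: str) -> str:
    """Classify a model reference as a local checkpoint or a Hub ID.

    Hypothetical helper for illustration only; VoxServe's own
    resolution logic may differ.
    """
    # Assumption: an existing directory is a local checkpoint.
    if Path(ref).expanduser().is_dir():
        return "local"
    # Anything else is assumed to be a Hugging Face model ID.
    return "hub"
```

A check like this only looks at the filesystem. It does not verify that the directory contains a model compatible with the chosen family.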