CosyVoice2#
CosyVoice2-0.5B is a text-to-speech model developed by the Alibaba FunAudioLLM team. It combines a 0.5B-parameter LLM with a flow matching decoder and a HiFT vocoder for natural speech synthesis.
Quickstart#
Start the server:
python -m vox_serve.launch --model FunAudioLLM/CosyVoice2-0.5B --port 8000
Generate speech:
import requests
response = requests.post("http://localhost:8000/generate", json={
    "text": "Hello, this is CosyVoice speaking!"
})
with open("output.wav", "wb") as f:
    f.write(response.content)
API Parameters#
| Parameter | Type | Default | Description |
|---|---|---|---|
| `text` | string | required | Text to synthesize |
| `streaming` | boolean | | Enable streaming response |
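As a quick illustration of the two parameters above (assuming `streaming` may also be passed explicitly as `false` for a non-streaming request):

import requests

# "text" is required; "streaming": False requests a single, non-streamed WAV response (assumed behavior)
response = requests.post("http://localhost:8000/generate", json={
    "text": "Hello from the API reference!",
    "streaming": False,
})
with open("output.wav", "wb") as f:
    f.write(response.content)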
Special Tokens#
CosyVoice2 supports special tokens for expressive speech:
| Token | Description |
|---|---|
| `[breath]` | Insert a breath |
| `[laughter]` | Insert laughter |
| `[cough]` | Insert a cough |
| `[sigh]` | Insert a sigh |
| `[noise]` | Insert background noise |
| `<strong>...</strong>` | Emphasize text |
| `<laughter>...</laughter>` | Speak with laughter |
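As a quick illustration (assuming the `<strong>` emphasis tags listed above), tokens are mixed directly into the request text; the Expressive Speech example below shows `[breath]` and `<laughter>` as well:

import requests

# Emphasis via <strong> tags embedded in ordinary text, per the table above
response = requests.post("http://localhost:8000/generate", json={
    "text": "This part is <strong>really important</strong>, so listen carefully."
})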
Examples#
Basic Usage#
import requests
response = requests.post("http://localhost:8000/generate", json={
    "text": "The quick brown fox jumps over the lazy dog."
})
Expressive Speech#
import requests
# With special tokens
response = requests.post("http://localhost:8000/generate", json={
    "text": "Hello! [breath] How are you doing today? <laughter>That's so funny!</laughter>"
})
Streaming Audio#
import requests
with requests.post(
    "http://localhost:8000/generate",
    json={"text": "Hello world!", "streaming": True},
    stream=True
) as response:
    with open("output.wav", "wb") as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
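Writing chunks to disk as they arrive mirrors what a real-time client would do; in principle a player can start consuming audio before the full utterance has been synthesized.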
Using curl#
curl -X POST "http://localhost:8000/generate" \
-H "Content-Type: application/json" \
-d '{"text": "Hello from CosyVoice!"}' \
-o output.wav
CLI Options#
python -m vox_serve.launch \
--model FunAudioLLM/CosyVoice2-0.5B \
--port 8000 \
--temperature 0.7 \
--top_p 0.9
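Here `--model` and `--port` select the checkpoint and listening port, while `--temperature` and `--top_p` set sampling parameters, presumably applied when the LLM backbone generates speech tokens; lower values make generation more deterministic.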
Architecture Notes#
CosyVoice2 features (a simplified pipeline sketch follows this list):

- Qwen-based 0.5B LLM backbone
- Flow matching decoder for high-quality audio
- HiFT vocoder for efficient waveform synthesis
- Support for expressive speech via special tokens
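A minimal sketch of how these stages compose, assuming hypothetical callables for each component (this is not the actual vox_serve or CosyVoice2 API):

from typing import Callable, Sequence

def synthesize(
    text: str,
    llm_generate: Callable[[str], Sequence[int]],
    flow_decode: Callable[[Sequence[int]], Sequence[Sequence[float]]],
    hift_vocode: Callable[[Sequence[Sequence[float]]], bytes],
) -> bytes:
    # 1. Qwen-based 0.5B LLM backbone: text -> discrete speech tokens
    speech_tokens = llm_generate(text)
    # 2. Flow matching decoder: speech tokens -> mel-spectrogram frames
    mel = flow_decode(speech_tokens)
    # 3. HiFT vocoder: mel frames -> waveform bytes
    return hift_vocode(mel)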
Voice cloning via reference audio is planned for a future release.