Text-to-speech
Turn text into speech with POST /v1/tts, and tune generation with the sampling knobs.
POST /v1/tts renders text as speech in a single call:
curl -s https://api.kalpalabs.ai/v1/tts \
  -H "Authorization: Bearer $KALPA_API_KEY" -H 'Content-Type: application/json' \
  -d '{
    "text": "Hey there! How are you doing today?",
    "speaker": "0",
    "model": "kalpa-conversational-v1",
    "params": {"temperature": 0.7, "top_k": 40}
  }'

| Field | Notes |
|---|---|
| text | 1 – 8,000 characters. |
| speaker | A role label from the model's speakers (defaults to "0"). Labels are model-specific; see Models. |
| model | A public model id; omit for the default. |
| params | Sampling knobs, all optional; see below. |
Under the hood, TTS is sugar for a one-turn conversation: it is exactly /v1/converse with [{"speaker": "0", "text": "…"}] as the whole conversation. Use /v1/converse directly when you have prior turns to condition on.
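The curl call above can also be assembled in Python. This is a minimal sketch using only the standard library; it builds the request without sending it. The helper names `build_tts_body`, `as_conversation`, and `build_request` are illustrative and not part of any official client, and the length check assumes the 1 – 8,000 limit counts characters the way Python's `len` does.

```python
import json
import os
import urllib.request

API_URL = "https://api.kalpalabs.ai/v1/tts"  # endpoint from the curl example


def build_tts_body(text, speaker="0", model=None, params=None):
    """Return the JSON body for POST /v1/tts as a dict (hypothetical helper)."""
    # Assumption: the documented limit is measured in Python code points.
    if not 1 <= len(text) <= 8000:
        raise ValueError("text must be 1 to 8,000 characters")
    body = {"text": text, "speaker": speaker}
    if model is not None:  # omit the field to get the default model
        body["model"] = model
    if params:  # every sampling knob is optional
        body["params"] = dict(params)
    return body


def as_conversation(body):
    """The one-turn conversation that a TTS call is described as sugar for."""
    return [{"speaker": body["speaker"], "text": body["text"]}]


def build_request(body, api_key):
    """Wrap the body in a POST request with the headers used by the curl call."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


body = build_tts_body(
    "Hey there! How are you doing today?",
    model="kalpa-conversational-v1",
    params={"temperature": 0.7, "top_k": 40},
)
req = build_request(body, os.environ.get("KALPA_API_KEY", ""))
# To send it: reply = json.load(urllib.request.urlopen(req))
```

Keeping body construction separate from the network call makes it easy to inspect or log the exact payload before any audio is generated.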
The audio you get back
{
  "request_id": "…",
  "model": "kalpa-conversational-v1",
  "text": "Hey there! How are you doing today?",
  "audio": {
    "format": "wav",
    "sample_rate": 24000,
    "num_quantizers": 32,
    "data_b64": "UklGRi…"
  },
  "usage": { "input_chars": 35, "input_audio_seconds": 0.0, "output_audio_seconds": 2.6 }
}

data_b64 is a complete 16-bit PCM WAV file (mono, 24 kHz), base64-encoded. Decode it and write it to a file to play:
import base64

# reply is the parsed JSON response from POST /v1/tts
with open("out.wav", "wb") as f:
    f.write(base64.b64decode(reply["audio"]["data_b64"]))

num_quantizers is how many RVQ levels were decoded into the waveform (32 = full fidelity; see the quantizers knob).
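To confirm that a decoded payload matches the description above (mono, 16-bit, 24 kHz), the WAV header can be read with Python's standard `wave` module. The sketch below is illustrative: `describe_wav` is a hypothetical helper, and `make_silent_wav_b64` only fabricates a silent stand-in so the example runs without calling the API.

```python
import base64
import io
import wave


def describe_wav(data_b64):
    """Decode base64 WAV bytes and return basic properties from its header."""
    raw = base64.b64decode(data_b64)
    with wave.open(io.BytesIO(raw), "rb") as wav:
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),  # 2 bytes = 16-bit PCM
            "sample_rate": rate,
            "seconds": wav.getnframes() / rate,
        }


def make_silent_wav_b64(seconds, rate=24000):
    """Stand-in for a real response: silent mono 16-bit PCM, base64-encoded."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return base64.b64encode(buf.getvalue()).decode("ascii")


# With a real reply, pass reply["audio"]["data_b64"] instead of the stand-in.
info = describe_wav(make_silent_wav_b64(0.5))
print(info)
```

The computed `seconds` value can be compared against `usage.output_audio_seconds` in the response as a quick sanity check.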
Generation parameters
All fields of params are optional; the defaults are the tuned starting point. The same knobs (with the live defaults and UI ranges) are served at GET /v1/info.
| Param | Default | Range | What it does |
|---|---|---|---|
| temperature | 0.7 | 0 – 1.5 | Backbone sampling temperature (text + semantic audio). 0 = greedy. |
| depth_temperature | null | 0 – 1.5 | Acoustic-codes temperature. null = follows temperature. |
| top_k | null | ≥ 1 | Sample the backbone from the k most-likely tokens. null = full vocabulary; 40 is a good setting at nonzero temperatures. |
| repetition_penalty | 3.0 | 0 – 6 | Audio repetition penalty; prevents collapse to silence. |
| penalty_window | 20 | 1 – 80 | Frames of history the repetition penalty looks back over. |
| max_new_tokens | 512 | 16 – 2048 | Generation cap. ~12.5 audio frames ≈ 1 s of speech, so the default caps a reply at roughly 40 s. |
| quantizers | null | 8 / 16 / 32 | Decode only the first N RVQ levels (smaller, lower-fidelity audio). null = full depth. |
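The ranges in the table can be checked on the client before a request is sent, and the frame-rate figure can be turned into a max_new_tokens budget. The following is a rough sketch, not the server's own validation: `check_params`, `tokens_for_seconds`, and `seconds_for_tokens` are hypothetical helpers, the bounds are copied from the table rather than fetched from GET /v1/info, and the arithmetic assumes one generated token per audio frame, as the "roughly 40 s" figure implies.

```python
import math

FRAMES_PER_SECOND = 12.5  # from the max_new_tokens row: ~12.5 frames per second

# (low, high) bounds copied from the table; None means no documented upper bound.
RANGES = {
    "temperature": (0, 1.5),
    "depth_temperature": (0, 1.5),
    "top_k": (1, None),
    "repetition_penalty": (0, 6),
    "penalty_window": (1, 80),
    "max_new_tokens": (16, 2048),
}
QUANTIZER_CHOICES = (8, 16, 32)
# Only these knobs list null as a value in the table.
NULLABLE = {"depth_temperature", "top_k", "quantizers"}


def check_params(params):
    """Return a list of problems; an empty list means every value is in range."""
    problems = []
    for name, value in params.items():
        if value is None:
            if name not in NULLABLE:
                problems.append(f"{name}: null is not documented for this knob")
        elif name == "quantizers":
            if value not in QUANTIZER_CHOICES:
                problems.append(f"{name}: {value} is not one of {QUANTIZER_CHOICES}")
        elif name in RANGES:
            low, high = RANGES[name]
            if value < low or (high is not None and value > high):
                problems.append(f"{name}: {value} is outside the documented range")
        else:
            problems.append(f"{name}: unknown parameter")
    return problems


def tokens_for_seconds(seconds):
    """max_new_tokens for roughly `seconds` of speech, clamped to 16 to 2048."""
    return max(16, min(2048, math.ceil(seconds * FRAMES_PER_SECOND)))


def seconds_for_tokens(tokens):
    """Approximate speech duration that a token budget allows."""
    return tokens / FRAMES_PER_SECOND


print(check_params({"temperature": 0.7, "top_k": 40}))
print(tokens_for_seconds(10), seconds_for_tokens(512))
```

For example, a 10-second cap works out to 10 × 12.5 = 125 tokens, and the default of 512 tokens corresponds to 512 / 12.5 ≈ 41 s, in line with the "roughly 40 s" stated in the table.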
Two practical notes:
- For reproducible output, set temperature to 0: with greedy decoding, the same request returns the same speech.
- If long generations trail into silence or repeat, raise repetition_penalty or lower max_new_tokens before touching anything else.