Text-to-speech

Turn text into speech with POST /v1/tts, and tune generation with the sampling knobs.

POST /v1/tts renders text as speech in a single call:

bash
curl -s https://api.kalpalabs.ai/v1/tts \
  -H "Authorization: Bearer $KALPA_API_KEY" -H 'Content-Type: application/json' \
  -d '{
    "text": "Hey there! How are you doing today?",
    "speaker": "0",
    "model": "kalpa-conversational-v1",
    "params": {"temperature": 0.7, "top_k": 40}
  }'
| Field | Notes |
| --- | --- |
| text | 1 – 8,000 characters. |
| speaker | A role label from the model's speakers (defaults to "0"). Labels are model-specific; see Models. |
| model | A public model id; omit for the default. |
| params | Sampling knobs, all optional; see below. |
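
For callers who prefer Python over curl, the request above can be sketched with the standard library alone. The helper names `build_tts_request` and `synthesize` are hypothetical, and the length check assumes the API counts characters the way Python's `len` does, which this guide does not confirm.

```python
import json
import os
import urllib.request

TTS_URL = "https://api.kalpalabs.ai/v1/tts"  # endpoint from the curl example


def build_tts_request(text, speaker="0", model=None, params=None, api_key=None):
    """Build (but do not send) a POST /v1/tts request. Hypothetical helper."""
    # Assumption: the 1 to 8,000 limit is measured in Python characters.
    if not 1 <= len(text) <= 8000:
        raise ValueError("text must be 1 to 8,000 characters")
    body = {"text": text, "speaker": speaker}
    if model is not None:
        body["model"] = model      # left out, the default model is used
    if params:
        body["params"] = params    # every sampling knob is optional
    key = api_key or os.environ.get("KALPA_API_KEY", "")
    return urllib.request.Request(
        TTS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(text, **kwargs):
    """Send the request and return the parsed JSON reply."""
    with urllib.request.urlopen(build_tts_request(text, **kwargs)) as resp:
        return json.load(resp)
```

Calling `synthesize("Hey there! How are you doing today?", params={"temperature": 0.7, "top_k": 40})` would mirror the curl call, leaving `model` out so the default applies.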

Under the hood, TTS is shorthand for a one-turn conversation: a /v1/tts call is equivalent to calling /v1/converse with [{"speaker": "0", "text": "…"}] as the whole conversation. Use /v1/converse directly when you have prior turns to condition on.
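
To make the equivalence concrete, here is a minimal sketch of the turn list that a TTS call stands for. Both function names are hypothetical. The sketch stops at the list itself because this guide does not name the /v1/converse request field that carries it, and it assumes that prior turns simply precede the new turn in the same list.

```python
def tts_as_conversation(text, speaker="0"):
    """Return the one-turn conversation that a /v1/tts call stands for."""
    return [{"speaker": speaker, "text": text}]


def with_history(prior_turns, text, speaker="0"):
    """Assumed shape for conditioning on history: earlier turns, then the new one."""
    return list(prior_turns) + tts_as_conversation(text, speaker)
```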

The audio you get back

json
{
  "request_id": "…",
  "model": "kalpa-conversational-v1",
  "text": "Hey there! How are you doing today?",
  "audio": {
    "format": "wav",
    "sample_rate": 24000,
    "num_quantizers": 32,
    "data_b64": "UklGRi…"
  },
  "usage": { "input_chars": 35, "input_audio_seconds": 0.0, "output_audio_seconds": 2.6 }
}

data_b64 is a complete 16-bit PCM WAV file (mono, 24 kHz), base64-encoded. Decode it and write the bytes to a file to play it:

python
import base64

# `reply` is the parsed JSON body of the /v1/tts response shown above.
with open("out.wav", "wb") as f:
    f.write(base64.b64decode(reply["audio"]["data_b64"]))

num_quantizers is how many RVQ levels were decoded into the waveform (32 = full fidelity; see the quantizers knob).
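
As a quick sanity check, the decoded bytes can be opened with Python's standard `wave` module to confirm the format described above. This is a sketch, not part of the API: `describe_audio` is a hypothetical helper, and it assumes the payload carries a regular PCM WAV header that `wave` can parse.

```python
import base64
import io
import wave


def describe_audio(data_b64):
    """Decode the base64 WAV and report what its header says."""
    wav_bytes = base64.b64decode(data_b64)
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        frames = w.getnframes()
        rate = w.getframerate()
        return {
            "channels": w.getnchannels(),            # 1 for mono
            "sample_width_bytes": w.getsampwidth(),  # 2 for 16-bit PCM
            "sample_rate": rate,                     # 24000 per this guide
            "seconds": frames / rate,
        }
```

For the reply shown earlier, `describe_audio(reply["audio"]["data_b64"])["seconds"]` should land close to `usage.output_audio_seconds`.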

Generation parameters

All fields of params are optional; the defaults are the tuned starting point. The same knobs (with the live defaults and UI ranges) are served at GET /v1/info.

| Param | Default | Range | What it does |
| --- | --- | --- | --- |
| temperature | 0.7 | 0 – 1.5 | Backbone sampling temperature (text + semantic audio). 0 = greedy. |
| depth_temperature | null | 0 – 1.5 | Acoustic-codes temperature. null = follows temperature. |
| top_k | null | ≥ 1 | Sample the backbone from the k most-likely tokens. null = full vocabulary; 40 is a good setting at nonzero temperatures. |
| repetition_penalty | 3.0 | 0 – 6 | Audio repetition penalty; prevents collapse to silence. |
| penalty_window | 20 | 1 – 80 | Frames of history the repetition penalty looks back over. |
| max_new_tokens | 512 | 16 – 2048 | Generation cap. ~12.5 audio frames ≈ 1 s of speech, so the default caps a reply at roughly 40 s. |
| quantizers | null | 8 / 16 / 32 | Decode only the first N RVQ levels (smaller, lower-fidelity audio). null = full depth. |
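
The max_new_tokens arithmetic can be worked through in a few lines. This sketch assumes one generated token corresponds to one audio frame, which is what the "roughly 40 s" figure implies; the function names are hypothetical and 12.5 frames per second is the approximate rate quoted in the table.

```python
import math

FRAMES_PER_SECOND = 12.5            # approximate rate quoted in the table
MIN_TOKENS, MAX_TOKENS = 16, 2048   # max_new_tokens range from the table


def max_seconds(max_new_tokens=512):
    """Upper bound on reply length, in seconds, for a given token cap."""
    return max_new_tokens / FRAMES_PER_SECOND


def tokens_for_seconds(seconds):
    """Smallest in-range cap that allows about `seconds` of speech."""
    needed = math.ceil(seconds * FRAMES_PER_SECOND)
    return max(MIN_TOKENS, min(MAX_TOKENS, needed))
```

Under these assumptions the default of 512 works out to 512 / 12.5 = 40.96 s, and the maximum of 2048 to about 164 s.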

Two practical notes:

  • For reproducible output, set temperature to 0 and leave depth_temperature at null so it follows. With greedy decoding, the same request returns the same speech.
  • If long generations trail into silence or repeat, raise repetition_penalty or lower max_new_tokens before touching anything else.
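
Both notes can be expressed as small helpers that edit a params object before it is sent. This is one possible sketch: the function names, the 0.5 penalty step, and the 0.75 token factor are illustrative choices rather than recommended values, and only the defaults and range limits come from the table above.

```python
REPETITION_PENALTY_MAX = 6.0   # upper bound from the parameter table
MAX_NEW_TOKENS_MIN = 16        # lower bound from the parameter table


def greedy_params(params=None):
    """Copy params with temperature 0 so decoding is greedy."""
    out = dict(params or {})
    out["temperature"] = 0
    out["depth_temperature"] = None  # JSON null: follows temperature
    return out


def tame_long_generation(params=None, penalty_step=0.5, token_factor=0.75):
    """Raise repetition_penalty and lower max_new_tokens, staying in range."""
    out = dict(params or {})
    penalty = out.get("repetition_penalty", 3.0) + penalty_step
    out["repetition_penalty"] = min(REPETITION_PENALTY_MAX, penalty)
    tokens = int(out.get("max_new_tokens", 512) * token_factor)
    out["max_new_tokens"] = max(MAX_NEW_TOKENS_MIN, tokens)
    return out
```

The second helper changes both knobs at once for brevity; adjusting one at a time makes it easier to see which change fixed the problem.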