Text-to-speech

Turn text into speech with POST /v1/tts, and tune generation with the sampling knobs.

POST /v1/tts renders text as speech in a single call:

bash

curl -s https://api.kalpalabs.ai/v1/tts \
  -H "Authorization: Bearer $KALPA_API_KEY" -H 'Content-Type: application/json' \
  -d '{
    "text": "Hey there! How are you doing today?",
    "speaker": "0",
    "model": "kalpa-conversational-v1",
    "params": {"temperature": 0.7, "top_k": 40}
  }'

Field	Notes
`text`	1 – 8,000 characters.
`speaker`	A role label from the model's `speakers` (defaults to `"0"`). Labels are model-specific — see Models.
`model`	A public model id; omit for the default.
`params`	Sampling knobs, all optional — see below.
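For callers working in Python, the same request can be assembled with the standard library. This is a minimal sketch, not an official client: `build_tts_request` is a hypothetical helper name, the length check approximates the character limit with `len()`, and the request is only constructed here, not sent.

```python
import json
import os
import urllib.request

API_URL = "https://api.kalpalabs.ai/v1/tts"  # endpoint from the curl example


def build_tts_request(text, speaker="0", model=None, params=None, api_key=None):
    """Build (but do not send) a POST /v1/tts request.

    Hypothetical helper for illustration; field names follow the table above.
    """
    # Assumption: the 1 to 8,000 character limit is approximated with len().
    if not 1 <= len(text) <= 8000:
        raise ValueError("text must be 1 to 8,000 characters")
    body = {"text": text, "speaker": speaker}
    if model is not None:  # omit to use the default model
        body["model"] = model
    if params:  # all sampling knobs are optional
        body["params"] = params
    key = api_key or os.environ.get("KALPA_API_KEY", "")
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_tts_request(
    "Hey there! How are you doing today?",
    model="kalpa-conversational-v1",
    params={"temperature": 0.7, "top_k": 40},
)
# To send it: reply = json.load(urllib.request.urlopen(req))
```

Keeping construction separate from sending makes the body easy to inspect or log before any audio is generated.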

Under the hood, TTS is shorthand for a one-turn conversation: a TTS call is exactly equivalent to /v1/converse with [{"speaker": "0", "text": "…"}] as the whole conversation. Use /v1/converse directly when you have prior turns to condition on.
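The equivalence can be illustrated by building the turn list a TTS call stands for. The full /v1/converse request body is not described in this guide, so the sketch below builds only the list of turns; `tts_to_turns` and `with_history` are hypothetical helper names, and `with_history` assumes prior turns share the same `speaker`/`text` shape.

```python
def tts_to_turns(text, speaker="0"):
    """Return the one-turn conversation that a TTS call stands for."""
    return [{"speaker": speaker, "text": text}]


def with_history(prior_turns, text, speaker="0"):
    """Append the new turn to earlier turns, for use with converse.

    Assumption: prior turns use the same {"speaker", "text"} shape.
    """
    return list(prior_turns) + tts_to_turns(text, speaker)


turns = tts_to_turns("Hey there! How are you doing today?")
```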

The audio you get back

json

{
  "request_id": "…",
  "model": "kalpa-conversational-v1",
  "text": "Hey there! How are you doing today?",
  "audio": {
    "format": "wav",
    "sample_rate": 24000,
    "num_quantizers": 32,
    "data_b64": "UklGRi…"
  },
  "usage": { "input_chars": 35, "input_audio_seconds": 0.0, "output_audio_seconds": 2.6 }
}

data_b64 is a complete 16-bit PCM WAV file (mono, 24 kHz), base64-encoded — decode it and play:

python

import base64

# `reply` is the parsed JSON body returned by POST /v1/tts (a dict).
with open("out.wav", "wb") as f:
    f.write(base64.b64decode(reply["audio"]["data_b64"]))

num_quantizers is how many RVQ levels were decoded into the waveform (32 = full fidelity; see the quantizers knob).
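Before saving or playing the audio, it can be worth checking that the decoded bytes match what the reply claims. The sketch below reads the WAV header with Python's standard `wave` module; `inspect_audio` is a hypothetical helper name, and it expects the `audio` object shown in the response above.

```python
import base64
import io
import wave


def inspect_audio(audio):
    """Decode the `audio` object of a reply and return (wav_bytes, info)."""
    wav_bytes = base64.b64decode(audio["data_b64"])
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        info = {
            "channels": w.getnchannels(),            # 1 for mono
            "sample_width_bytes": w.getsampwidth(),  # 2 for 16-bit PCM
            "sample_rate": w.getframerate(),
            "seconds": w.getnframes() / w.getframerate(),
        }
    # The header should agree with the sample_rate field of the reply.
    if info["sample_rate"] != audio["sample_rate"]:
        raise ValueError("WAV header disagrees with audio.sample_rate")
    return wav_bytes, info
```

The `seconds` value computed from the header can be compared against `usage.output_audio_seconds` as a sanity check.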

Generation parameters

All fields of params are optional; the defaults are the tuned starting point. The same knobs (with the live defaults and UI ranges) are served at GET /v1/info.

Param	Default	Range	What it does
`temperature`	`0.7`	0 – 1.5	Backbone sampling temperature (text + semantic audio). `0` = greedy.
`depth_temperature`	`null`	0 – 1.5	Acoustic-codes temperature. `null` = follows `temperature`.
`top_k`	`null`	≥ 1	Sample the backbone from the k most-likely tokens. `null` = full vocabulary; `40` is a good setting at nonzero temperatures.
`repetition_penalty`	`3.0`	0 – 6	Audio repetition penalty; prevents collapse to silence.
`penalty_window`	`20`	1 – 80	Frames of history the repetition penalty looks back over.
`max_new_tokens`	`512`	16 – 2048	Generation cap. About 12.5 audio frames make up 1 s of speech, so the default caps a reply at roughly 41 s (512 / 12.5 ≈ 40.96).
`quantizers`	`null`	8 / 16 / 32	Decode only the first N RVQ levels (smaller, lower-fidelity audio). `null` = full depth.
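A client can check a `params` object against the ranges in the table before sending it. The sketch below is a client-side convenience only: `check_params` is a hypothetical helper, the ranges are copied from the table by hand, and how the server itself treats out-of-range values is not stated here.

```python
# Ranges copied from the table above; an upper bound of None means "no limit".
RANGES = {
    "temperature": (0.0, 1.5),
    "depth_temperature": (0.0, 1.5),
    "top_k": (1, None),
    "repetition_penalty": (0.0, 6.0),
    "penalty_window": (1, 80),
    "max_new_tokens": (16, 2048),
}
QUANTIZER_CHOICES = (8, 16, 32)
# Knobs for which the table documents null (None in Python) as a value.
NULLABLE = {"depth_temperature", "top_k", "quantizers"}


def check_params(params):
    """Return a list of problems found in a params dict (empty if none)."""
    problems = []
    for name, value in params.items():
        if value is None:
            if name not in NULLABLE:
                problems.append(f"{name}: null is not documented for this knob")
            continue
        if name == "quantizers":
            if value not in QUANTIZER_CHOICES:
                problems.append(f"quantizers: must be one of {QUANTIZER_CHOICES}")
        elif name in RANGES:
            lo, hi = RANGES[name]
            if value < lo or (hi is not None and value > hi):
                problems.append(f"{name}: {value} is outside the documented range")
        else:
            problems.append(f"{name}: unknown parameter")
    return problems
```

Since the live defaults and ranges are served at GET /v1/info, a hardcoded table like this one can drift; treat it as a quick pre-flight check rather than a source of truth.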

Two practical notes:

For reproducible output, set temperature to 0 and leave depth_temperature at null (so that it follows temperature) or set it to 0 as well. With greedy decoding, the same request returns the same speech.
If long generations trail into silence or repeat, raise repetition_penalty or lower max_new_tokens before touching anything else.
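When lowering max_new_tokens, the figure of about 12.5 frames per second gives a quick way to pick a cap for a target duration. The sketch below assumes one generated token per audio frame, as the max_new_tokens description implies; `max_seconds` and `tokens_for` are hypothetical helper names.

```python
import math

FRAMES_PER_SECOND = 12.5  # from the max_new_tokens row: ~12.5 frames per second


def max_seconds(max_new_tokens=512):
    """Approximate upper bound on reply length for a given token cap."""
    return max_new_tokens / FRAMES_PER_SECOND


def tokens_for(seconds):
    """Smallest in-range cap that still allows `seconds` of speech."""
    tokens = math.ceil(seconds * FRAMES_PER_SECOND)
    return min(max(tokens, 16), 2048)  # clamp to the documented 16 to 2048 range


print(max_seconds())   # 40.96
print(tokens_for(10))  # 125
```

For a reply expected to run about 10 s, a cap near 125 leaves little room for the generation to trail off into silence.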