# Text-to-speech

> Turn text into speech with POST /v1/tts, and tune generation with the sampling knobs.

`POST /v1/tts` renders text as speech in a single call:

```bash
curl -s https://api.kalpalabs.ai/v1/tts \
  -H "Authorization: Bearer $KALPA_API_KEY" -H 'Content-Type: application/json' \
  -d '{
    "text": "Hey there! How are you doing today?",
    "speaker": "0",
    "model": "kalpa-conversational-v1",
    "params": {"temperature": 0.7, "top_k": 40}
  }'
```

| Field | Notes |
|---|---|
| `text` | 1 – 8,000 characters. |
| `speaker` | A role label from the model's `speakers` (defaults to `"0"`). Labels are model-specific — see [Models](/models). |
| `model` | A public model id; omit for the default. |
| `params` | Sampling knobs, all optional — see below. |

Under the hood, TTS is sugar for a one-turn conversation: it is equivalent to [`/v1/converse`](/conversations) with `[{"speaker": "0", "text": "…"}]` as the whole conversation. Call `converse` directly when you have prior turns to condition on.
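
For callers who prefer Python to curl, the request above can be sketched with the standard library alone. The helper names `build_tts_request` and `synthesize` are hypothetical (they are not part of any official client), and the length check assumes the API counts characters the same way Python's `len` does.

```python
import json
import os
import urllib.request

API_URL = "https://api.kalpalabs.ai/v1/tts"  # endpoint from the curl example


def build_tts_request(text, speaker="0", model=None, params=None, api_key=None):
    """Build (but do not send) a POST /v1/tts request. Hypothetical helper."""
    # Assumption: the 1 to 8,000 character limit matches Python's len().
    if not 1 <= len(text) <= 8000:
        raise ValueError("text must be 1 to 8,000 characters")
    body = {"text": text, "speaker": speaker}
    if model is not None:
        body["model"] = model  # omit to use the default model
    if params:
        body["params"] = params  # every sampling knob is optional
    key = api_key or os.environ.get("KALPA_API_KEY", "")
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(text, **kwargs):
    """Send the request and return the parsed JSON reply."""
    req = build_tts_request(text, **kwargs)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Calling `synthesize("Hey there!", params={"temperature": 0.7, "top_k": 40})` would then return the JSON reply as a Python dict. Building the request separately from sending it keeps the payload easy to inspect without touching the network.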

## The audio you get back

```json
{
  "request_id": "…",
  "model": "kalpa-conversational-v1",
  "text": "Hey there! How are you doing today?",
  "audio": {
    "format": "wav",
    "sample_rate": 24000,
    "num_quantizers": 32,
    "data_b64": "UklGRi…"
  },
  "usage": { "input_chars": 35, "input_audio_seconds": 0.0, "output_audio_seconds": 2.6 }
}
```

`data_b64` is a complete 16-bit PCM WAV file (mono, 24 kHz), base64-encoded — decode it and play:

```python
import base64

# `reply` is the parsed JSON body returned by POST /v1/tts
with open("out.wav", "wb") as f:
    f.write(base64.b64decode(reply["audio"]["data_b64"]))
```

`num_quantizers` is how many RVQ levels were decoded into the waveform (32 = full fidelity; see the `quantizers` knob).
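
To confirm that a decoded payload matches the format described above (mono, 16-bit, 24 kHz), a minimal sketch using Python's standard `wave` module might look like this. `describe_wav` is a hypothetical helper name; it only reads the WAV header and frame count.

```python
import base64
import io
import wave


def describe_wav(data_b64):
    """Decode a base64 WAV payload and report its basic properties."""
    raw = base64.b64decode(data_b64)
    with wave.open(io.BytesIO(raw), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),           # 1 for mono
            "sample_width_bytes": wav.getsampwidth(),  # 2 for 16-bit PCM
            "sample_rate": rate,
            "seconds": frames / rate,
        }
```

Passing `reply["audio"]["data_b64"]` should report one channel, a sample width of 2 bytes, and a sample rate equal to `audio.sample_rate`; the `seconds` value can be compared against `usage.output_audio_seconds`.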

## Generation parameters

All fields of `params` are optional; the defaults are the tuned starting point. The same knobs (with the live defaults and UI ranges) are served at `GET /v1/info`.

| Param | Default | Range | What it does |
|---|---|---|---|
| `temperature` | `0.7` | 0 – 1.5 | Backbone sampling temperature (text + semantic audio). `0` = greedy. |
| `depth_temperature` | `null` | 0 – 1.5 | Acoustic-codes temperature. `null` = follows `temperature`. |
| `top_k` | `null` | ≥ 1 | Sample the backbone from the k most-likely tokens. `null` = full vocabulary; `40` is a good setting at nonzero temperatures. |
| `repetition_penalty` | `3.0` | 0 – 6 | Audio repetition penalty; prevents collapse to silence. |
| `penalty_window` | `20` | 1 – 80 | Frames of history the repetition penalty looks back over. |
| `max_new_tokens` | `512` | 16 – 2048 | Generation cap. At ~12.5 audio frames per second of speech, the default caps a reply at roughly 41 s (512 / 12.5). |
| `quantizers` | `null` | 8 / 16 / 32 | Decode only the first N RVQ levels (smaller, lower-fidelity audio). `null` = full depth. |
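
A client can check a `params` object against the table before sending it. The sketch below copies the ranges from the table by hand; `check_params` is a hypothetical helper, and whether the server enforces these same bounds is an assumption (the live values are served at `GET /v1/info`).

```python
# (low, high) bounds copied from the table above; None means no upper bound.
RANGES = {
    "temperature": (0, 1.5),
    "depth_temperature": (0, 1.5),
    "top_k": (1, None),
    "repetition_penalty": (0, 6),
    "penalty_window": (1, 80),
    "max_new_tokens": (16, 2048),
}
# Parameters whose documented default is null.
NULLABLE = {"depth_temperature", "top_k", "quantizers"}
QUANTIZER_CHOICES = (8, 16, 32)


def check_params(params):
    """Return a list of problems found in a params dict (empty if none)."""
    problems = []
    for name, value in params.items():
        if value is None:
            if name not in NULLABLE:
                problems.append(f"{name}: null is not a documented value")
            continue
        if name == "quantizers":
            if value not in QUANTIZER_CHOICES:
                problems.append(f"quantizers: expected one of {QUANTIZER_CHOICES}")
            continue
        if name not in RANGES:
            problems.append(f"{name}: unknown parameter")
            continue
        low, high = RANGES[name]
        if value < low or (high is not None and value > high):
            problems.append(f"{name}: {value} is outside the documented range")
    return problems
```

For example, `check_params({"temperature": 0.7, "top_k": 40})` returns an empty list, while `check_params({"temperature": 2.0})` returns one problem string.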

Two practical notes:

- For reproducible output, set `temperature` to `0` and leave `depth_temperature` at `null` so it follows. With greedy decoding, the same request returns the same speech.
- If long generations trail into silence or repeat, raise `repetition_penalty` or lower `max_new_tokens` before touching anything else.
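
When lowering `max_new_tokens`, the frame-rate figure from the table (about 12.5 frames per second of speech) gives a rough way to pick a cap for a target duration. This sketch assumes one generated token corresponds to one audio frame, as the table's arithmetic implies; the helper names are hypothetical.

```python
import math

FRAMES_PER_SECOND = 12.5  # approximate figure from the max_new_tokens row


def approx_seconds(max_new_tokens):
    """Rough upper bound on reply length, in seconds, for a given cap."""
    return max_new_tokens / FRAMES_PER_SECOND


def tokens_for_seconds(seconds, low=16, high=2048):
    """Smallest cap covering `seconds`, clamped to the documented range."""
    tokens = math.ceil(seconds * FRAMES_PER_SECOND)
    return max(low, min(high, tokens))
```

With these, `approx_seconds(512)` gives about 40.96 and `tokens_for_seconds(10)` gives 125, so a request meant to produce at most ten seconds of speech could pass `"max_new_tokens": 125`.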
