# Text-to-speech

> Turn text into speech with POST /v1/tts, and tune generation with the sampling knobs.

`POST /v1/tts` renders text as speech in a single call:

```bash
curl -s https://api.kalpalabs.ai/v1/tts \
  -H "Authorization: Bearer $KALPA_API_KEY" -H 'Content-Type: application/json' \
  -d '{
    "text": "Hey there! How are you doing today?",
    "speaker": "0",
    "model": "kalpa-conversational-v1",
    "params": {"temperature": 0.7, "top_k": 40}
  }'
```

| Field | Notes |
|---|---|
| `text` | 1 – 8,000 characters. |
| `speaker` | A role label from the model's `speakers` (defaults to `"0"`). Labels are model-specific; see [Models](/models). |
| `model` | A public model id; omit for the default. |
| `params` | Sampling knobs, all optional; see below. |

Under the hood, TTS is sugar for a one-turn conversation: it is exactly [`/v1/converse`](/conversations) with `[{"speaker": "0", "text": "…"}]` as the whole conversation. Use `converse` directly when you have prior turns to condition on.

## The audio you get back

```json
{
  "request_id": "…",
  "model": "kalpa-conversational-v1",
  "text": "Hey there! How are you doing today?",
  "audio": {
    "format": "wav",
    "sample_rate": 24000,
    "num_quantizers": 32,
    "data_b64": "UklGRi…"
  },
  "usage": {
    "input_chars": 35,
    "input_audio_seconds": 0.0,
    "output_audio_seconds": 2.6
  }
}
```

`data_b64` is a complete 16-bit PCM WAV file (mono, 24 kHz), base64-encoded. Decode it and write it to disk to play it:

```python
import base64

# `reply` is the parsed JSON body of the /v1/tts response shown above.
with open("out.wav", "wb") as f:
    f.write(base64.b64decode(reply["audio"]["data_b64"]))
```

`num_quantizers` is how many RVQ levels were decoded into the waveform (32 = full fidelity; see the `quantizers` knob).

## Generation parameters

All fields of `params` are optional; the defaults are the tuned starting point. The same knobs (with the live defaults and UI ranges) are served at `GET /v1/info`.

| Param | Default | Range | What it does |
|---|---|---|---|
| `temperature` | `0.7` | 0 – 1.5 | Backbone sampling temperature (text + semantic audio). `0` = greedy. |
| `depth_temperature` | `null` | 0 – 1.5 | Acoustic-codes temperature. `null` = follows `temperature`. |
| `top_k` | `null` | ≥ 1 | Sample the backbone from the k most-likely tokens. `null` = full vocabulary; `40` is a good setting at nonzero temperatures. |
| `repetition_penalty` | `3.0` | 0 – 6 | Audio repetition penalty; prevents collapse to silence. |
| `penalty_window` | `20` | 1 – 80 | Frames of history the repetition penalty looks back over. |
| `max_new_tokens` | `512` | 16 – 2048 | Generation cap. ~12.5 audio frames ≈ 1 s of speech, so the default caps a reply at roughly 40 s. |
| `quantizers` | `null` | 8 / 16 / 32 | Decode only the first N RVQ levels (smaller, lower-fidelity audio). `null` = full depth. |

Two practical notes:

- For reproducible output, set `temperature` to `0`: with greedy decoding, the same request returns the same speech.
- If long generations trail into silence or repeat, raise `repetition_penalty` or lower `max_new_tokens` before touching anything else.
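
The request shape, parameter ranges, and audio format described on this page can be tied together in a small client-side helper. The following is a minimal sketch using only the Python standard library, not an official client: the function names (`check_params`, `build_tts_request`, `max_seconds`, `wav_duration`) are hypothetical, the client-side range checks simply mirror the table above (the server may validate differently), and `max_seconds` assumes one generated token per audio frame, as the table's 40 s estimate implies.

```python
import base64
import io
import json
import urllib.request
import wave

API_URL = "https://api.kalpalabs.ai/v1/tts"  # endpoint from the curl example

# (low, high) ranges copied from the parameter table; None = no upper bound.
PARAM_RANGES = {
    "temperature": (0, 1.5),
    "depth_temperature": (0, 1.5),
    "top_k": (1, None),
    "repetition_penalty": (0, 6),
    "penalty_window": (1, 80),
    "max_new_tokens": (16, 2048),
}
FRAMES_PER_SECOND = 12.5  # "~12.5 audio frames ≈ 1 s of speech"


def check_params(params):
    """Raise ValueError if a knob falls outside its documented range."""
    for name, value in params.items():
        if value is None:  # assumption: null means "let the server decide"
            continue
        if name == "quantizers":
            if value not in (8, 16, 32):
                raise ValueError("quantizers must be 8, 16, or 32")
            continue
        if name not in PARAM_RANGES:
            raise ValueError(f"unknown param: {name}")
        low, high = PARAM_RANGES[name]
        if value < low or (high is not None and value > high):
            raise ValueError(f"{name}={value} is outside the documented range")


def build_tts_request(text, api_key, speaker="0", model=None, params=None):
    """Build (but do not send) a POST /v1/tts request."""
    if not 1 <= len(text) <= 8000:
        raise ValueError("text must be 1 to 8,000 characters")
    body = {"text": text, "speaker": speaker}
    if model is not None:  # omit the field to get the default model
        body["model"] = model
    if params:
        check_params(params)
        body["params"] = params
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def max_seconds(max_new_tokens=512):
    """Upper bound on reply length, assuming one token per audio frame."""
    return max_new_tokens / FRAMES_PER_SECOND


def wav_duration(reply):
    """Decode data_b64 and return the clip length in seconds."""
    raw = base64.b64decode(reply["audio"]["data_b64"])
    with wave.open(io.BytesIO(raw), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


# Sending is left to the caller, for example:
#   with urllib.request.urlopen(build_tts_request("Hi", api_key)) as resp:
#       reply = json.load(resp)
```

Building the request separately from sending it keeps the validation testable offline, and `wav_duration(reply)` gives a quick cross-check against `usage.output_audio_seconds` in the response.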