Quickstart
From API key to spoken audio in five minutes: one TTS call, then a conversation.
1. Get a key
Keys are provisioned per team while the API is in early access; to request one, write to [email protected]. Keep the key server-side and export it where your code runs:
export KALPA_API_KEY=...
2. Say something
POST /v1/tts takes text and returns spoken audio as base64 WAV. Decode it and you have a playable file:
curl -s https://api.kalpalabs.ai/v1/tts \
-H "Authorization: Bearer $KALPA_API_KEY" -H 'Content-Type: application/json' \
-d '{"text": "Hey there! How are you doing today?", "speaker": "0"}' \
| python3 -c 'import sys,json,base64; r=json.load(sys.stdin); open("out.wav","wb").write(base64.b64decode(r["audio"]["data_b64"])); print("wrote out.wav", r["usage"])'
out.wav is mono 16-bit PCM at 24 kHz. The same call in Python:
import base64, requests, os
r = requests.post(
    "https://api.kalpalabs.ai/v1/tts",
    headers={"Authorization": f"Bearer {os.environ['KALPA_API_KEY']}"},
    json={"text": "Hey there! How are you doing today?", "speaker": "0"},
)
r.raise_for_status()
reply = r.json()
with open("out.wav", "wb") as f:
    f.write(base64.b64decode(reply["audio"]["data_b64"]))
print(reply["usage"])  # characters in, seconds of audio out
3. Have a conversation
POST /v1/converse completes the last turn of a conversation. Here the last turn carries only a speaker and no text, so the model authors it: it writes what speaker "1" would say next and voices it.
curl -s https://api.kalpalabs.ai/v1/converse \
-H "Authorization: Bearer $KALPA_API_KEY" -H 'Content-Type: application/json' \
-d '{
  "conversation": [
    {"speaker": "0", "text": "Hi, who are you?"},
    {"speaker": "1"}
  ]
}'

{
  "request_id": "…",
  "model": "kalpa-conversational-v1",
  "reply": {
    "speaker": "1",
    "text": "I'm a speech model built by Kalpa Labs…",
    "audio": { "format": "wav", "sample_rate": 24000, "num_quantizers": 32, "data_b64": "…" }
  },
  "usage": { "input_chars": 16, "input_audio_seconds": 0.0, "output_audio_seconds": 3.2 }
}
Give the last turn a text instead and the model renders exactly that text in context: this is contextual TTS. The full semantics (spoken history, reference audio, speaker labels) are in Conversations.
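The converse call can also be made from Python. What follows is a minimal sketch, not an official client: the helper names build_payload, converse and save_reply_audio are hypothetical, and it uses the standard library's urllib instead of requests so that it runs without extra dependencies. It assumes the reply audio is PCM WAV like the TTS output (the response above only states "format": "wav" and the sample rate), which is what lets the wave module read it back.

```python
import base64
import io
import json
import os
import urllib.request
import wave

API_URL = "https://api.kalpalabs.ai/v1/converse"


def build_payload(history, next_speaker, text=None):
    """Append the last turn to the history.

    With text=None the turn carries only a speaker, so the model authors it.
    With text set, the model renders exactly that text (contextual TTS).
    """
    last_turn = {"speaker": next_speaker}
    if text is not None:
        last_turn["text"] = text
    return {"conversation": [*history, last_turn]}


def converse(payload):
    """POST the payload to /v1/converse and return the parsed JSON response."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ['KALPA_API_KEY']}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)


def save_reply_audio(response, path):
    """Decode reply.audio.data_b64, write it to path, and return
    (channels, bits per sample, sample rate) as read from the WAV header.

    Assumption: the audio is PCM WAV; wave.open raises wave.Error otherwise.
    """
    wav_bytes = base64.b64decode(response["reply"]["audio"]["data_b64"])
    with open(path, "wb") as f:
        f.write(wav_bytes)
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        return w.getnchannels(), w.getsampwidth() * 8, w.getframerate()


if __name__ == "__main__":
    history = [{"speaker": "0", "text": "Hi, who are you?"}]
    response = converse(build_payload(history, "1"))
    print(response["reply"]["text"])
    print(save_reply_audio(response, "reply.wav"))
```

Passing text="..." to build_payload switches the same request to contextual TTS. Keeping payload construction and audio decoding separate from the network call means both can be checked without an API key.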
Next
- Conversations — the open-turn model, audio history, contextual TTS.
- Text-to-speech — the generation knobs and what they do.
- Models — pick a model per request.
- API reference — every field, generated from openapi.json.