Quickstart

From API key to spoken audio in five minutes: one TTS call, then a conversation.

1. Get a key

Keys are provisioned per team while the API is in early access; write to [email protected] to request one. Keep the key server-side and export it where your code runs:

bash
export KALPA_API_KEY=...
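
Since every request below reads this variable, it can help to fail early with a clear message when it is missing. A minimal sketch; the helper names `load_api_key` and `auth_headers` are illustrative and not part of any Kalpa SDK:

```python
import os


def load_api_key(env=None):
    """Return the API key from the environment, or raise a clear error.

    `env` defaults to os.environ; passing a dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    key = env.get("KALPA_API_KEY", "").strip()
    if not key:
        raise RuntimeError("KALPA_API_KEY is not set; export it before running")
    return key


def auth_headers(key):
    """Build the Authorization header used by the requests in this guide."""
    return {"Authorization": f"Bearer {key}"}
```

Calling `auth_headers(load_api_key())` once at startup keeps the key lookup in one place.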

2. Say something

POST /v1/tts takes text and returns spoken audio as base64 WAV. Decode it and you have a playable file:

bash
curl -s https://api.kalpalabs.ai/v1/tts \
  -H "Authorization: Bearer $KALPA_API_KEY" -H 'Content-Type: application/json' \
  -d '{"text": "Hey there! How are you doing today?", "speaker": "0"}' \
  | python3 -c 'import sys,json,base64; r=json.load(sys.stdin); open("out.wav","wb").write(base64.b64decode(r["audio"]["data_b64"])); print("wrote out.wav", r["usage"])'

out.wav is mono 16-bit PCM at 24 kHz. The same call in Python:

python
import base64
import os

import requests

r = requests.post(
    "https://api.kalpalabs.ai/v1/tts",
    headers={"Authorization": f"Bearer {os.environ['KALPA_API_KEY']}"},
    json={"text": "Hey there! How are you doing today?", "speaker": "0"},
    timeout=60,  # avoid hanging indefinitely if the server does not respond
)
r.raise_for_status()  # surface HTTP errors before reading the body
reply = r.json()
with open("out.wav", "wb") as f:
    f.write(base64.b64decode(reply["audio"]["data_b64"]))
print(reply["usage"])  # characters in, seconds of audio out
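
To confirm that the decoded audio matches the stated format (mono, 16-bit PCM, 24 kHz), the WAV header can be inspected with Python's standard `wave` module. This is a sketch for local checking only; `describe_wav` is a hypothetical helper name, and it takes the decoded bytes rather than a file path:

```python
import io
import wave


def describe_wav(data: bytes) -> dict:
    """Read the WAV header from raw bytes and report the basic format."""
    with wave.open(io.BytesIO(data), "rb") as w:
        rate = w.getframerate()
        return {
            "channels": w.getnchannels(),
            "bits_per_sample": w.getsampwidth() * 8,
            "sample_rate": rate,
            "seconds": w.getnframes() / rate,  # duration from the frame count
        }
```

Passing the result of `base64.b64decode(reply["audio"]["data_b64"])` should report one channel, 16 bits per sample, and a sample rate of 24000 if the audio matches the description above.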

3. Have a conversation

POST /v1/converse completes the last turn of a conversation. In this example the last turn carries only a speaker, so the model authors it: it writes what speaker "1" would say next and voices it.

bash
curl -s https://api.kalpalabs.ai/v1/converse \
  -H "Authorization: Bearer $KALPA_API_KEY" -H 'Content-Type: application/json' \
  -d '{
    "conversation": [
      {"speaker": "0", "text": "Hi, who are you?"},
      {"speaker": "1"}
    ]
  }'

The response returns the authored text along with its audio:

json
{
  "request_id": "…",
  "model": "kalpa-conversational-v1",
  "reply": {
    "speaker": "1",
    "text": "I'm a speech model built by Kalpa Labs…",
    "audio": { "format": "wav", "sample_rate": 24000, "num_quantizers": 32, "data_b64": "…" }
  },
  "usage": { "input_chars": 16, "input_audio_seconds": 0.0, "output_audio_seconds": 3.2 }
}
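
On the client side, the reply can be unpacked in a few lines. The sketch below assumes only the response shape shown above (`reply.speaker`, `reply.text`, `reply.audio.format`, `reply.audio.data_b64`); `decode_reply` is a hypothetical helper, not an SDK function:

```python
import base64


def decode_reply(response: dict) -> tuple[str, str, bytes]:
    """Return (speaker, text, wav_bytes) from a /v1/converse response dict."""
    reply = response["reply"]
    audio = reply["audio"]
    # This helper only knows how to hand back WAV bytes, so reject anything else.
    if audio.get("format") != "wav":
        raise ValueError(f"unexpected audio format: {audio.get('format')!r}")
    return reply["speaker"], reply["text"], base64.b64decode(audio["data_b64"])
```

The returned bytes can be written to a file opened in `"wb"` mode, exactly as in the TTS example.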

Give the last turn a text instead, and the model renders exactly that text in context: contextual TTS. The full semantics (spoken history, reference audio, speaker labels) are in Conversations.
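
The two modes differ only in whether the final turn includes `text`. A small sketch of building either request body; `build_conversation` is an illustrative helper name, and only the `conversation`, `speaker`, and `text` fields shown in this guide are assumed:

```python
def build_conversation(history, speaker, text=None):
    """Return a /v1/converse request body.

    history: list of (speaker, text) pairs for the earlier turns.
    speaker: label for the final turn.
    text:    if given, the final turn carries this text (contextual TTS);
             if omitted, the final turn carries only the speaker, so the
             model authors the reply itself.
    """
    turns = [{"speaker": s, "text": t} for s, t in history]
    last = {"speaker": speaker}
    if text is not None:
        last["text"] = text
    turns.append(last)
    return {"conversation": turns}
```

Either body can be sent as the `json=` argument of `requests.post`, in the same way as the TTS call in step 2.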

Next