Conversations
POST /v1/converse completes the open turn of a conversation — authored speech, contextual TTS, and spoken history.
POST /v1/converse is the API's core. You send one conversation — a list of turns, oldest first — and the model completes its last turn. There is no separate "target text" or "target speaker" field; the last turn is the request.
A turn has three fields, all combining naturally:
{ "speaker": "0", "text": "…", "audio_wav_b64": "…" }
The rules:
- Every turn except the last is grounded history: it must carry text and/or audio_wav_b64.
- The last turn is the open turn, the one to complete. What it carries decides what the model does.
- A last turn that already has both text and audio has nothing left to generate → 400.
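The rules above can be sketched as a small client-side check. This is a hypothetical helper, not part of the API: the function name and the mode names (author, contextual_tts) are assumptions made for illustration. An open turn that carries audio but no text is not described on this page, so the sketch reports it as unspecified rather than guessing, and rejecting an empty conversation is the sketch's own choice.

```python
def classify_open_turn(conversation):
    """Return what the last (open) turn asks for, following the rules above.

    Raises ValueError for the cases the rules mark as invalid, and for an
    empty conversation (an assumption of this sketch: there is no open turn).
    """
    if not conversation:
        raise ValueError("conversation must contain at least one turn")

    *history, open_turn = conversation

    # Every turn except the last is grounded history: text and/or audio.
    for index, turn in enumerate(history):
        if not (turn.get("text") or turn.get("audio_wav_b64")):
            raise ValueError(
                f"history turn {index} needs text and/or audio_wav_b64"
            )

    has_text = bool(open_turn.get("text"))
    has_audio = bool(open_turn.get("audio_wav_b64"))

    if has_text and has_audio:
        # Nothing left to generate; the API answers 400 for this case.
        raise ValueError("open turn already has both text and audio")
    if has_text:
        return "contextual_tts"  # render exactly this text
    if has_audio:
        return "unspecified"     # not covered by this page
    return "author"              # the model writes and voices the turn
```

Running a check like this before sending a request surfaces a malformed conversation locally instead of as a 400 from the gateway.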
Author the next turn
An open turn with only a speaker asks the model to write and voice it:
curl -s https://api.kalpalabs.ai/v1/converse \
-H "Authorization: Bearer $KALPA_API_KEY" -H 'Content-Type: application/json' \
-d '{
"conversation": [
{"speaker": "0", "text": "Did you end up trying that recipe?"},
{"speaker": "1", "text": "I did — the timing was the hard part."},
{"speaker": "0"}
]
}'
The reply carries the authored text and its audio:
{
"reply": {
"speaker": "0",
"text": "Same here. Did you keep the flame low like she said?",
"audio": { "format": "wav", "sample_rate": 24000, "num_quantizers": 32, "data_b64": "…" }
}
}
Contextual TTS: render exact text, in context
An open turn with speaker and text renders exactly that text, conditioned on everything before it — same voice, continued rhythm. This is how you re-voice an edited turn or drive a scripted dialogue:
{
"conversation": [
{"speaker": "0", "text": "Welcome back to the show."},
{"speaker": "1", "text": "Glad to be here."},
{"speaker": "0", "text": "Let's pick up where we left off — episode twelve."}
]
}
The model speaks the last line as speaker "0", sounding like the same person who said the first.
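A minimal Python sketch of re-voicing an edited turn, using only the standard library. The endpoint, headers, and request body mirror the curl example earlier on this page. The helper names are hypothetical, and the sketch assumes that a contextual TTS reply has the same shape as the authored reply shown above (reply.audio.data_b64) and that the decoded bytes form a complete WAV file.

```python
import base64
import json
import urllib.request

API_URL = "https://api.kalpalabs.ai/v1/converse"  # from the curl example


def revoice_last_turn(history, speaker, new_text):
    """Return a conversation whose open turn carries the exact text to render."""
    return [*history, {"speaker": speaker, "text": new_text}]


def build_converse_request(conversation, api_key):
    """Build (but do not send) a POST /v1/converse request."""
    body = json.dumps({"conversation": conversation}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def converse(conversation, api_key):
    """Send the request and return the parsed JSON response (network call).

    urllib raises urllib.error.HTTPError on non-2xx responses such as 400.
    """
    request = build_converse_request(conversation, api_key)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def reply_audio_bytes(response_json):
    """Decode reply.audio.data_b64 (assumed response shape) into raw bytes."""
    return base64.b64decode(response_json["reply"]["audio"]["data_b64"])
```

To use it, build the conversation with revoice_last_turn(history, "0", "Let's pick up where we left off."), pass it to converse together with the key from KALPA_API_KEY, and write the result of reply_audio_bytes to a .wav file.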
Spoken history and reference audio
Any history turn may carry audio_wav_b64: base64-encoded 16-bit PCM WAV (a data: URI prefix is accepted), up to 25 MiB decoded per turn. Two uses:
- Spoken history: pass the actual audio of earlier turns (yours or previous API replies) so the model hears the conversation instead of just reading it.
- Reference voice: open a conversation with a turn containing a short clip of a voice, then have that speaker complete the open turn; the model continues in that voice.
usage.input_audio_seconds meters the audio you send; see Usage.
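Preparing a history turn with audio can be sketched as follows. The helper name is hypothetical. The sketch checks the two constraints stated above (16-bit PCM, at most 25 MiB decoded) with Python's standard wave module, and also returns a locally computed duration. That duration is only an estimate: how the service computes usage.input_audio_seconds is not described here.

```python
import base64
import io
import wave

MAX_AUDIO_BYTES = 25 * 1024 * 1024  # 25 MiB decoded, per turn


def wav_to_turn(speaker, wav_bytes, text=None):
    """Build a turn carrying audio_wav_b64 from the bytes of a WAV file.

    Returns (turn, seconds), where seconds is a local duration estimate.
    """
    if len(wav_bytes) > MAX_AUDIO_BYTES:
        raise ValueError("audio exceeds 25 MiB decoded")

    # wave.open raises wave.Error if the bytes are not a readable PCM WAV.
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        if wav.getsampwidth() != 2:  # 2 bytes per sample = 16-bit
            raise ValueError("expected 16-bit PCM WAV")
        seconds = wav.getnframes() / wav.getframerate()

    turn = {
        "speaker": speaker,
        "audio_wav_b64": base64.b64encode(wav_bytes).decode("ascii"),
    }
    if text is not None:
        turn["text"] = text
    return turn, seconds
```

For a reference voice, the turn returned for a short clip would open the conversation, followed by an open turn for the same speaker, for example [reference_turn, {"speaker": "0"}].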
Speaker labels
speaker values are positional role labels, not names — the labels the model was trained on, listed per model in GET /v1/models (the current conversational models use "0" and "1", in turn order). Two things matter:
- Use only the model's advertised labels. Anything else (a name, speaker_0, …) degrades output badly.
- Keep a label bound to one voice within a conversation: "0" is whoever spoke first, "1" the other party.
Limits
| Cap | Value |
|---|---|
| Turns per conversation | 64 |
| Text per turn | 8,000 characters |
| Audio per turn | 25 MiB decoded WAV |
The gateway rejects anything over these caps with 400 invalid_request — see Rate limits & errors.
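The caps in the table can be checked client-side before sending. The sketch below is a hypothetical pre-flight check, not the gateway's own validation. It assumes that text length is counted in Unicode code points (what Python's len returns) and that the turn cap counts every turn, including the open one; the gateway may count differently.

```python
import base64

MAX_TURNS = 64
MAX_TEXT_CHARS = 8_000
MAX_AUDIO_BYTES = 25 * 1024 * 1024  # 25 MiB decoded WAV


def decoded_size(audio_b64):
    """Size in bytes of the decoded audio, ignoring an optional data: URI prefix."""
    if audio_b64.startswith("data:"):
        # Keep only the payload after the first comma of the data: URI.
        _, _, audio_b64 = audio_b64.partition(",")
    return len(base64.b64decode(audio_b64))


def limit_violations(conversation):
    """Return a list of human-readable problems; empty if within the caps."""
    problems = []
    if len(conversation) > MAX_TURNS:
        problems.append(f"{len(conversation)} turns exceeds {MAX_TURNS}")
    for index, turn in enumerate(conversation):
        text = turn.get("text") or ""
        if len(text) > MAX_TEXT_CHARS:
            problems.append(f"turn {index}: text exceeds {MAX_TEXT_CHARS} characters")
        audio = turn.get("audio_wav_b64")
        if audio and decoded_size(audio) > MAX_AUDIO_BYTES:
            problems.append(f"turn {index}: audio exceeds 25 MiB decoded")
    return problems
```

An empty list means the conversation is within the documented caps; anything else would likely come back as 400 invalid_request.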