Conversations

POST /v1/converse completes the open turn of a conversation. It covers three uses: authored speech, contextual TTS, and spoken history.

POST /v1/converse is the API's core. You send one conversation — a list of turns, oldest first — and the model completes its last turn. There is no separate "target text" or "target speaker" field; the last turn is the request.

A turn has three fields, which can be combined as the rules below allow:

json

{ "speaker": "0", "text": "…", "audio_wav_b64": "…" }

The rules:

Every turn except the last is grounded history: it must carry text and/or audio_wav_b64.
The last turn is the open turn — the one to complete. What it carries decides what the model does.
A last turn that already has both text and audio leaves nothing to generate, so the request fails with 400.
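The rules above can be checked on the client before sending. The following is a minimal sketch, not part of the API: the function name open_turn_mode and the mode strings are hypothetical, rejecting an empty conversation is a local choice, and the "unspecified" result marks a case (an open turn with audio but no text) that this guide does not describe.

```python
def open_turn_mode(conversation):
    """Classify what the last (open) turn asks for, following the rules above.

    The name and the returned strings are illustrative, not API values.
    """
    if not conversation:
        # Local choice: without a last turn there is nothing to complete.
        raise ValueError("conversation must contain at least one turn")
    *history, last = conversation
    for i, turn in enumerate(history):
        # Every turn except the last must carry text and/or audio.
        if not (turn.get("text") or turn.get("audio_wav_b64")):
            raise ValueError(f"history turn {i} needs text and/or audio_wav_b64")
    has_text = bool(last.get("text"))
    has_audio = bool(last.get("audio_wav_b64"))
    if has_text and has_audio:
        # Per the guide, the API answers 400 for this shape.
        raise ValueError("open turn already has both text and audio")
    if has_text:
        return "contextual_tts"  # render exactly this text
    if has_audio:
        return "unspecified"     # not covered by this guide
    return "author"              # model writes and voices the turn
```

Running a check like this first turns a 400 round trip into a local error with a more specific message.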

Author the next turn

An open turn with only a speaker asks the model to write and voice it:

bash

curl -s https://api.kalpalabs.ai/v1/converse \
  -H "Authorization: Bearer $KALPA_API_KEY" -H 'Content-Type: application/json' \
  -d '{
    "conversation": [
      {"speaker": "0", "text": "Did you end up trying that recipe?"},
      {"speaker": "1", "text": "I did — the timing was the hard part."},
      {"speaker": "0"}
    ]
  }'

The reply carries the authored text and its audio:

json

{
  "reply": {
    "speaker": "0",
    "text": "Same here. Did you keep the flame low like she said?",
    "audio": { "format": "wav", "sample_rate": 24000, "num_quantizers": 32, "data_b64": "…" }
  }
}
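For callers working in Python rather than curl, the same request and the handling of the reply might look like the sketch below. It uses only the standard library. The helper names build_request and reply_audio_bytes are hypothetical, and the sketch assumes that data_b64 is standard base64 of a complete WAV file, which the example response suggests but does not state.

```python
import base64
import json
import urllib.request

API_URL = "https://api.kalpalabs.ai/v1/converse"  # endpoint from the curl example

def build_request(conversation, api_key):
    """Build (but do not send) the POST request shown in the curl example."""
    body = json.dumps({"conversation": conversation}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

def reply_audio_bytes(response):
    """Return the decoded bytes of reply.audio.data_b64.

    Assumption: data_b64 is standard base64 of a complete WAV file.
    """
    audio = response["reply"]["audio"]
    if audio["format"] != "wav":
        raise ValueError(f"unexpected audio format: {audio['format']!r}")
    return base64.b64decode(audio["data_b64"])
```

Sending the request would then be a call such as json.load(urllib.request.urlopen(req)), and the returned bytes can be written to a .wav file.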

Contextual TTS: render exact text, in context

An open turn with speaker and text renders exactly that text, conditioned on everything before it — same voice, continued rhythm. This is how you re-voice an edited turn or drive a scripted dialogue:

json

{
  "conversation": [
    {"speaker": "0", "text": "Welcome back to the show."},
    {"speaker": "1", "text": "Glad to be here."},
    {"speaker": "0", "text": "Let's pick up where we left off — episode twelve."}
  ]
}

The model speaks the last line as speaker "0", sounding like the same person who said the first.
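Re-voicing an edited turn amounts to cutting the conversation at that turn and resubmitting it with the new text as the open turn. A minimal sketch, assuming the conversation is held as a list of turn dicts; the helper name revoice_request is hypothetical:

```python
def revoice_request(conversation, index, new_text):
    """Build a request body that re-renders turn `index` with `new_text`.

    Turns before `index` are kept as history; turns after it are dropped,
    because the open turn must be the last one.
    """
    history = [dict(turn) for turn in conversation[:index]]  # shallow copies
    open_turn = {"speaker": conversation[index]["speaker"], "text": new_text}
    return {"conversation": history + [open_turn]}
```

The open turn carries only speaker and text, so any audio stored on the original turn is left out and the request stays valid.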

Spoken history and reference audio

Any history turn may carry audio_wav_b64: a base64-encoded 16-bit PCM WAV file (a data: URI prefix is accepted), up to 25 MiB decoded per turn. It has two uses:

Spoken history: pass the actual audio of earlier turns (yours or previous API replies) so the model hears the conversation instead of just reading it.
Reference voice: open a conversation with a turn containing a short clip of a voice, then have that speaker complete the open turn — the model continues in that voice.

usage.input_audio_seconds meters the audio you send; see Usage.
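Preparing audio for a history turn could be sketched as follows. The helper name wav_turn is hypothetical. The sketch uses Python's standard wave module to confirm the clip is 16-bit PCM, and it applies the 25 MiB cap to the WAV file bytes, which is an assumption about how the decoded size is measured.

```python
import base64
import io
import wave

MAX_AUDIO_BYTES = 25 * 1024 * 1024  # 25 MiB decoded, per this guide

def wav_turn(speaker, wav_bytes, text=None):
    """Build a history turn carrying WAV audio as base64."""
    if len(wav_bytes) > MAX_AUDIO_BYTES:
        raise ValueError("audio exceeds 25 MiB")
    # wave.open raises wave.Error for data it cannot parse as PCM WAV.
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        if wav.getsampwidth() != 2:  # 2 bytes per sample = 16-bit
            raise ValueError("expected 16-bit PCM samples")
    turn = {
        "speaker": speaker,
        "audio_wav_b64": base64.b64encode(wav_bytes).decode("ascii"),
    }
    if text is not None:
        turn["text"] = text
    return turn
```

A reference-voice conversation would then start with something like wav_turn("0", clip_bytes) followed by the open turn {"speaker": "0"}.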

Speaker labels

speaker values are positional role labels, not names — the labels the model was trained on, listed per model in GET /v1/models (the current conversational models use "0" and "1", in turn order). Two things matter:

Use only the model's advertised labels. Anything else (a name, speaker_0, …) degrades output badly.
Keep a label bound to one voice within a conversation: "0" is whoever spoke first, "1" the other party.
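If an application tracks participants by name, one way to honor both points is to map names to labels in order of first appearance. This is a sketch only: the helper assign_labels is hypothetical, and its default labels ("0", "1") are the ones this guide cites for the current conversational models. In practice the label set should come from GET /v1/models, whose response shape is not shown here.

```python
def assign_labels(names, labels=("0", "1")):
    """Map participant names to positional labels in order of first appearance.

    `names` is the sequence of who speaks each turn, oldest first.
    """
    mapping = {}
    for name in names:
        if name in mapping:
            continue  # a name keeps the label it was first given
        if len(mapping) >= len(labels):
            raise ValueError(f"more participants than labels: {name!r}")
        mapping[name] = labels[len(mapping)]
    return mapping
```

Building each turn's speaker from this mapping keeps application names out of the request and binds every label to a single voice.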

Limits

Cap	Value
Turns per conversation	64
Text per turn	8,000 characters
Audio per turn	25 MiB decoded WAV

The gateway rejects anything over these caps with 400 invalid_request — see Rate limits & errors.
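A client-side pre-check against these caps might look like the sketch below. The function name cap_violations is hypothetical. It counts characters with Python's len(), which counts Unicode code points and may not match how the gateway counts, and it strips a data: URI prefix before measuring the decoded audio.

```python
import base64

MAX_TURNS = 64
MAX_TEXT_CHARS = 8000
MAX_AUDIO_BYTES = 25 * 1024 * 1024  # 25 MiB decoded

def cap_violations(conversation):
    """Return a list of human-readable cap violations (empty if none)."""
    problems = []
    if len(conversation) > MAX_TURNS:
        problems.append(f"{len(conversation)} turns exceeds {MAX_TURNS}")
    for i, turn in enumerate(conversation):
        text = turn.get("text") or ""
        if len(text) > MAX_TEXT_CHARS:
            problems.append(f"turn {i}: text exceeds {MAX_TEXT_CHARS} characters")
        audio = turn.get("audio_wav_b64")
        if audio:
            # Drop an optional "data:...;base64," prefix before decoding.
            payload = audio.partition(",")[2] if audio.startswith("data:") else audio
            if len(base64.b64decode(payload)) > MAX_AUDIO_BYTES:
                problems.append(f"turn {i}: audio exceeds 25 MiB decoded")
    return problems
```

An empty list means the request is within the documented caps; it does not guarantee acceptance, since the gateway applies its own validation.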