# Conversations

> POST /v1/converse completes the open turn of a conversation: authored speech, contextual TTS, and spoken history.

`POST /v1/converse` is the API's core. You send **one conversation** (a list of turns, oldest first) and the model **completes its last turn**. There is no separate "target text" or "target speaker" field; the last turn *is* the request.

A turn has three fields, all combining naturally:

```json
{ "speaker": "0", "text": "…", "audio_wav_b64": "…" }
```

The rules:

- Every turn **except the last** is grounded history: it must carry `text` and/or `audio_wav_b64`.
- The **last turn is the open turn**, the one to complete. What it carries decides what the model does.
- A last turn that already has both text and audio has nothing left to generate → `400`.

## Author the next turn

An open turn with only a `speaker` asks the model to write *and* voice it:

```bash
curl -s https://api.kalpalabs.ai/v1/converse \
  -H "Authorization: Bearer $KALPA_API_KEY" -H 'Content-Type: application/json' \
  -d '{
    "conversation": [
      {"speaker": "0", "text": "Did you end up trying that recipe?"},
      {"speaker": "1", "text": "I did — the timing was the hard part."},
      {"speaker": "0"}
    ]
  }'
```

The reply carries the authored text and its audio:

```json
{
  "reply": {
    "speaker": "0",
    "text": "Same here. Did you keep the flame low like she said?",
    "audio": { "format": "wav", "sample_rate": 24000, "num_quantizers": 32, "data_b64": "…" }
  }
}
```

## Contextual TTS: render exact text, in context

An open turn with `speaker` **and** `text` renders exactly that text, conditioned on everything before it: same voice, continued rhythm.
This is how you re-voice an edited turn or drive a scripted dialogue:

```json
{
  "conversation": [
    {"speaker": "0", "text": "Welcome back to the show."},
    {"speaker": "1", "text": "Glad to be here."},
    {"speaker": "0", "text": "Let's pick up where we left off — episode twelve."}
  ]
}
```

The model speaks the last line as speaker `"0"`, sounding like the same person who said the first.

## Spoken history and reference audio

Any history turn may carry `audio_wav_b64`: base64 16-bit PCM WAV (a `data:` URI prefix is accepted), up to 25 MiB decoded per turn. Two uses:

- **Spoken history**: pass the actual audio of earlier turns (yours or previous API replies) so the model hears the conversation instead of just reading it.
- **Reference voice**: open a conversation with a turn containing a short clip of a voice, then have that `speaker` complete the open turn. The model continues in that voice.

`usage.input_audio_seconds` meters the audio you send; see [Usage](/usage).

## Speaker labels

`speaker` values are **positional role labels, not names**: the labels the model was trained on, listed per model in [`GET /v1/models`](/models) (the current conversational models use `"0"` and `"1"`, in turn order). Two things matter:

- Use only the model's advertised labels. Anything else (a name, `speaker_0`, …) degrades output badly.
- Keep a label bound to one voice within a conversation: `"0"` is whoever spoke first, `"1"` the other party.

## Limits

| Cap | Value |
|---|---|
| Turns per conversation | 64 |
| Text per turn | 8,000 characters |
| Audio per turn | 25 MiB decoded WAV |

The gateway rejects anything over these caps with `400 invalid_request`; see [Rate limits & errors](/rate-limits-and-errors).
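
The turn rules and caps on this page can be checked client-side before a request is sent. The following is a minimal Python sketch, not part of the API or any official SDK: `validate_conversation` and `decoded_audio_size` are hypothetical helper names, the default labels `("0", "1")` are an assumption that should be replaced with whatever `GET /v1/models` advertises for your model, and counting characters with Python's `len` may not match exactly how the gateway counts them. The gateway's own response remains the authority.

```python
import base64

# Caps taken from the Limits table on this page.
MAX_TURNS = 64
MAX_TEXT_CHARS = 8_000
MAX_AUDIO_BYTES = 25 * 1024 * 1024  # 25 MiB, measured after base64 decoding


def decoded_audio_size(audio_b64: str) -> int:
    """Return the decoded byte length, tolerating a data: URI prefix."""
    if audio_b64.startswith("data:"):
        audio_b64 = audio_b64.split(",", 1)[-1]
    return len(base64.b64decode(audio_b64))


def validate_conversation(conversation, labels=("0", "1")):
    """Return a list of problems; an empty list means every check passed.

    `labels` is an assumption: pass the labels your model advertises.
    """
    if not conversation:
        return ["conversation is empty"]
    problems = []
    if len(conversation) > MAX_TURNS:
        problems.append(f"{len(conversation)} turns exceeds the cap of {MAX_TURNS}")
    last = len(conversation) - 1
    for i, turn in enumerate(conversation):
        text = turn.get("text")
        audio = turn.get("audio_wav_b64")
        if turn.get("speaker") not in labels:
            problems.append(f"turn {i}: speaker must be one of {labels}")
        if text is not None and len(text) > MAX_TEXT_CHARS:
            problems.append(f"turn {i}: text exceeds {MAX_TEXT_CHARS} characters")
        if audio is not None:
            try:
                if decoded_audio_size(audio) > MAX_AUDIO_BYTES:
                    problems.append(f"turn {i}: audio exceeds 25 MiB decoded")
            except ValueError:  # binascii.Error is a subclass of ValueError
                problems.append(f"turn {i}: audio_wav_b64 is not valid base64")
        if i < last and not (text or audio):
            problems.append(f"turn {i}: history turn needs text and/or audio")
        if i == last and text and audio:
            problems.append(f"turn {i}: open turn has text and audio, nothing to generate")
    return problems


# Usage: the "author the next turn" request from earlier on this page.
request = {
    "conversation": [
        {"speaker": "0", "text": "Did you end up trying that recipe?"},
        {"speaker": "1", "text": "I did, the timing was the hard part."},
        {"speaker": "0"},
    ]
}
print(validate_conversation(request["conversation"]))  # prints []
```

Returning a list of problems instead of raising on the first one lets a caller report every issue in a single pass. The sketch only mirrors the rules stated here; it does not check that the audio is actually 16-bit PCM WAV.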