# Conversations

> POST /v1/converse completes the open turn of a conversation — authored speech, contextual TTS, and spoken history.

`POST /v1/converse` is the API's core. You send **one conversation** — a list of turns, oldest first — and the model **completes its last turn**. There is no separate "target text" or "target speaker" field; the last turn *is* the request.

A turn has up to three fields, which can be combined:

```json
{ "speaker": "0", "text": "…", "audio_wav_b64": "…" }
```

The rules:

- Every turn **except the last** is grounded history: it must carry `text` and/or `audio_wav_b64`.
- The **last turn is the open turn** — the one to complete. What it carries decides what the model does.
- A last turn that already has both text and audio leaves nothing to generate, so the request fails with `400`.
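
The rules above can be checked client-side before a request is sent. The following is a minimal sketch, not part of the API: the function name `open_turn_mode` and its return values are made up for illustration. Rejecting an empty conversation is an assumption (there would be no last turn to complete), and an open turn carrying only audio is not described on this page, so the sketch flags it rather than guessing.

```python
def open_turn_mode(conversation):
    """Classify what the last (open) turn asks the model to do.

    Returns "author" (speaker only) or "contextual_tts" (speaker + text).
    Raises ValueError for conversations the rules above rule out.
    """
    if not conversation:
        # Assumption: with no turns there is no open turn to complete.
        raise ValueError("conversation must contain at least one turn")

    *history, last = conversation
    for i, turn in enumerate(history):
        # Every turn except the last must carry text, audio, or both.
        if not (turn.get("text") or turn.get("audio_wav_b64")):
            raise ValueError(f"history turn {i} carries neither text nor audio")

    has_text = bool(last.get("text"))
    has_audio = bool(last.get("audio_wav_b64"))
    if has_text and has_audio:
        # Nothing left to generate; the API answers 400 for this case.
        raise ValueError("last turn already has both text and audio")
    if has_text:
        return "contextual_tts"
    if has_audio:
        # Not covered by this page, so it is flagged instead of classified.
        return "undocumented"
    return "author"
```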

## Author the next turn

An open turn with only a `speaker` asks the model to write *and* voice it:

```bash
curl -s https://api.kalpalabs.ai/v1/converse \
  -H "Authorization: Bearer $KALPA_API_KEY" -H 'Content-Type: application/json' \
  -d '{
    "conversation": [
      {"speaker": "0", "text": "Did you end up trying that recipe?"},
      {"speaker": "1", "text": "I did — the timing was the hard part."},
      {"speaker": "0"}
    ]
  }'
```

The reply carries the authored text and its audio:

```json
{
  "reply": {
    "speaker": "0",
    "text": "Same here. Did you keep the flame low like she said?",
    "audio": { "format": "wav", "sample_rate": 24000, "num_quantizers": 32, "data_b64": "…" }
  }
}
```
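
For reference, the same round trip can be sketched in Python with only the standard library. The helper names `build_converse_request` and `parse_reply` are illustrative, not part of any SDK. The URL, headers, and field names are taken from the examples above. What the decoded bytes contain beyond `"format": "wav"` is not specified here, so the sketch returns them untouched.

```python
import base64
import json
import urllib.request

API_URL = "https://api.kalpalabs.ai/v1/converse"


def build_converse_request(conversation, api_key):
    """Build (but do not send) the POST request shown in the curl example."""
    body = json.dumps({"conversation": conversation}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def parse_reply(response_json):
    """Return (speaker, text, audio_bytes) from a decoded response body."""
    reply = response_json["reply"]
    audio_bytes = base64.b64decode(reply["audio"]["data_b64"])
    return reply["speaker"], reply["text"], audio_bytes
```

To send the request, pass it to `urllib.request.urlopen(...)` and feed the parsed JSON body to `parse_reply`.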

## Contextual TTS: render exact text, in context

An open turn with `speaker` **and** `text` renders exactly that text, conditioned on everything before it — same voice, continued rhythm. This is how you re-voice an edited turn or drive a scripted dialogue:

```json
{
  "conversation": [
    {"speaker": "0", "text": "Welcome back to the show."},
    {"speaker": "1", "text": "Glad to be here."},
    {"speaker": "0", "text": "Let's pick up where we left off — episode twelve."}
  ]
}
```

The model speaks the last line as speaker `"0"`, sounding like the same person who said the first.
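
Re-voicing an edited turn follows from the same rule: the turn to render must be last, and it must carry text but no audio. A minimal sketch under those assumptions, with a hypothetical helper name `revoice_turn`, keeps everything before the edited turn as history and drops whatever came after it:

```python
def revoice_turn(conversation, index, new_text):
    """Return a conversation whose open turn re-renders turn `index` with new text.

    Turns before `index` are kept as history (including any audio they carry).
    Turns after `index` are dropped, because the open turn must be the last one.
    """
    edited = conversation[index]
    history = [dict(turn) for turn in conversation[:index]]
    # Keep the speaker, replace the text, and omit audio so the turn is open.
    open_turn = {"speaker": edited["speaker"], "text": new_text}
    return history + [open_turn]
```

Copying the history turns keeps the caller's original list untouched, which matters if the same conversation is reused for several edits.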

## Spoken history and reference audio

Any history turn may carry `audio_wav_b64`: a base64-encoded 16-bit PCM WAV (a `data:` URI prefix is accepted), up to 25 MiB decoded per turn. It has two uses:

- **Spoken history**: pass the actual audio of earlier turns (yours or previous API replies) so the model hears the conversation instead of just reading it.
- **Reference voice**: open a conversation with a turn containing a short clip of a voice, then have that `speaker` complete the open turn — the model continues in that voice.

`usage.input_audio_seconds` meters the audio you send; see [Usage](/usage).
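
As an illustration of the reference-voice pattern, the sketch below encodes a WAV clip and builds a two-turn conversation around it. The helper names are hypothetical. The 16-bit check uses Python's standard `wave` module and is a client-side convenience, not a description of how the gateway validates audio.

```python
import base64
import io
import wave


def wav_to_b64(wav_bytes):
    """Base64-encode a WAV file after checking that it holds 16-bit samples."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        if wav.getsampwidth() != 2:  # sample width is reported in bytes
            raise ValueError("expected 16-bit PCM samples")
    return base64.b64encode(wav_bytes).decode("ascii")


def reference_voice_conversation(clip_b64, speaker="0", text=None):
    """Open with a voice clip, then leave a turn open for the same speaker.

    With `text` the open turn is rendered verbatim in that voice;
    without it the model authors the turn as well.
    """
    open_turn = {"speaker": speaker}
    if text is not None:
        open_turn["text"] = text
    return [{"speaker": speaker, "audio_wav_b64": clip_b64}, open_turn]
```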

## Speaker labels

`speaker` values are **positional role labels, not names**: they are the labels the model was trained on, listed per model in [`GET /v1/models`](/models). The current conversational models use `"0"` and `"1"`, assigned in order of first appearance. Two things matter:

- Use only the model's advertised labels. Anything else (a name, `speaker_0`, …) degrades output badly.
- Keep a label bound to one voice within a conversation: `"0"` is whoever spoke first, `"1"` the other party.
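
Applications usually track speakers by name, so a small mapping step helps satisfy both points. The sketch below is one possible approach, with a hypothetical helper `assign_labels`: it hands out labels in order of first appearance and refuses conversations with more voices than labels. The default labels are the ones quoted above for the current conversational models; other models may advertise different ones.

```python
def assign_labels(turns, labels=("0", "1")):
    """Replace application-side speaker names with model labels.

    The first name seen gets labels[0], the next new name gets labels[1],
    and each name keeps its label for the rest of the conversation.
    """
    mapping = {}
    labelled = []
    for turn in turns:
        name = turn["speaker"]
        if name not in mapping:
            if len(mapping) >= len(labels):
                raise ValueError(f"no label left for speaker {name!r}")
            mapping[name] = labels[len(mapping)]
        # Copy the turn so the caller's data is left unchanged.
        labelled.append({**turn, "speaker": mapping[name]})
    return labelled
```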

## Limits

| Cap | Value |
|---|---|
| Turns per conversation | 64 |
| Text per turn | 8,000 characters |
| Audio per turn | 25 MiB decoded WAV |

The gateway rejects anything over these caps with `400 invalid_request` — see [Rate limits & errors](/rate-limits-and-errors).
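
A pre-flight check can catch oversized requests before they reach the gateway. This is a rough sketch with hypothetical names. It assumes the turn cap counts the open turn, and it measures text with Python's `len()`, which counts Unicode code points; how the gateway counts "characters" is not stated here, so treat the text check as approximate.

```python
import base64

MAX_TURNS = 64
MAX_TEXT_CHARS = 8_000
MAX_AUDIO_BYTES = 25 * 1024 * 1024  # 25 MiB, measured after base64 decoding


def decoded_audio_size(audio_b64):
    """Return the decoded size in bytes, ignoring an optional data: URI prefix."""
    if audio_b64.startswith("data:"):
        # Keep only the payload after the first comma of the data: URI.
        _, _, audio_b64 = audio_b64.partition(",")
    return len(base64.b64decode(audio_b64))


def check_limits(conversation):
    """Return a list of human-readable cap violations (empty if none)."""
    problems = []
    if len(conversation) > MAX_TURNS:
        problems.append(f"{len(conversation)} turns exceeds {MAX_TURNS}")
    for i, turn in enumerate(conversation):
        if len(turn.get("text") or "") > MAX_TEXT_CHARS:
            problems.append(f"turn {i}: text exceeds {MAX_TEXT_CHARS} characters")
        audio = turn.get("audio_wav_b64")
        if audio and decoded_audio_size(audio) > MAX_AUDIO_BYTES:
            problems.append(f"turn {i}: audio exceeds 25 MiB decoded")
    return problems
```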
