# Kalpa Speech API — full documentation

> REST API for Kalpa’s multi-speaker conversational speech models: text-to-speech and conversation completion (the model speaks the open turn of a conversation).


---

# Kalpa Speech API

> Speech models that speak in context — one API for text-to-speech and multi-speaker conversation.

The Kalpa Speech API serves multi-speaker conversational speech models over plain HTTP. It does two things:

- **Text-to-speech** — `POST /v1/tts` turns text into spoken audio.
- **Conversation** — `POST /v1/converse` takes a conversation of turns and **speaks the open (final) turn**: either authoring it outright (the model writes text and voices it) or rendering text you supply, in the voice and rhythm of the conversation so far.

There is one conversation shape and no special cases: TTS is just a one-turn conversation. Everything is JSON; audio crosses the wire as base64-encoded 16-bit PCM WAV (mono, 24 kHz).

```bash
export KALPA_API_KEY=...   # provisioned per team — see Authentication

curl -s https://api.kalpalabs.ai/v1/tts \
  -H "Authorization: Bearer $KALPA_API_KEY" -H 'Content-Type: application/json' \
  -d '{"text": "Hey there! How are you doing today?", "speaker": "0"}'
```

## At a glance

| Endpoint | What it does |
|---|---|
| `POST /v1/tts` | Text in, speech out |
| `POST /v1/converse` | Complete the open turn of a conversation |
| `GET /v1/models` | The public model registry |
| `GET /v1/info` | Backend info, default params, request limits |
| `GET /v1/usage` | Your key's metered usage |
| `GET /health` | Liveness (no auth) |

Base URL: `https://api.kalpalabs.ai`. All `/v1/*` endpoints require a key ([Authentication](/authentication)); every error comes back in one envelope ([Rate limits & errors](/rate-limits-and-errors)).

## Built for agents

These docs assume your first reader may be a model. Every page is plain markdown at a stable URL — append `.md` to any path (this page is [/index.md](/index.md)). The whole site is indexed in [/llms.txt](/llms.txt), concatenated in [/llms-full.txt](/llms-full.txt), and the full contract is machine-readable at [/openapi.json](/openapi.json) — the same committed artifact the [API reference](/reference) and our clients are generated from. Point your agent at any of them.

## Getting access

The API is in early access. Write to [hello@kalpalabs.ai](mailto:hello@kalpalabs.ai) for a key, or try the models interactively first at [studio.kalpalabs.ai](https://studio.kalpalabs.ai).


---

# Quickstart

> From API key to spoken audio in five minutes: one TTS call, then a conversation.

## 1. Get a key

Keys are provisioned per team while the API is in early access — write to [hello@kalpalabs.ai](mailto:hello@kalpalabs.ai). Keep it server-side and export it where your code runs:

```bash
export KALPA_API_KEY=...
```

## 2. Say something

`POST /v1/tts` takes text and returns spoken audio as base64 WAV. Decode it and you have a playable file:

```bash
curl -s https://api.kalpalabs.ai/v1/tts \
  -H "Authorization: Bearer $KALPA_API_KEY" -H 'Content-Type: application/json' \
  -d '{"text": "Hey there! How are you doing today?", "speaker": "0"}' \
  | python3 -c 'import sys,json,base64; r=json.load(sys.stdin); open("out.wav","wb").write(base64.b64decode(r["audio"]["data_b64"])); print("wrote out.wav", r["usage"])'
```

`out.wav` is mono 16-bit PCM at 24 kHz. The same call in Python:

```python
import base64
import os

import requests

r = requests.post(
    "https://api.kalpalabs.ai/v1/tts",
    headers={"Authorization": f"Bearer {os.environ['KALPA_API_KEY']}"},
    json={"text": "Hey there! How are you doing today?", "speaker": "0"},
)
r.raise_for_status()
reply = r.json()
with open("out.wav", "wb") as f:
    f.write(base64.b64decode(reply["audio"]["data_b64"]))
print(reply["usage"])  # characters in, seconds of audio out
```

## 3. Have a conversation

`POST /v1/converse` completes the **last turn** of a conversation. Here the last turn carries only a `speaker` — so the model authors it: it writes what speaker `"1"` would say next and voices it.

```bash
curl -s https://api.kalpalabs.ai/v1/converse \
  -H "Authorization: Bearer $KALPA_API_KEY" -H 'Content-Type: application/json' \
  -d '{
    "conversation": [
      {"speaker": "0", "text": "Hi, who are you?"},
      {"speaker": "1"}
    ]
  }'
```

```json
{
  "request_id": "…",
  "model": "kalpa-conversational-v1",
  "reply": {
    "speaker": "1",
    "text": "I'm a speech model built by Kalpa Labs…",
    "audio": { "format": "wav", "sample_rate": 24000, "num_quantizers": 32, "data_b64": "…" }
  },
  "usage": { "input_chars": 16, "input_audio_seconds": 0.0, "output_audio_seconds": 3.2 }
}
```

Give the last turn a `text` instead and the model renders exactly that text in context — contextual TTS. The full semantics (spoken history, reference audio, speaker labels) are in [Conversations](/conversations).
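The authoring request above can also be sketched in Python using only the standard library. The helper names (`build_request`, `converse`) and the 120-second timeout are illustrative assumptions, not part of the API; only the URL, headers, and body shape are taken from the curl example.

```python
import json
import urllib.request

BASE_URL = "https://api.kalpalabs.ai"


def build_request(path, payload, api_key):
    """Build (but do not send) an authenticated JSON POST request."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def converse(conversation, api_key, timeout=120):
    """Send a conversation and return the parsed JSON response.

    The timeout is an assumed, deliberately generous value.
    """
    req = build_request("/v1/converse", {"conversation": conversation}, api_key)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


# Example call (needs a real key and network access):
# out = converse(
#     [{"speaker": "0", "text": "Hi, who are you?"}, {"speaker": "1"}],
#     os.environ["KALPA_API_KEY"],
# )
# print(out["reply"]["text"], out["usage"])
```

Keeping request construction separate from sending makes the payload easy to inspect or unit-test without touching the network.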

## Next

- [Conversations](/conversations) — the open-turn model, audio history, contextual TTS.
- [Text-to-speech](/text-to-speech) — the generation knobs and what they do.
- [Models](/models) — pick a model per request.
- [API reference](/reference) — every field, generated from [openapi.json](/openapi.json).


---

# Authentication

> Bearer keys, where to keep them, and how to correlate requests when something goes wrong.

Every `/v1/*` endpoint authenticates with an API key. Send it as a bearer token (preferred):

```bash
curl -s https://api.kalpalabs.ai/v1/models \
  -H "Authorization: Bearer $KALPA_API_KEY"
```

or, where headers are awkward to compose, as `X-API-Key`:

```bash
curl -s https://api.kalpalabs.ai/v1/models -H "X-API-Key: $KALPA_API_KEY"
```

A missing or invalid key returns `401` in the standard envelope:

```json
{ "error": { "type": "authentication_error", "message": "Invalid API key.", "request_id": "…" } }
```

## Keys

- Keys are provisioned per team while the API is in early access — write to [hello@kalpalabs.ai](mailto:hello@kalpalabs.ai) to get one, rotate one, or raise its limits.
- Each key carries its own [rate limits](/rate-limits-and-errors) and its own [usage meter](/usage).
- Treat the key like a password: call the API from your servers, not from browsers or shipped apps. If a key leaks, ask us to rotate it.

## Request IDs

Every response carries an `X-Request-ID` header, echoed into error envelopes and our logs. Pass your own `X-Request-ID` header to correlate with your systems, and include the id when reporting a problem.
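A minimal sketch of supplying your own request id follows. The helper name `request_headers` is hypothetical, and using a UUID hex string as the id is an assumption: these docs do not state any format requirement for `X-Request-ID`.

```python
import uuid


def request_headers(api_key, request_id=None):
    """Return headers for an authenticated call with a caller-chosen request id."""
    # Assumption: any opaque string is acceptable as an id; a UUID is used here.
    rid = request_id or uuid.uuid4().hex
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        "X-Request-ID": rid,
    }


headers = request_headers("example-key")
# Log headers["X-Request-ID"] next to your own trace id before sending the request.
```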


---

# Text-to-speech

> Turn text into speech with POST /v1/tts, and tune generation with the sampling knobs.

`POST /v1/tts` renders text as speech in a single call:

```bash
curl -s https://api.kalpalabs.ai/v1/tts \
  -H "Authorization: Bearer $KALPA_API_KEY" -H 'Content-Type: application/json' \
  -d '{
    "text": "Hey there! How are you doing today?",
    "speaker": "0",
    "model": "kalpa-conversational-v1",
    "params": {"temperature": 0.7, "top_k": 40}
  }'
```

| Field | Notes |
|---|---|
| `text` | 1 – 8,000 characters. |
| `speaker` | A role label from the model's `speakers` (defaults to `"0"`). Labels are model-specific — see [Models](/models). |
| `model` | A public model id; omit for the default. |
| `params` | Sampling knobs, all optional — see below. |

Under the hood, TTS is sugar for a one-turn conversation: it is exactly [`/v1/converse`](/conversations) with a single turn carrying your `speaker` and `text`, for example `[{"speaker": "0", "text": "…"}]`, as the whole conversation. Use `converse` directly when you have prior turns to condition on.
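To make the equivalence concrete, here is a small sketch that rewrites a `/v1/tts` request body as the matching `/v1/converse` body. The function name `tts_as_converse` is hypothetical; the field names come from the request tables in these docs.

```python
def tts_as_converse(tts_body):
    """Rewrite a /v1/tts request body as the equivalent /v1/converse body."""
    # "0" is the documented default speaker for /v1/tts.
    turn = {"speaker": tts_body.get("speaker", "0"), "text": tts_body["text"]}
    body = {"conversation": [turn]}
    # model and params are accepted by both endpoints, so copy them when present.
    for key in ("model", "params"):
        if key in tts_body:
            body[key] = tts_body[key]
    return body


print(tts_as_converse({"text": "Hey there!", "speaker": "0"}))
# {'conversation': [{'speaker': '0', 'text': 'Hey there!'}]}
```

Only the request shape is interchangeable: a `converse` response nests the audio under `reply.audio`, while a `tts` response returns it at the top level as `audio`.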

## The audio you get back

```json
{
  "request_id": "…",
  "model": "kalpa-conversational-v1",
  "text": "Hey there! How are you doing today?",
  "audio": {
    "format": "wav",
    "sample_rate": 24000,
    "num_quantizers": 32,
    "data_b64": "UklGRi…"
  },
  "usage": { "input_chars": 35, "input_audio_seconds": 0.0, "output_audio_seconds": 2.6 }
}
```

`data_b64` is a complete 16-bit PCM WAV file (mono, 24 kHz), base64-encoded — decode it and play:

```python
import base64

# `reply` is the parsed JSON body of the /v1/tts response shown above.
with open("out.wav", "wb") as f:
    f.write(base64.b64decode(reply["audio"]["data_b64"]))
```

`num_quantizers` is how many RVQ levels were decoded into the waveform (32 = full fidelity; see the `quantizers` knob).
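If you want to confirm the format before playing or storing a clip, the standard-library `wave` module can read the header straight from the decoded bytes. This is a sketch; `describe_wav` is a hypothetical helper name, and it assumes the WAV header carries an accurate frame count.

```python
import base64
import io
import wave


def describe_wav(data_b64):
    """Decode a base64 WAV payload and report its header fields."""
    raw = base64.b64decode(data_b64)
    with wave.open(io.BytesIO(raw), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),            # expected: 1 (mono)
            "sample_width_bytes": wav.getsampwidth(),  # expected: 2 (16-bit)
            "sample_rate": rate,                       # expected: 24000
            "duration_seconds": frames / rate,
        }


# info = describe_wav(reply["audio"]["data_b64"])
```

The computed `duration_seconds` should be close to `usage.output_audio_seconds` for the same response.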

## Generation parameters

All fields of `params` are optional; the defaults are the tuned starting point. The same knobs (with the live defaults and UI ranges) are served at `GET /v1/info`.

| Param | Default | Range | What it does |
|---|---|---|---|
| `temperature` | `0.7` | 0 – 1.5 | Backbone sampling temperature (text + semantic audio). `0` = greedy. |
| `depth_temperature` | `null` | 0 – 1.5 | Acoustic-codes temperature. `null` = follows `temperature`. |
| `top_k` | `null` | ≥ 1 | Sample the backbone from the k most-likely tokens. `null` = full vocabulary; `40` is a good setting at nonzero temperatures. |
| `repetition_penalty` | `3.0` | 0 – 6 | Audio repetition penalty; prevents collapse to silence. |
| `penalty_window` | `20` | 1 – 80 | Frames of history the repetition penalty looks back over. |
| `max_new_tokens` | `512` | 16 – 2048 | Generation cap. About 12.5 audio frames make up 1 s of speech, so the default caps a reply at roughly 41 s (512 / 12.5 ≈ 40.96). |
| `quantizers` | `null` | 8 / 16 / 32 | Decode only the first N RVQ levels (smaller, lower-fidelity audio). `null` = full depth. |

Two practical notes:

- For reproducible output, set `temperature` to `0` — with greedy decoding the same request returns the same speech.
- If long generations trail into silence or repeat, raise `repetition_penalty` or lower `max_new_tokens` before touching anything else.
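The table's arithmetic and ranges can be checked client-side before sending a request. The sketch below is illustrative only: the function names are hypothetical, the bounds are copied from the table rather than fetched from `GET /v1/info`, and `quantizers` is left out because it takes discrete values.

```python
FRAMES_PER_SECOND = 12.5  # approximate rate quoted in the table above

# (low, high) bounds copied from the table; None means no upper bound.
RANGES = {
    "temperature": (0, 1.5),
    "depth_temperature": (0, 1.5),
    "top_k": (1, None),
    "repetition_penalty": (0, 6),
    "penalty_window": (1, 80),
    "max_new_tokens": (16, 2048),
}


def max_reply_seconds(max_new_tokens=512):
    """Approximate longest reply, in seconds, for a given generation cap."""
    return max_new_tokens / FRAMES_PER_SECOND


def out_of_range(params):
    """Return the names of params that fall outside the documented ranges."""
    bad = []
    for name, value in params.items():
        if value is None or name not in RANGES:
            continue  # null keeps the default behaviour; unknown keys are left to the server
        low, high = RANGES[name]
        if value < low or (high is not None and value > high):
            bad.append(name)
    return bad


print(max_reply_seconds())                                # 40.96
print(out_of_range({"temperature": 2.0, "top_k": 40}))    # ['temperature']
```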


---

# Conversations

> POST /v1/converse completes the open turn of a conversation — authored speech, contextual TTS, and spoken history.

`POST /v1/converse` is the API's core. You send **one conversation** — a list of turns, oldest first — and the model **completes its last turn**. There is no separate "target text" or "target speaker" field; the last turn *is* the request.

A turn has three fields, which can be combined freely within the rules below:

```json
{ "speaker": "0", "text": "…", "audio_wav_b64": "…" }
```

The rules:

- Every turn **except the last** is grounded history: it must carry `text` and/or `audio_wav_b64`.
- The **last turn is the open turn** — the one to complete. What it carries decides what the model does.
- A last turn that already has both text and audio has nothing left to generate → `400`.
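The rules above can be mirrored in a small client-side check, sketched here under stated assumptions. `classify_open_turn` is a hypothetical helper, the mode names it returns are illustrative, and an open turn carrying only audio is reported as `"unspecified"` because these docs do not describe that case.

```python
def classify_open_turn(conversation):
    """Return what the final turn asks for, or raise ValueError if a rule is broken."""
    if not conversation:
        raise ValueError("conversation must contain at least one turn")
    *history, last = conversation
    # Every turn except the last must be grounded in text and/or audio.
    for i, turn in enumerate(history):
        if not turn.get("text") and not turn.get("audio_wav_b64"):
            raise ValueError(f"history turn {i} needs text and/or audio_wav_b64")
    has_text = bool(last.get("text"))
    has_audio = bool(last.get("audio_wav_b64"))
    if has_text and has_audio:
        raise ValueError("open turn already has text and audio; nothing to generate")
    if has_audio:
        return "unspecified"  # audio-only open turns are not described in these docs
    # Text present: render it in context. Speaker only: the model authors the turn.
    return "render" if has_text else "author"


print(classify_open_turn([{"speaker": "0", "text": "Hi"}, {"speaker": "1"}]))  # author
```

Running a check like this locally turns a round trip that would end in `400` into an immediate, descriptive error.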

## Author the next turn

An open turn with only a `speaker` asks the model to write *and* voice it:

```bash
curl -s https://api.kalpalabs.ai/v1/converse \
  -H "Authorization: Bearer $KALPA_API_KEY" -H 'Content-Type: application/json' \
  -d '{
    "conversation": [
      {"speaker": "0", "text": "Did you end up trying that recipe?"},
      {"speaker": "1", "text": "I did — the timing was the hard part."},
      {"speaker": "0"}
    ]
  }'
```

The reply carries the authored text and its audio:

```json
{
  "reply": {
    "speaker": "0",
    "text": "Same here. Did you keep the flame low like she said?",
    "audio": { "format": "wav", "sample_rate": 24000, "num_quantizers": 32, "data_b64": "…" }
  }
}
```

## Contextual TTS: render exact text, in context

An open turn with `speaker` **and** `text` renders exactly that text, conditioned on everything before it — same voice, continued rhythm. This is how you re-voice an edited turn or drive a scripted dialogue:

```json
{
  "conversation": [
    {"speaker": "0", "text": "Welcome back to the show."},
    {"speaker": "1", "text": "Glad to be here."},
    {"speaker": "0", "text": "Let's pick up where we left off — episode twelve."}
  ]
}
```

The model speaks the last line as speaker `"0"`, sounding like the same person who said the first.

## Spoken history and reference audio

Any history turn may carry `audio_wav_b64` — base64 16-bit PCM WAV (a `data:` URI prefix is accepted), up to 25 MiB decoded per turn. Two uses:

- **Spoken history**: pass the actual audio of earlier turns (yours or previous API replies) so the model hears the conversation instead of just reading it.
- **Reference voice**: open a conversation with a turn containing a short clip of a voice, then have that `speaker` complete the open turn — the model continues in that voice.

`usage.input_audio_seconds` meters the audio you send; see [Usage](/usage).
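One way to build spoken history is to feed each reply back as a grounded turn. The sketch below assumes the `reply` object shape shown earlier on this page; `extend_with_reply` is a hypothetical helper name.

```python
def extend_with_reply(conversation, reply, next_speaker):
    """Return a new conversation: the old history, the reply as a grounded turn,
    and a fresh speaker-only open turn for `next_speaker`."""
    grounded = {"speaker": reply["speaker"], "text": reply["text"]}
    audio = reply.get("audio")
    if audio:  # reply.audio is nullable in the response schema
        grounded["audio_wav_b64"] = audio["data_b64"]
    # Drop the old open turn: the reply is its completed form.
    return conversation[:-1] + [grounded, {"speaker": next_speaker}]


conv = [{"speaker": "0", "text": "Hi, who are you?"}, {"speaker": "1"}]
reply = {"speaker": "1", "text": "A speech model.", "audio": {"data_b64": "UklG"}}
print(len(extend_with_reply(conv, reply, "0")))  # 3
```

Audio resent this way counts toward `input_audio_seconds` on the next request, so long conversations grow in metered input as well as in payload size.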

## Speaker labels

`speaker` values are **positional role labels, not names** — the labels the model was trained on, listed per model in [`GET /v1/models`](/models) (the current conversational models use `"0"` and `"1"`, in turn order). Two things matter:

- Use only the model's advertised labels. Anything else (a name, `speaker_0`, …) degrades output badly.
- Keep a label bound to one voice within a conversation: `"0"` is whoever spoke first, `"1"` the other party.
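A quick guard against unadvertised labels might look like the following sketch. `unknown_speakers` is a hypothetical helper, and `model_card` is assumed to be one entry of the `data` array returned by `GET /v1/models`.

```python
def unknown_speakers(conversation, model_card):
    """Return labels used in the conversation that the model card does not list."""
    allowed = set(model_card["speakers"])
    unknown = []
    for turn in conversation:
        label = turn.get("speaker", "0")  # "0" is the documented default
        if label not in allowed and label not in unknown:
            unknown.append(label)
    return unknown


card = {"id": "kalpa-conversational-v1", "speakers": ["0", "1"]}
conv = [{"speaker": "speaker_0", "text": "Hi"}, {"speaker": "1"}]
print(unknown_speakers(conv, card))  # ['speaker_0']
```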

## Limits

| Cap | Value |
|---|---|
| Turns per conversation | 64 |
| Text per turn | 8,000 characters |
| Audio per turn | 25 MiB decoded WAV |

The gateway rejects anything over these caps with `400 invalid_request` — see [Rate limits & errors](/rate-limits-and-errors).
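The caps in the table can be pre-checked before a request leaves your process. This sketch hard-codes the values above (the live ones are served by `GET /v1/info`), uses a hypothetical helper name `cap_violations`, and assumes that characters are counted as Python counts them with `len`, which the docs do not specify.

```python
import base64

MAX_TURNS = 64
MAX_TEXT_CHARS = 8000
MAX_AUDIO_BYTES = 25 * 1024 * 1024  # 25 MiB, measured on the decoded WAV


def cap_violations(conversation):
    """Return human-readable descriptions of every cap the conversation exceeds."""
    problems = []
    if len(conversation) > MAX_TURNS:
        problems.append(f"{len(conversation)} turns exceeds {MAX_TURNS}")
    for i, turn in enumerate(conversation):
        text = turn.get("text") or ""
        if len(text) > MAX_TEXT_CHARS:
            problems.append(f"turn {i}: text has {len(text)} characters")
        audio = turn.get("audio_wav_b64")
        if audio:
            # A data: URI prefix is accepted by the API; strip it before decoding.
            payload = audio.split(",", 1)[1] if audio.startswith("data:") else audio
            size = len(base64.b64decode(payload))
            if size > MAX_AUDIO_BYTES:
                problems.append(f"turn {i}: audio decodes to {size} bytes")
    return problems
```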


---

# Models

> The public model registry, per-request model switching, and what it costs in latency.

The API exposes stable public model ids that hide checkpoints and infrastructure. List them:

```bash
curl -s https://api.kalpalabs.ai/v1/models -H "Authorization: Bearer $KALPA_API_KEY"
```

```json
{
  "data": [
    {
      "id": "kalpa-conversational-v1",
      "display_name": "Kalpa Conversational v1",
      "description": "Flagship multi-speaker conversational speech model (TTS + converse).",
      "modes": ["converse", "tts"],
      "speakers": ["0", "1"],
      "default": true
    },
    { "id": "kalpa-conversational-8b",   "modes": ["converse", "tts"], "speakers": ["0", "1"], "default": false },
    { "id": "kalpa-conversational-mini", "modes": ["converse", "tts"], "speakers": ["0", "1"], "default": false }
  ]
}
```

| Model | Use it for |
|---|---|
| `kalpa-conversational-v1` | The flagship — best quality; the default when `model` is omitted. |
| `kalpa-conversational-8b` | 8B variant — quality close to flagship at lower serving cost. |
| `kalpa-conversational-mini` | Compact variant — fastest to load and cheapest to run. |

## Choosing a model per request

Every generation endpoint takes a `model` field:

```json
{ "text": "…", "model": "kalpa-conversational-mini" }
```

- Omit it (or send `null`) for the default model. The response's `model` field always echoes the **resolved** public id, so logs stay unambiguous.
- An unknown id, or a model that doesn't support the endpoint's mode, returns `400 invalid_request`.
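The resolution rules can be mirrored locally against the registry response. This is a sketch of client-side behavior, not the server's implementation; `resolve_model` is a hypothetical helper and `cards` is assumed to be the `data` array from `GET /v1/models`.

```python
def resolve_model(cards, requested=None, mode="tts"):
    """Return the public id that `requested` (or the default) resolves to,
    raising ValueError if it is unknown or does not support `mode`."""
    if requested is None:
        matches = [c for c in cards if c.get("default")]
    else:
        matches = [c for c in cards if c["id"] == requested]
    if not matches:
        raise ValueError(f"unknown model: {requested!r}")
    card = matches[0]
    if mode not in card["modes"]:
        raise ValueError(f"{card['id']} does not support mode {mode!r}")
    return card["id"]


cards = [
    {"id": "kalpa-conversational-v1", "modes": ["converse", "tts"], "default": True},
    {"id": "kalpa-conversational-mini", "modes": ["converse", "tts"], "default": False},
]
print(resolve_model(cards))  # kalpa-conversational-v1
```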

## Speakers are per model

Each model card (an entry in the `data` array above) has a `speakers` field listing the role labels that model understands, in turn order. Don't hardcode them; read the card and use its labels. The details (and why wrong labels degrade audio) are in [Conversations](/conversations).

## Switching cost

One model is resident on the accelerator at a time. Requests to the resident model are fast; **the first request after a switch pays the load** — from seconds for the mini model to tens of seconds for the flagship. If your traffic is latency-sensitive, keep it on one model rather than alternating, and expect the first call to a cold model to be slow (set client timeouts accordingly).
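For batch workloads, one way to avoid alternating is to reorder queued jobs so each model's requests run back to back. The sketch below is illustrative: `group_by_model` and the job dictionaries are hypothetical, and it only controls the order of your own requests.

```python
def group_by_model(jobs, default_model="kalpa-conversational-v1"):
    """Reorder jobs so that all requests for one model run consecutively.

    Models keep their first-seen order, and jobs keep their relative order
    within each model. A job without a `model` counts as the default model.
    """
    order = []
    buckets = {}
    for job in jobs:
        model = job.get("model") or default_model
        if model not in buckets:
            buckets[model] = []
            order.append(model)
        buckets[model].append(job)
    return [job for model in order for job in buckets[model]]


jobs = [
    {"id": 1, "model": "kalpa-conversational-mini"},
    {"id": 2},
    {"id": 3, "model": "kalpa-conversational-mini"},
]
print([job["id"] for job in group_by_model(jobs)])  # [1, 3, 2]
```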


---

# Rate limits & errors

> One error envelope for everything, per-key token buckets, and the caps the gateway enforces.

## Rate limits

Each key has a sustained requests-per-minute allowance and a burst capacity (a token bucket: `burst` requests available at once, refilling at the sustained rate). Both are set when the key is provisioned — ask [hello@kalpalabs.ai](mailto:hello@kalpalabs.ai) to raise them.
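To see how a token bucket behaves, here is a minimal client-side model of one. It is a sketch of the general technique, not the gateway's actual implementation; the class name and the example numbers (burst of 2, 60 requests per minute) are made up for illustration. Time is passed in explicitly so the behavior is easy to follow.

```python
class TokenBucket:
    """A bucket holding up to `burst` tokens, refilled at `per_minute` tokens/minute."""

    def __init__(self, burst, per_minute, now=0.0):
        self.capacity = burst
        self.rate = per_minute / 60.0  # tokens per second
        self.tokens = float(burst)     # a new bucket starts full
        self.updated = now

    def _refill(self, now):
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now

    def try_acquire(self, now):
        """Spend one token if available; return whether the request may go."""
        self._refill(now)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def wait_time(self, now):
        """Seconds until one token is available (0.0 if one already is)."""
        self._refill(now)
        return 0.0 if self.tokens >= 1 else (1 - self.tokens) / self.rate


bucket = TokenBucket(burst=2, per_minute=60)
print([bucket.try_acquire(0.0) for _ in range(3)])  # [True, True, False]
print(bucket.wait_time(0.0))                        # 1.0
```

A local model like this can pace outgoing requests so that most of them never reach a `429`.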

Every response reports where you stand:

| Header | Meaning |
|---|---|
| `X-RateLimit-Limit` | Your sustained requests/minute |
| `X-RateLimit-Remaining` | Requests left in the bucket right now |
| `X-RateLimit-Reset` | Seconds until the bucket refills |

Past the limit you get `429` with a `Retry-After` header:

```json
{ "error": { "type": "rate_limit_exceeded", "message": "Rate limit exceeded. Retry after 1.2s.", "request_id": "…" } }
```

Honor `Retry-After` and back off; all generation requests are safe to retry (nothing is committed on a failed call).

## The error envelope

Every error — validation, auth, rate limit, model failure, crash — is one shape:

```json
{ "error": { "type": "…", "message": "…", "request_id": "…" } }
```

| Status | `type` | Meaning |
|---|---|---|
| `400` | `invalid_request` | Semantically invalid: over a cap, undecodable audio, unknown model, a conversation that breaks the [turn rules](/conversations). |
| `401` | `authentication_error` | Missing or invalid API key. |
| `404` | `not_found` | No such path. |
| `405` | `method_not_allowed` | Wrong HTTP method for the path. |
| `422` | `invalid_request` | The body doesn't match the schema (missing field, wrong type, out-of-range value). |
| `429` | `rate_limit_exceeded` | Over your key's limit — honor `Retry-After`. |
| `500` | `internal_error` | Unexpected failure on our side. Report it with the `request_id`. |
| `502` | `inference_error` | The model backend failed or timed out. Retryable. |

## Request caps

The gateway enforces hard caps before anything reaches a model:

| Cap | Value |
|---|---|
| Text per request/turn | 8,000 characters |
| Turns per conversation | 64 |
| Audio per turn | 25 MiB decoded WAV |

Current values are always served at `GET /v1/info` under `limits`.

## Debugging a failed call

1. Read `error.type` — it's stable and machine-matchable; `message` is for humans.
2. `4xx` other than `429`: fix the request (the message says which field).
3. `429` / `502`: retry with backoff (`Retry-After` for 429).
4. Anything persistent: send us the `request_id` (also in the `X-Request-ID` response header).
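The steps above can be sketched as two small functions. The names, the action labels, and the backoff constants are illustrative assumptions; the sketch also assumes `Retry-After` holds a number of seconds and that `headers` is a plain dict keyed exactly as `Retry-After`.

```python
def next_action(status, body):
    """Map an error response to 'retry', 'fix_request', or 'report'."""
    error_type = (body.get("error") or {}).get("type")
    if status in (429, 502) or error_type in ("rate_limit_exceeded", "inference_error"):
        return "retry"
    if 400 <= status < 500:
        return "fix_request"  # the message says which field is wrong
    return "report"           # e.g. 500: send the request_id to support


def retry_delay(headers, attempt, base=1.0, cap=30.0):
    """Seconds to wait before retry number `attempt` (0-based)."""
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        try:
            return float(retry_after)
        except ValueError:
            pass  # not a plain number; fall back to exponential backoff
    return min(cap, base * 2 ** attempt)


print(next_action(429, {"error": {"type": "rate_limit_exceeded"}}))  # retry
print(retry_delay({"Retry-After": "1.2"}, attempt=0))                # 1.2
print(retry_delay({}, attempt=3))                                    # 8.0
```

Matching on `error.type` as well as the status code keeps the logic stable if a proxy in between rewrites the status.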


---

# Usage

> What every response meters, and reading your key’s running totals from GET /v1/usage.

## What every response meters

Each generation response carries a `usage` object — the metered quantities for that call:

```json
{ "usage": { "input_chars": 42, "input_audio_seconds": 6.5, "output_audio_seconds": 3.2 } }
```

| Field | Meaning |
|---|---|
| `input_chars` | Characters of input text billed for the request. |
| `input_audio_seconds` | Seconds of audio you supplied (spoken history / reference clips in [converse](/conversations)). |
| `output_audio_seconds` | Seconds of audio generated for you. |

Log these on your side if you want per-request attribution — the API's own accounting is keyed to your API key, not to your users.
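A minimal sketch of such per-user attribution follows. `UsageLedger` and the user ids are hypothetical; the three field names are the ones in the `usage` object above.

```python
from collections import defaultdict

FIELDS = ("input_chars", "input_audio_seconds", "output_audio_seconds")


class UsageLedger:
    """Accumulate the per-response `usage` object under your own user ids."""

    def __init__(self):
        self.totals = defaultdict(lambda: dict.fromkeys(FIELDS, 0))

    def record(self, user_id, response):
        usage = response.get("usage", {})
        row = self.totals[user_id]
        for field in FIELDS:
            row[field] += usage.get(field, 0)


ledger = UsageLedger()
ledger.record("user-a", {"usage": {"input_chars": 16, "output_audio_seconds": 3.5}})
ledger.record("user-a", {"usage": {"input_chars": 35, "output_audio_seconds": 2.5}})
print(ledger.totals["user-a"]["input_chars"])  # 51
```

In production you would persist these rows rather than keep them in memory, but the accumulation logic stays the same.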

## Your running totals

`GET /v1/usage` returns the running totals for the key making the request:

```bash
curl -s https://api.kalpalabs.ai/v1/usage -H "Authorization: Bearer $KALPA_API_KEY"
```

```json
{
  "key_id": "acme",
  "requests": 1284,
  "input_chars": 91230,
  "input_audio_seconds": 411.0,
  "output_audio_seconds": 3120.5,
  "last_request_ts": 1751500000.0
}
```

`key_id` is the non-secret label of your key (it's what appears in our logs — the key itself never does). For invoicing-grade reports or historical breakdowns, write to [hello@kalpalabs.ai](mailto:hello@kalpalabs.ai).


---

# API reference

> Every endpoint, field and error — generated from the committed OpenAPI contract.

Base URL: `https://api.kalpalabs.ai`. Every request and response body is JSON. Authenticated endpoints take `Authorization: Bearer $KALPA_API_KEY` (or `X-API-Key`). Every error, on every endpoint, is the one envelope:

```json
{ "error": { "type": "rate_limit_exceeded", "message": "…", "request_id": "…" } }
```

## POST /v1/tts

**Synthesize speech from text.**

Render the given text as speech (24 kHz mono WAV) in the requested speaker's voice.

### Request — `TtsRequest`

| Field | Type | Default | Constraints | Description |
|---|---|---|---|---|
| `text` *(required)* | `string` |  | 1 – 8000 chars | Text to speak. |
| `model` | `string \| null` |  |  | Public model id (see GET /v1/models). Omit/null for the default model. |
| `params` | `GenParamsModel` |  |  |  |
| `params.depth_temperature` | `number \| null` |  | 0 – 1.5 | Acoustic temperature; null = follow temperature. |
| `params.max_new_tokens` | `integer` | `512` | 16 – 2048 |  |
| `params.penalty_window` | `integer` | `20` | 1 – 80 |  |
| `params.quantizers` | `integer \| null` |  | ≥ 1 | Decode only the first N RVQ levels; null = full depth. |
| `params.repetition_penalty` | `number` | `3` | 0 – 6 |  |
| `params.temperature` | `number` | `0.7` | 0 – 1.5 |  |
| `params.top_k` | `integer \| null` |  | ≥ 1 | Backbone top-k; null = full vocabulary. |
| `speaker` | `string` | `"0"` |  | Speaker role to render the text as (one of the model's `speakers`; see GET /v1/models). |

### Response 200 — `TtsResponse`

| Field | Type | Default | Constraints | Description |
|---|---|---|---|---|
| `audio` *(required)* | `AudioPayload` |  |  |  |
| `audio.data_b64` *(required)* | `string` |  |  | Base64-encoded 16-bit PCM WAV (mono). |
| `audio.num_quantizers` *(required)* | `integer` |  |  | Number of RVQ levels decoded into this audio. |
| `audio.sample_rate` *(required)* | `integer` |  |  | Sample rate of the audio in Hz. |
| `audio.format` | `string` | `"wav"` |  | Container/encoding of `data_b64` (16-bit PCM WAV). |
| `model` *(required)* | `string` |  |  |  |
| `request_id` *(required)* | `string` |  |  |  |
| `text` *(required)* | `string` |  |  | The text that was spoken (echoes the request). |
| `usage` *(required)* | `Usage` |  |  |  |
| `usage.input_audio_seconds` | `number` | `0` |  | Seconds of input audio supplied (converse). |
| `usage.input_chars` | `integer` | `0` |  | Characters of input text billed for this request. |
| `usage.output_audio_seconds` | `number` | `0` |  | Seconds of audio generated. |
| `meta` | `object` |  |  | Backend-specific diagnostics (latency, frames, …). |

Errors: `400`, `401`, `429`, `502` (+ `422` on schema violations); see [Rate limits & errors](/rate-limits-and-errors).

```bash
curl -s https://api.kalpalabs.ai/v1/tts \
  -H "Authorization: Bearer $KALPA_API_KEY" -H 'Content-Type: application/json' \
  -d '{"text": "Hey there! How are you doing today?", "speaker": "0"}'
```

## POST /v1/converse

**Complete the open (final) turn of a conversation.**

Given a conversation, complete its last ('open') turn. A speaker-only open turn is authored (text + audio); an open turn with text is rendered as that speaker, conditioned on the prior turns (contextual TTS).

### Request — `ConverseRequest`

| Field | Type | Default | Constraints | Description |
|---|---|---|---|---|
| `conversation` *(required)* | `ConversationTurnModel[]` |  | 1 – 64 items | The conversation, oldest turn first; the last turn is the open turn to complete. |
| `conversation[].audio_wav_b64` | `string \| null` |  |  | Base64 16-bit PCM WAV of this turn's audio, if any. |
| `conversation[].speaker` | `string` | `"0"` |  | Role label for this turn (one of the model's `speakers`). |
| `conversation[].text` | `string \| null` |  | ≤ 8000 chars | Text spoken in this turn, if known. |
| `model` | `string \| null` |  |  | Public model id (see GET /v1/models). Omit/null for the default model. |
| `params` | `GenParamsModel` |  |  |  |
| `params.depth_temperature` | `number \| null` |  | 0 – 1.5 | Acoustic temperature; null = follow temperature. |
| `params.max_new_tokens` | `integer` | `512` | 16 – 2048 |  |
| `params.penalty_window` | `integer` | `20` | 1 – 80 |  |
| `params.quantizers` | `integer \| null` |  | ≥ 1 | Decode only the first N RVQ levels; null = full depth. |
| `params.repetition_penalty` | `number` | `3` | 0 – 6 |  |
| `params.temperature` | `number` | `0.7` | 0 – 1.5 |  |
| `params.top_k` | `integer \| null` |  | ≥ 1 | Backbone top-k; null = full vocabulary. |

### Response 200 — `ConverseResponse`

| Field | Type | Default | Constraints | Description |
|---|---|---|---|---|
| `model` *(required)* | `string` |  |  |  |
| `reply` *(required)* | `ConverseReply` |  |  |  |
| `reply.speaker` *(required)* | `string` |  |  |  |
| `reply.text` *(required)* | `string` |  |  |  |
| `reply.audio` | `AudioPayload \| null` |  |  |  |
| `reply.audio.data_b64` *(required)* | `string` |  |  | Base64-encoded 16-bit PCM WAV (mono). |
| `reply.audio.num_quantizers` *(required)* | `integer` |  |  | Number of RVQ levels decoded into this audio. |
| `reply.audio.sample_rate` *(required)* | `integer` |  |  | Sample rate of the audio in Hz. |
| `reply.audio.format` | `string` | `"wav"` |  | Container/encoding of `data_b64` (16-bit PCM WAV). |
| `request_id` *(required)* | `string` |  |  |  |
| `usage` *(required)* | `Usage` |  |  |  |
| `usage.input_audio_seconds` | `number` | `0` |  | Seconds of input audio supplied (converse). |
| `usage.input_chars` | `integer` | `0` |  | Characters of input text billed for this request. |
| `usage.output_audio_seconds` | `number` | `0` |  | Seconds of audio generated. |
| `meta` | `object` |  |  |  |

Errors: `400`, `401`, `429`, `502` (+ `422` on schema violations); see [Rate limits & errors](/rate-limits-and-errors).

```bash
curl -s https://api.kalpalabs.ai/v1/converse \
  -H "Authorization: Bearer $KALPA_API_KEY" -H 'Content-Type: application/json' \
  -d '{"conversation": [{"speaker": "0", "text": "Hi, who are you?"}, {"speaker": "1"}]}'
```

## GET /v1/models

**List available public models.**

### Response 200 — `ModelsResponse`

| Field | Type | Default | Constraints | Description |
|---|---|---|---|---|
| `data` *(required)* | `ModelCard[]` |  |  | The available public models. |
| `data[].display_name` *(required)* | `string` |  |  | Human-readable model name. |
| `data[].id` *(required)* | `string` |  |  | Stable public model id used in the `model` request field. |
| `data[].modes` *(required)* | `string[]` |  |  | Supported modes: subset of ["converse", "tts"]. |
| `data[].speakers` *(required)* | `string[]` |  |  | Valid role labels for a turn's `speaker`, in turn order (e.g. ["0", "1"]). |
| `data[].default` | `boolean` | `false` |  | True for the model used when `model` is omitted. |
| `data[].description` | `string` | `""` |  | What this model is for. |

```bash
curl -s https://api.kalpalabs.ai/v1/models -H "Authorization: Bearer $KALPA_API_KEY"
```

## GET /v1/info

**Backend info, default params, and limits.**

### Response 200 — `InfoResponse`

| Field | Type | Default | Constraints | Description |
|---|---|---|---|---|
| `backend` *(required)* | `object` |  |  | Active backend description (name, kind, sample_rate, …). |
| `defaults` *(required)* | `object` |  |  | Default generation params. |
| `limits` *(required)* | `object` |  |  | Request-validation caps the gateway enforces. |
| `param_schema` *(required)* | `object[]` |  |  | UI metadata for the generation knobs. |

```bash
curl -s https://api.kalpalabs.ai/v1/info -H "Authorization: Bearer $KALPA_API_KEY"
```

## GET /v1/usage

**Your metered usage.**

Running totals (requests, input characters, audio seconds) for the calling API key.

### Response 200 — `UsageSummaryResponse`

| Field | Type | Default | Constraints | Description |
|---|---|---|---|---|
| `input_audio_seconds` *(required)* | `number` |  |  |  |
| `input_chars` *(required)* | `integer` |  |  |  |
| `key_id` *(required)* | `string` |  |  |  |
| `output_audio_seconds` *(required)* | `number` |  |  |  |
| `requests` *(required)* | `integer` |  |  |  |
| `last_request_ts` | `number \| null` |  |  |  |

Errors: `401` (+ `422` on schema violations); see [Rate limits & errors](/rate-limits-and-errors).

```bash
curl -s https://api.kalpalabs.ai/v1/usage -H "Authorization: Bearer $KALPA_API_KEY"
```

## GET /health

**Liveness probe.** No authentication.

### Response 200 — `HealthResponse`

| Field | Type | Default | Constraints | Description |
|---|---|---|---|---|
| `backend` *(required)* | `string` |  |  |  |
| `ready` *(required)* | `boolean` |  |  |  |
| `status` | `string` | `"ok"` |  |  |

```bash
curl -s https://api.kalpalabs.ai/health
```