# Models

> The public model registry, per-request model switching, and what it costs in latency.

The API exposes stable public model ids that hide checkpoints and infrastructure. List them with `GET /v1/models`:

```bash
curl -s https://api.kalpalabs.ai/v1/models -H "Authorization: Bearer $KALPA_API_KEY"
```

```json
{
  "data": [
    {
      "id": "kalpa-conversational-v1",
      "display_name": "Kalpa Conversational v1",
      "description": "Flagship multi-speaker conversational speech model (TTS + converse).",
      "modes": ["converse", "tts"],
      "speakers": ["0", "1"],
      "default": true
    },
    { "id": "kalpa-conversational-8b",   "modes": ["converse", "tts"], "speakers": ["0", "1"], "default": false },
    { "id": "kalpa-conversational-mini", "modes": ["converse", "tts"], "speakers": ["0", "1"], "default": false }
  ]
}
```
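
Client code usually needs two things from this listing: the default model's id and the set of models that support a given mode. The following is a minimal Python sketch of that lookup, working on a trimmed copy of the response above. The helper names (`index_cards`, `default_model_id`, `models_for_mode`) are illustrative and not part of any Kalpa SDK, and the sketch assumes exactly one card carries `"default": true`.

```python
import json

# Trimmed copy of the listing response shown above; only the fields used here.
MODELS_JSON = """
{
  "data": [
    {"id": "kalpa-conversational-v1",   "modes": ["converse", "tts"], "speakers": ["0", "1"], "default": true},
    {"id": "kalpa-conversational-8b",   "modes": ["converse", "tts"], "speakers": ["0", "1"], "default": false},
    {"id": "kalpa-conversational-mini", "modes": ["converse", "tts"], "speakers": ["0", "1"], "default": false}
  ]
}
"""


def index_cards(body: str) -> dict:
    """Map each public model id to its card, preserving listing order."""
    return {card["id"]: card for card in json.loads(body)["data"]}


def default_model_id(cards: dict) -> str:
    """Return the id of the single card flagged as default (an assumption)."""
    defaults = [mid for mid, card in cards.items() if card.get("default")]
    if len(defaults) != 1:
        raise ValueError(f"expected exactly one default model, found {len(defaults)}")
    return defaults[0]


def models_for_mode(cards: dict, mode: str) -> list:
    """Return the ids of all models whose card lists the given mode."""
    return [mid for mid, card in cards.items() if mode in card.get("modes", [])]


cards = index_cards(MODELS_JSON)
print(default_model_id(cards))
print(models_for_mode(cards, "tts"))
```

In a real client, the string passed to `index_cards` would be the body returned by the `curl` call above rather than a hardcoded sample.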

| Model | Use it for |
|---|---|
| `kalpa-conversational-v1` | The flagship: best quality, and the default when `model` is omitted. |
| `kalpa-conversational-8b` | 8B variant: quality close to the flagship at lower serving cost. |
| `kalpa-conversational-mini` | Compact variant: fastest to load and cheapest to run. |

## Choosing a model per request

Every generation endpoint takes a `model` field:

```json
{ "text": "…", "model": "kalpa-conversational-mini" }
```

- Omit it (or send `null`) for the default model. The response's `model` field always echoes the **resolved** public id, so logs stay unambiguous.
- An unknown id, or a model that doesn't support the endpoint's mode, returns `400 invalid_request`.
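
Both rules can be checked on the client before a request is sent, which avoids a round trip that would end in `400 invalid_request`. The following is a rough Python sketch under stated assumptions: `build_payload` and `resolved_model` are hypothetical helpers, `cards` is a dict keyed by model id built from the `/v1/models` listing, and the request body carries only the `text` and `model` fields shown above.

```python
from typing import Optional

# Cards keyed by public id, trimmed to the fields this sketch needs.
CARDS = {
    "kalpa-conversational-v1": {"modes": ["converse", "tts"], "default": True},
    "kalpa-conversational-mini": {"modes": ["converse", "tts"], "default": False},
}


def build_payload(cards: dict, text: str, mode: str, model: Optional[str] = None) -> dict:
    """Build a request body, rejecting ids the server would answer with a 400."""
    if model is None:
        # Omitting `model` lets the server pick its default.
        return {"text": text}
    card = cards.get(model)
    if card is None:
        raise ValueError(f"unknown model id: {model!r}")
    if mode not in card["modes"]:
        raise ValueError(f"model {model!r} does not support mode {mode!r}")
    return {"text": text, "model": model}


def resolved_model(response: dict) -> str:
    """Read the resolved public id that the response echoes back."""
    return response["model"]


payload = build_payload(CARDS, "Hello there.", mode="tts", model="kalpa-conversational-mini")
print(payload)
```

Logging `resolved_model(response)` rather than the id you sent keeps logs accurate for requests that relied on the default.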

## Speakers are per model

Each card's `speakers` lists the role labels that model understands, in turn order. Don't hardcode them; read the card and use its labels. The details (and why wrong labels degrade audio) are in [Conversations](/conversations).
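
As an illustration of reading labels from the card instead of hardcoding them, the Python sketch below pairs each utterance with a label taken from `speakers`. It assumes that turns strictly alternate through the labels in the order listed, which is one reading of "turn order", and it returns plain `(label, text)` tuples rather than a real request body, since the conversation request shape is documented in [Conversations](/conversations) and not here. `label_turns` is a hypothetical helper.

```python
def label_turns(card: dict, utterances: list) -> list:
    """Pair each utterance with a speaker label read from the model card.

    Assumption: speakers take turns in the order the card lists them,
    wrapping around when there are more utterances than labels.
    """
    labels = card["speakers"]
    if not labels:
        raise ValueError("card lists no speakers")
    return [(labels[i % len(labels)], text) for i, text in enumerate(utterances)]


card = {"id": "kalpa-conversational-v1", "speakers": ["0", "1"]}
for label, text in label_turns(card, ["Hi!", "Hello, how are you?", "Good, thanks."]):
    print(label, text)
```

Because the labels come from the card, the same code keeps working if a future model lists different labels.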

## Switching cost

Only one model is resident on the accelerator at a time. Requests to the resident model are fast; **the first request after a switch pays the load**, which ranges from seconds for the mini model to tens of seconds for the flagship. If your traffic is latency-sensitive, keep it on one model rather than alternating, and set client timeouts long enough to cover the slow first call to a cold model.
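
For offline or batch work where request order does not matter, one way to follow this advice is to group queued requests by model before sending them, and to use a longer timeout whenever the target differs from the model you last called. The Python sketch below shows both ideas. The timeout values are placeholders and not documented limits, the helper names are hypothetical, and "last model this client called" is only a guess at what is resident, since other traffic may have triggered a switch in between.

```python
# Placeholder timeouts in seconds; tune them against your own measurements.
WARM_TIMEOUT_S = 30.0
COLD_TIMEOUT_S = 120.0


def count_switches(models: list) -> int:
    """Count how often consecutive requests target different models."""
    return sum(1 for a, b in zip(models, models[1:]) if a != b)


def batch_by_model(requests: list) -> list:
    """Reorder requests so each model's requests run back to back.

    Models keep their order of first appearance, and requests for the
    same model keep their relative order (sorted() is stable).
    """
    first_seen = {}
    for req in requests:
        first_seen.setdefault(req["model"], len(first_seen))
    return sorted(requests, key=lambda req: first_seen[req["model"]])


def timeout_for(model: str, last_model) -> float:
    """Pick a longer timeout when the call is likely to hit a cold model."""
    return WARM_TIMEOUT_S if model == last_model else COLD_TIMEOUT_S


queue = [
    {"model": "kalpa-conversational-mini", "text": "a"},
    {"model": "kalpa-conversational-v1", "text": "b"},
    {"model": "kalpa-conversational-mini", "text": "c"},
    {"model": "kalpa-conversational-v1", "text": "d"},
]
batched = batch_by_model(queue)
print(count_switches([r["model"] for r in queue]))
print(count_switches([r["model"] for r in batched]))
```

Reordering is not appropriate for interactive traffic; there, pinning each latency-sensitive workload to a single model is the simpler option.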
