# Quickstart

> From API key to spoken audio in five minutes: one TTS call, then a conversation.

## 1. Get a key

While the API is in early access, keys are provisioned per team; request one from [hello@kalpalabs.ai](mailto:hello@kalpalabs.ai). Keep the key server-side and export it in the environment where your code runs:

```bash
export KALPA_API_KEY=...
```
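
In application code, it can help to fail early when the variable is missing rather than send an empty bearer token. A minimal sketch; `load_api_key` is a hypothetical helper name, not part of any Kalpa SDK:

```python
import os


def load_api_key(env_var: str = "KALPA_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set; export it before running")
    return key
```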

## 2. Say something

`POST /v1/tts` takes text and returns JSON whose `audio.data_b64` field holds the spoken audio as base64-encoded WAV. Decode that field and you have a playable file:

```bash
curl -s https://api.kalpalabs.ai/v1/tts \
  -H "Authorization: Bearer $KALPA_API_KEY" -H 'Content-Type: application/json' \
  -d '{"text": "Hey there! How are you doing today?", "speaker": "0"}' \
  | python3 -c 'import sys,json,base64; r=json.load(sys.stdin); open("out.wav","wb").write(base64.b64decode(r["audio"]["data_b64"])); print("wrote out.wav", r["usage"])'
```

`out.wav` is mono 16-bit PCM at 24 kHz. The same call in Python:

```python
import base64
import os

import requests

r = requests.post(
    "https://api.kalpalabs.ai/v1/tts",
    headers={"Authorization": f"Bearer {os.environ['KALPA_API_KEY']}"},
    json={"text": "Hey there! How are you doing today?", "speaker": "0"},
    timeout=60,  # requests has no default timeout; fail instead of hanging
)
r.raise_for_status()
reply = r.json()
with open("out.wav", "wb") as f:
    f.write(base64.b64decode(reply["audio"]["data_b64"]))
print(reply["usage"])  # characters in, seconds of audio out
```
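
To check that a decoded file matches the format stated above, the WAV header can be read with Python's standard `wave` module. This is a sketch: `describe_wav` is a hypothetical helper, and it assumes the decoded bytes are a plain PCM WAV file.

```python
import io
import wave


def describe_wav(wav_bytes: bytes) -> dict:
    """Read the WAV header and report channels, bit depth, rate, and duration."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        frames = w.getnframes()
        rate = w.getframerate()
        return {
            "channels": w.getnchannels(),
            "bits_per_sample": w.getsampwidth() * 8,  # sample width is in bytes
            "sample_rate": rate,
            "seconds": frames / rate,
        }
```

Calling `describe_wav(open("out.wav", "rb").read())` should report 1 channel, 16 bits per sample, and a 24000 Hz sample rate if the file matches the description above.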

## 3. Have a conversation

`POST /v1/converse` completes the **last turn** of a conversation. Here the last turn carries only a `speaker` — so the model authors it: it writes what speaker `"1"` would say next and voices it.

```bash
curl -s https://api.kalpalabs.ai/v1/converse \
  -H "Authorization: Bearer $KALPA_API_KEY" -H 'Content-Type: application/json' \
  -d '{
    "conversation": [
      {"speaker": "0", "text": "Hi, who are you?"},
      {"speaker": "1"}
    ]
  }'
```

The response carries the reply's text alongside its audio:

```json
{
  "request_id": "…",
  "model": "kalpa-conversational-v1",
  "reply": {
    "speaker": "1",
    "text": "I'm a speech model built by Kalpa Labs…",
    "audio": { "format": "wav", "sample_rate": 24000, "num_quantizers": 32, "data_b64": "…" }
  },
  "usage": { "input_chars": 16, "input_audio_seconds": 0.0, "output_audio_seconds": 3.2 }
}
```
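
The same request can be made from Python. The sketch below uses only the standard library so it runs without extra installs; `converse` and `decode_reply` are hypothetical helper names, and the field paths are taken from the example response above rather than from the full reference.

```python
import base64
import json
import urllib.request

API_URL = "https://api.kalpalabs.ai/v1/converse"


def converse(conversation: list, api_key: str) -> dict:
    """POST the conversation and return the parsed JSON response."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps({"conversation": conversation}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)


def decode_reply(response: dict) -> tuple:
    """Return (text, wav_bytes) from a /v1/converse response."""
    reply = response["reply"]
    return reply["text"], base64.b64decode(reply["audio"]["data_b64"])
```

A typical use would pass the two-turn conversation from the curl example to `converse`, then write the bytes returned by `decode_reply` to a `.wav` file.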

Give the last turn a `text` instead and the model renders exactly that text in context — contextual TTS. The full semantics (spoken history, reference audio, speaker labels) are in [Conversations](/conversations).
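
The two request shapes differ only in the final turn. A small sketch that builds both; `converse_payload` is a hypothetical helper, and the scripted line is made up for illustration:

```python
from typing import Optional


def converse_payload(history: list, speaker: str, text: Optional[str] = None) -> dict:
    """Append the final turn: speaker only (model writes and voices it),
    or speaker plus text (model voices exactly that text)."""
    last = {"speaker": speaker}
    if text is not None:
        last["text"] = text
    # Build a new list so the caller's history is left unchanged
    return {"conversation": [*history, last]}


history = [{"speaker": "0", "text": "Hi, who are you?"}]

# Open turn: the model authors speaker "1"'s reply
open_turn = converse_payload(history, "1")

# Contextual TTS: the model voices the supplied text
scripted = converse_payload(history, "1", "I'm your assistant for today.")
```

Either dictionary can be sent as the JSON body of `POST /v1/converse`.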

## Next

- [Conversations](/conversations) — the open-turn model, audio history, contextual TTS.
- [Text-to-speech](/text-to-speech) — the generation knobs and what they do.
- [Models](/models) — pick a model per request.
- [API reference](/reference) — every field, generated from [openapi.json](/openapi.json).