# h2t-evals

LLM Evaluation & Autonomous Optimization Platform. Production eval infrastructure for the Hou2Touch AI ecosystem: judge scoring, ground truth packs, and autonomous prompt optimization.

**Production API:** https://evals.hou2touch.ai
**Auth:** `X-H2T-Token: <token>` header on all endpoints

> This file describes h2t-evals for AI agents and LLM-powered tooling. For human-readable docs, see the [API Guide](docs/client/api-guide.md).

---

## Quick start

```bash
# 1. Judge a response synchronously
curl -X POST https://evals.hou2touch.ai/v1/judge/execute \
  -H "X-H2T-Token: $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "profile_id": "my-repo-v1",
    "assembled_prompt": "Evaluate this response: ...",
    "repo": "my-repo",
    "router_policy": "balanced"
  }'

# 2. Get score history
curl "https://evals.hou2touch.ai/v1/progress?profile_id=my-repo-v1&repo=my-repo&window=30d" \
  -H "X-H2T-Token: $TOKEN"

# 3. Check health
curl https://evals.hou2touch.ai/health
```

---

## Core concepts

**Judge profile** — TOML/JSON config defining criteria, weights, score scale, and execution tier. Register with `POST /v1/admin/judge-profiles`. Required fields: `profile_id`, `owner_repo`, `criteria[]`, `score_scale`, `execution`.

**Session** — eval run from one agent invocation. Open with `POST /v1/sessions/start`, close with `POST /v1/sessions/{id}/end`. Sessions carry `repo`, `framework`, `source` (e.g. `td-expert:v3`), and optional `context` JSONB for client metadata.

**Judge run** — one LLM-as-Judge evaluation. Stored in `judge_runs` with `verdict` (pass/fail), `overall_score`, `criteria[]`, `cost_usd`, `latency_ms`. Queryable via `GET /v1/judge-runs`.

**Ground Truth pack** — labeled test dataset. Cases have `input` dict with `assembled_prompt`, `model_output`, and optional `context` (RAG) or `tools_called` (Agentic). Used by optimizer workers.

**Candidate** — optimizer output (improved prompt). Status: `pending → approved/rejected`. Approve triggers lineage write to h2t-graphs.

---

## Judge execution

### Single judge run

```
POST /v1/judge/execute
```

```json
{
  "profile_id": "my-repo-v1",
  "assembled_prompt": "...",
  "repo": "my-repo",
  "router_policy": "balanced"
}
```

`router_policy` options: `"cost_optimized"`, `"balanced"`, `"quality_first"`, or `{"max_cost_usd": 0.005}`.

Response:
```json
{
  "ok": true,
  "verdict": "pass",
  "overall_score": 8.2,
  "confidence": 0.85,
  "criteria": [{"name": "quality", "score": 8.5, "weight": 0.5, "rationale": "..."}],
  "cost_usd": 0.003,
  "latency_ms": 1240
}
```
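
For agents calling the endpoint from Python rather than curl, the request above can be assembled with the standard library. This is a minimal sketch, not an official client: `build_judge_request` and `passed` are hypothetical helper names, and only the fields shown in this section are used.

```python
import json
import urllib.request

BASE_URL = "https://evals.hou2touch.ai"


def build_judge_request(token, profile_id, assembled_prompt, repo,
                        router_policy="balanced"):
    """Build (but do not send) a POST /v1/judge/execute request.

    router_policy may be one of the documented strings or a dict such as
    {"max_cost_usd": 0.005}.
    """
    body = {
        "profile_id": profile_id,
        "assembled_prompt": assembled_prompt,
        "repo": repo,
        "router_policy": router_policy,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/judge/execute",
        data=json.dumps(body).encode("utf-8"),
        headers={"X-H2T-Token": token, "Content-Type": "application/json"},
        method="POST",
    )


def passed(response_body):
    """True when a parsed response reports ok=true and verdict 'pass'."""
    return bool(response_body.get("ok")) and response_body.get("verdict") == "pass"
```

To send it, pass the request to `urllib.request.urlopen(req)` and parse the body with `json.load`.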

### Batch judge (up to 50 items)

```
POST /v1/judge/execute/batch
```

```json
{
  "profile_id": "my-repo-v1",
  "repo": "my-repo",
  "items": [
    {"item_id": "i1", "assembled_prompt": "...", "session_id": "sess-001"}
  ],
  "on_error": "continue"
}
```

`on_error`: `"continue"` (default) or `"fail"` (abort on first error).
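
Because a batch holds at most 50 items, larger workloads have to be split client-side. A minimal sketch of that split, assuming hypothetical helper names `chunk_items` and `build_batch_payloads`:

```python
MAX_BATCH = 50  # documented upper bound per batch request


def chunk_items(items, max_batch=MAX_BATCH):
    """Split a list of items into consecutive chunks of at most max_batch."""
    if max_batch < 1:
        raise ValueError("max_batch must be at least 1")
    return [items[i:i + max_batch] for i in range(0, len(items), max_batch)]


def build_batch_payloads(profile_id, repo, items, on_error="continue"):
    """Return one POST /v1/judge/execute/batch body per chunk of items."""
    if on_error not in ("continue", "fail"):
        raise ValueError("on_error must be 'continue' or 'fail'")
    return [
        {"profile_id": profile_id, "repo": repo, "items": chunk, "on_error": on_error}
        for chunk in chunk_items(items)
    ]
```

Each returned dict can be posted as-is; `item_id` values let results from different chunks be matched back to their inputs.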

---

## Score history and progress

```
GET /v1/progress?profile_id=<id>&repo=<repo>&window=30d&group_by=version
```

Returns a timeline of scores grouped by `source_version` or `day`. Includes `trend`, the delta between the first-day and last-day average scores.
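
The `trend` value can be reproduced locally to sanity-check a response. The sketch below assumes the delta is last-day average minus first-day average (the sign convention is not stated here) and returns `0.0` when fewer than two days have data, which is a choice made for this example only.

```python
from statistics import mean


def trend(scores_by_day):
    """Delta between the last and first day's average score.

    scores_by_day maps an ISO date string to a list of overall scores.
    ASSUMPTION: last minus first, so a positive value means improvement.
    """
    days = sorted(day for day, scores in scores_by_day.items() if scores)
    if len(days) < 2:
        return 0.0  # illustrative fallback, not documented service behavior
    return mean(scores_by_day[days[-1]]) - mean(scores_by_day[days[0]])
```

ISO date strings sort chronologically, so a plain `sorted` is enough to find the first and last day.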

```
GET /v1/judge-runs?profile_id=<id>&verdict=fail&limit=50
```

Filters: `verdict` (pass/fail), `source_version`, `from`/`to` (ISO datetime), `limit` (1–200), `offset`.
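
A small URL builder keeps these filters within their documented ranges before a request is made. This is a sketch with a hypothetical helper name (`judge_runs_url`); it only covers the filters listed above.

```python
from urllib.parse import urlencode

BASE_URL = "https://evals.hou2touch.ai"


def judge_runs_url(profile_id, verdict=None, source_version=None,
                   from_=None, to=None, limit=50, offset=0):
    """Build a GET /v1/judge-runs URL, validating verdict and limit."""
    if verdict not in (None, "pass", "fail"):
        raise ValueError("verdict must be 'pass' or 'fail'")
    if not 1 <= limit <= 200:
        raise ValueError("limit must be between 1 and 200")
    if offset < 0:
        raise ValueError("offset must be non-negative")
    params = {
        "profile_id": profile_id,
        "verdict": verdict,
        "source_version": source_version,
        "from": from_,  # ISO datetime string
        "to": to,       # ISO datetime string
        "limit": limit,
        "offset": offset,
    }
    # Drop unset filters so they are not sent as the string "None".
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE_URL}/v1/judge-runs?{query}"
```

Paging through results is then a matter of increasing `offset` by `limit` on each call.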

---

## Judge profile registration

```
POST /v1/admin/judge-profiles
```

```json
{
  "profile_id": "my-repo-v1",
  "owner_repo": "my-repo",
  "score_scale": {"min": 0, "max": 10},
  "gate_mode": "advisory",
  "criteria": [
    {"name": "quality", "weight": 0.5},
    {"name": "relevance", "weight": 0.3},
    {"name": "accuracy", "weight": 0.2}
  ],
  "execution": {
    "default_tier": "api-sync",
    "default_model": "claude-sonnet-4-6"
  }
}
```

`gate_mode`: `"advisory"` (score only) or `"blocking"` (fail if below threshold).

Repo-scoped tokens can only register profiles where `owner_repo` matches the token's repo.
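
A pre-flight check can catch malformed profiles before they reach the admin endpoint. The sketch below checks the required fields listed under Core concepts and the repo-scoping rule. The weights-sum-to-1.0 check is an assumption drawn from the example above, not a documented server rule, and `check_profile` is a hypothetical name.

```python
REQUIRED_FIELDS = ("profile_id", "owner_repo", "criteria", "score_scale", "execution")


def check_profile(profile, token_repo="*"):
    """Return a list of problems found in a judge profile dict (empty if none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS
                if name not in profile]

    scale = profile.get("score_scale", {})
    if "min" in scale and "max" in scale and scale["min"] >= scale["max"]:
        problems.append("score_scale.min must be below score_scale.max")

    if "gate_mode" in profile and profile["gate_mode"] not in ("advisory", "blocking"):
        problems.append("gate_mode must be 'advisory' or 'blocking'")

    # ASSUMPTION: weights are expected to sum to 1.0, as in the example profile.
    weights = [c.get("weight", 0) for c in profile.get("criteria", [])]
    if weights and abs(sum(weights) - 1.0) > 1e-6:
        problems.append("criteria weights do not sum to 1.0")

    # Repo-scoped tokens may only register profiles for their own repo.
    if token_repo != "*" and profile.get("owner_repo") != token_repo:
        problems.append("owner_repo does not match token repo")
    return problems
```

An empty list means the profile passed these local checks; the server may still apply validation not covered here.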

---

## Sessions (SDK integration)

Open a session from your agent/pipeline:

```
POST /v1/sessions/start
```

```json
{
  "session_id": "run-001",
  "repo": "my-repo",
  "framework": "td-expert",
  "source": "td:v3",
  "host": "ci-runner",
  "run_env": "ci",
  "eval_set_id": "es-v1",
  "schema_version": "2.0",
  "sdk_version": "0.1.0",
  "metric_set_version": "core-1",
  "client_event_at": "2026-04-13T10:00:00+00:00",
  "context": {"frameworks_selected": ["CHOP"], "aspect": "translation"}
}
```

`context` JSONB stores arbitrary client metadata (e.g. which framework was evaluated, aspect of the task).

Close:
```
POST /v1/sessions/{id}/end
{"status": "success", ...}
```
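
The start/end pair maps naturally onto a context manager, so a session is always closed even if the evaluated code raises. In this sketch `post` is a caller-supplied function `(path, body)` that performs the authenticated HTTP call. The helper name `eval_session`, the `"failure"` status value, and the choice of which start fields to send are assumptions; only `"success"` appears in this document.

```python
from contextlib import contextmanager
from datetime import datetime, timezone


@contextmanager
def eval_session(post, session_id, repo, framework, source,
                 context=None, extra=None):
    """Open a session on entry and close it on exit via post(path, body)."""
    start = {
        "session_id": session_id,
        "repo": repo,
        "framework": framework,
        "source": source,
        "client_event_at": datetime.now(timezone.utc).isoformat(),
    }
    if context is not None:
        start["context"] = context  # arbitrary client metadata (JSONB)
    start.update(extra or {})       # e.g. host, run_env, eval_set_id
    post("/v1/sessions/start", start)

    status = "success"
    try:
        yield session_id
    except Exception:
        status = "failure"  # ASSUMPTION: this status value is not documented
        raise
    finally:
        post(f"/v1/sessions/{session_id}/end", {"status": status})
```

Injecting `post` keeps the sketch independent of any particular HTTP library and makes it easy to test with a stub.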

---

## Ground Truth packs

Create and populate a GT pack for optimizer training:

```bash
# CLI workflow
h2t-evals gt-init --pack-id my-pack-v1 --owner my-repo
h2t-evals gt-label --pack-dir ./packs/my-pack-v1    # terminal labeling
h2t-evals gt-label-ui --pack-dir ./packs/my-pack-v1  # web UI
h2t-evals gt-freeze --pack-dir ./packs/my-pack-v1
h2t-evals gt-publish --pack-dir ./packs/my-pack-v1 --dsn "postgresql://..."
```

Case format for RAG evaluation:
```json
{
  "input": {
    "assembled_prompt": "question",
    "model_output": "answer from model",
    "context": ["retrieved chunk 1", "retrieved chunk 2"]
  }
}
```

Case format for Agentic evaluation:
```json
{
  "input": {
    "assembled_prompt": "task description",
    "model_output": "agent response",
    "tools_called": [{"name": "search", "args": {"query": "..."}}],
    "expected_tools": [{"name": "search"}]
  }
}
```
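
Both case shapes can be generated programmatically when a pack is assembled from logs. The builders below mirror the two formats above. `missing_tools` is a purely illustrative local check and is not how the service or DeepEval scores tool use; all three function names are hypothetical.

```python
def rag_case(assembled_prompt, model_output, context_chunks):
    """Build a RAG case: prompt, model answer, and retrieved chunks."""
    return {"input": {
        "assembled_prompt": assembled_prompt,
        "model_output": model_output,
        "context": list(context_chunks),
    }}


def agentic_case(assembled_prompt, model_output, tools_called, expected_tools):
    """Build an Agentic case: prompt, agent answer, and tool-call records."""
    return {"input": {
        "assembled_prompt": assembled_prompt,
        "model_output": model_output,
        "tools_called": list(tools_called),
        "expected_tools": list(expected_tools),
    }}


def missing_tools(case):
    """Names of expected tools that never appear in tools_called."""
    called = {tool["name"] for tool in case["input"].get("tools_called", [])}
    return [tool["name"] for tool in case["input"].get("expected_tools", [])
            if tool["name"] not in called]
```

Running `missing_tools` over a pack before labeling is one way to flag cases where the agent skipped an expected tool.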

---

## Optimizer workers

Three methods are available; each registers its candidates at `POST /api/v1/gt/packs/{id}/candidates`:

```bash
# DSPy MIPROv2 (best quality, requires [optimizer] extra)
h2t-evals optimize-judge --method miprov2 --pack-db-id 3 --profile-id my-v1 \
  --dspy-model claude-sonnet-4-6 --baseline-text "..." --dsn "..."

# GEPA (reflection LM mutations, requires [optimizer] extra)
h2t-evals optimize-judge --method gepa --pack-db-id 3 --profile-id my-v1 \
  --dspy-model claude-sonnet-4-6 --reflection-model claude-opus-4-6 \
  --baseline-text "..." --dsn "..."

# auto-research (no DSPy, calls the live service)
h2t-evals auto-research --profile-id my-v1 --pack-db-id 3 \
  --execution-model claude-sonnet-4-6 --baseline-text "..." \
  --service-url https://evals.hou2touch.ai --token "$TOKEN"
```

Approve or reject candidates:
```
POST /api/v1/gt/packs/{id}/candidates/{cid}/approve
POST /api/v1/gt/packs/{id}/candidates/{cid}/reject
{"reason": "improvement too small"}
```
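
The candidate lifecycle (`pending → approved/rejected`, with a 409 `E_ALREADY_REVIEWED` on a second review) can be modeled as a tiny state machine, which is useful for client-side guards or test doubles. This is a sketch of that rule as described here, not the server's implementation; `review` and `AlreadyReviewed` are hypothetical names.

```python
class AlreadyReviewed(Exception):
    """Mirrors the documented E_ALREADY_REVIEWED error (HTTP 409)."""
    code = "E_ALREADY_REVIEWED"
    http_status = 409


def review(current_status, decision):
    """Return the new status for a candidate, or raise if already reviewed."""
    if decision not in ("approved", "rejected"):
        raise ValueError("decision must be 'approved' or 'rejected'")
    if current_status != "pending":
        raise AlreadyReviewed(f"candidate is already {current_status}")
    return decision
```

A client can call `review` before issuing the approve or reject request to avoid a round trip that is known to fail.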

---

## DeepEval metrics (optional)

RAG and Agentic metrics via the DeepEval library. Requires `pip install "h2t-evals[deepeval]"` (the quotes keep shells such as zsh from expanding the brackets).

```bash
# --metric-set accepts: rag | agentic | rag,agentic
h2t-evals deepeval-run-metrics \
  --pack-db-id 3 \
  --metric-set rag \
  --model claude-sonnet-4-6 \
  --profile-id my-repo-v1 \
  --dsn "postgresql://..."

Results stored as `judge_runs` with `judge_model=deepeval:MetricClassName` and `input_ref=pack:{id}`.
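
When reading `judge_runs` back, DeepEval rows can be told apart from regular judge rows by these two prefixes. A minimal parsing sketch, assuming each run is a dict with `judge_model` and `input_ref` keys. The helper name is hypothetical, and the pack id is kept as a string because its type is not specified here.

```python
def parse_deepeval_run(run):
    """Return {'metric', 'pack_id'} for a DeepEval run, or None otherwise."""
    model = run.get("judge_model") or ""
    ref = run.get("input_ref") or ""
    if not model.startswith("deepeval:") or not ref.startswith("pack:"):
        return None
    return {
        "metric": model[len("deepeval:"):],  # the DeepEval metric class name
        "pack_id": ref[len("pack:"):],       # kept as a string
    }
```

Rows for which this returns `None` are ordinary LLM-as-Judge runs or use a different reference format.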

To use G-Eval as judge tier in a profile:
```toml
[execution]
default_tier = "deepeval-local"
default_model = "claude-sonnet-4-6"
```

---

## Authentication

Two token types:
- **Wildcard** (`repo = "*"`) — full access, admin operations
- **Repo-scoped** (`repo = "my-repo"`) — restricted to own repo's sessions and profiles

Create tokens:
```bash
h2t-evals token-create --dsn "postgresql://..." --repo "*"         # admin
h2t-evals token-create --dsn "postgresql://..." --repo my-repo     # scoped
```

---

## Error codes

| Code | HTTP | Description |
|------|------|-------------|
| `E_AUTH_REQUIRED` | 401 | Missing token |
| `E_AUTH_INVALID` | 401 | Invalid token |
| `E_FORBIDDEN` | 403 | Token repo mismatch |
| `E_VALIDATION` | 400 | Invalid parameters |
| `E_JUDGE_PROFILE_UNKNOWN` | 400 | Profile not found |
| `E_BATCH_PARTIAL_FAILURE` | 502 | A batch item failed while `on_error` was `fail` |
| `E_ALREADY_REVIEWED` | 409 | Candidate already approved/rejected |
| `E_INTERNAL` | 500 | Internal error |
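
For client code, the table above can be carried as a lookup so errors are handled by category rather than by string matching. The sketch below copies the codes and statuses from the table. The three category labels and the `classify` helper are illustrative choices, and the shape of the error response body is not specified here.

```python
# Error code -> HTTP status, copied from the table above.
ERROR_CODES = {
    "E_AUTH_REQUIRED": 401,
    "E_AUTH_INVALID": 401,
    "E_FORBIDDEN": 403,
    "E_VALIDATION": 400,
    "E_JUDGE_PROFILE_UNKNOWN": 400,
    "E_BATCH_PARTIAL_FAILURE": 502,
    "E_ALREADY_REVIEWED": 409,
    "E_INTERNAL": 500,
}


def classify(code):
    """Group an error code as 'auth', 'client', 'server', or 'unknown'."""
    status = ERROR_CODES.get(code)
    if status is None:
        return "unknown"
    if status in (401, 403):
        return "auth"    # fix the token or its repo scope
    if status >= 500:
        return "server"  # failure on the service side
    return "client"      # fix the request before resending
```

An agent might refresh or replace its token on `auth`, correct its payload on `client`, and surface `server` errors to an operator.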

---

## Health

```
GET /health  → {"ok": true, "status": "healthy"}
GET /ready   → detailed status (DB, recent events)
GET /llms.txt → this file
```