# h2t-evals

LLM Evaluation & Autonomous Optimization Platform. Production eval infrastructure for the Hou2Touch AI ecosystem: judge scoring, ground truth packs, and autonomous prompt optimization.

**Production API:** https://evals.hou2touch.ai

**Auth:** `X-H2T-Token` header on all endpoints

> This file describes h2t-evals for AI agents and LLM-powered tooling. For human-readable docs, see the [API Guide](docs/client/api-guide.md).

---

## Quick start

```bash
# 1. Judge a response synchronously
curl -X POST https://evals.hou2touch.ai/v1/judge/execute \
  -H "X-H2T-Token: $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "profile_id": "my-repo-v1",
    "assembled_prompt": "Evaluate this response: ...",
    "repo": "my-repo",
    "router_policy": "balanced"
  }'

# 2. Get score history
curl "https://evals.hou2touch.ai/v1/progress?profile_id=my-repo-v1&repo=my-repo&window=30d" \
  -H "X-H2T-Token: $TOKEN"

# 3. Check health
curl https://evals.hou2touch.ai/health
```

---

## Core concepts

**Judge profile**: TOML/JSON config defining criteria, weights, score scale, and execution tier. Register with `POST /v1/admin/judge-profiles`. Required fields: `profile_id`, `owner_repo`, `criteria[]`, `score_scale`, `execution`.

**Session**: eval run from one agent invocation. Open with `POST /v1/sessions/start`, close with `POST /v1/sessions/{id}/end`. Sessions carry `repo`, `framework`, `source` (e.g. `td-expert:v3`), and optional `context` JSONB for client metadata.

**Judge run**: one LLM-as-Judge evaluation. Stored in `judge_runs` with `verdict` (pass/fail), `overall_score`, `criteria[]`, `cost_usd`, `latency_ms`. Queryable via `GET /v1/judge-runs`.

**Ground Truth pack**: labeled test dataset. Cases have an `input` dict with `assembled_prompt`, `model_output`, and optional `context` (RAG) or `tools_called` (Agentic). Used by optimizer workers.

**Candidate**: optimizer output (improved prompt). Status: `pending → approved/rejected`. Approval triggers a lineage write to h2t-graphs.
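
As a rough Python counterpart to the first curl call in the Quick start, the sketch below builds the `POST /v1/judge/execute` request with the standard library only. The helper name `build_judge_request` and the constants are hypothetical, not part of any h2t-evals SDK. The accepted `router_policy` values are the ones listed under Judge execution. The request is built but not sent, so the snippet runs without network access.

```python
import json
import urllib.request

BASE_URL = "https://evals.hou2touch.ai"
NAMED_POLICIES = {"cost_optimized", "balanced", "quality_first"}


def build_judge_request(token, profile_id, assembled_prompt, repo,
                        router_policy="balanced"):
    """Build (but do not send) a POST /v1/judge/execute request.

    Hypothetical helper for illustration; not an official client.
    """
    # router_policy is either a named policy or a cost-cap object.
    if isinstance(router_policy, str):
        if router_policy not in NAMED_POLICIES:
            raise ValueError(f"unknown router_policy: {router_policy!r}")
    elif not (isinstance(router_policy, dict) and "max_cost_usd" in router_policy):
        raise ValueError("router_policy must be a named policy or {'max_cost_usd': ...}")

    body = {
        "profile_id": profile_id,
        "assembled_prompt": assembled_prompt,
        "repo": repo,
        "router_policy": router_policy,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/judge/execute",
        data=json.dumps(body).encode("utf-8"),
        headers={"X-H2T-Token": token, "Content-Type": "application/json"},
        method="POST",
    )
```

To send the request, pass the result to `urllib.request.urlopen` and decode the JSON body. Validating `router_policy` on the client is a design choice of this sketch, so that a typo fails before a request is made.
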

---

## Judge execution

### Single judge run

```
POST /v1/judge/execute
```

```json
{
  "profile_id": "my-repo-v1",
  "assembled_prompt": "...",
  "repo": "my-repo",
  "router_policy": "balanced"
}
```

`router_policy` options: `"cost_optimized"`, `"balanced"`, `"quality_first"`, or `{"max_cost_usd": 0.005}`.

Response:

```json
{
  "ok": true,
  "verdict": "pass",
  "overall_score": 8.2,
  "confidence": 0.85,
  "criteria": [{"name": "quality", "score": 8.5, "weight": 0.5, "rationale": "..."}],
  "cost_usd": 0.003,
  "latency_ms": 1240
}
```

### Batch judge (up to 50 items)

```
POST /v1/judge/execute/batch
```

```json
{
  "profile_id": "my-repo-v1",
  "repo": "my-repo",
  "items": [
    {"item_id": "i1", "assembled_prompt": "...", "session_id": "sess-001"}
  ],
  "on_error": "continue"
}
```

`on_error`: `"continue"` (default) or `"fail"` (abort on first error).

---

## Score history and progress

```
GET /v1/progress?profile_id=&repo=&window=30d&group_by=version
```

Returns a timeline of scores grouped by `source_version` or `day`. Includes `trend` (the delta between the first-day and last-day averages).

```
GET /v1/judge-runs?profile_id=&verdict=fail&limit=50
```

Filters: `verdict` (pass/fail), `source_version`, `from`/`to` (ISO datetime), `limit` (1–200), `offset`.

---

## Judge profile registration

```
POST /v1/admin/judge-profiles
```

```json
{
  "profile_id": "my-repo-v1",
  "owner_repo": "my-repo",
  "score_scale": {"min": 0, "max": 10},
  "gate_mode": "advisory",
  "criteria": [
    {"name": "quality", "weight": 0.5},
    {"name": "relevance", "weight": 0.3},
    {"name": "accuracy", "weight": 0.2}
  ],
  "execution": {
    "default_tier": "api-sync",
    "default_model": "claude-sonnet-4-6"
  }
}
```

`gate_mode`: `"advisory"` (score only) or `"blocking"` (fail if below threshold).

Repo-scoped tokens can only register profiles where `owner_repo` matches the token's repo.
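
A profile payload can be sanity-checked on the client before it is posted. The following is a minimal sketch, not the service's own validation: `check_profile` is a hypothetical helper, the required-field list and `gate_mode` values come from this document, and the rule that criteria weights sum to 1.0 is an assumption inferred from the example payload (0.5 + 0.3 + 0.2).

```python
import math

REQUIRED_FIELDS = ("profile_id", "owner_repo", "criteria", "score_scale", "execution")
GATE_MODES = {"advisory", "blocking"}


def check_profile(profile, token_repo=None):
    """Return a list of problems in a judge profile payload (empty if none).

    Hypothetical client-side pre-flight check; the server may apply
    different or additional rules.
    """
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS
                if name not in profile]

    gate_mode = profile.get("gate_mode")
    if gate_mode is not None and gate_mode not in GATE_MODES:
        problems.append(f"unknown gate_mode: {gate_mode!r}")

    # Repo-scoped tokens may only register profiles for their own repo;
    # a wildcard token ("*") is not restricted.
    if token_repo not in (None, "*") and profile.get("owner_repo") != token_repo:
        problems.append("owner_repo does not match token repo")

    # ASSUMPTION: weights are expected to sum to 1.0, as in the example payload.
    criteria = profile.get("criteria") or []
    total = sum(c.get("weight", 0.0) for c in criteria)
    if criteria and not math.isclose(total, 1.0, abs_tol=1e-9):
        problems.append(f"criteria weights sum to {total:g}, expected 1.0")

    return problems
```

Calling `check_profile(payload, token_repo="my-repo")` on the example payload returns an empty list. Passing a different `token_repo` reports the mismatch that the service would otherwise reject.
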

---

## Sessions (SDK integration)

Open a session from your agent/pipeline:

```
POST /v1/sessions/start
```

```json
{
  "session_id": "run-001",
  "repo": "my-repo",
  "framework": "td-expert",
  "source": "td:v3",
  "host": "ci-runner",
  "run_env": "ci",
  "eval_set_id": "es-v1",
  "schema_version": "2.0",
  "sdk_version": "0.1.0",
  "metric_set_version": "core-1",
  "client_event_at": "2026-04-13T10:00:00+00:00",
  "context": {"frameworks_selected": ["CHOP"], "aspect": "translation"}
}
```

`context` JSONB stores arbitrary client metadata (e.g. which framework was evaluated, aspect of the task).

Close:

```
POST /v1/sessions/{id}/end
{"status": "success", ...}
```

---

## Ground Truth packs

Create and populate a GT pack for optimizer training:

```bash
# CLI workflow
h2t-evals gt-init --pack-id my-pack-v1 --owner my-repo
h2t-evals gt-label --pack-dir ./packs/my-pack-v1      # terminal labeling
h2t-evals gt-label-ui --pack-dir ./packs/my-pack-v1   # web UI
h2t-evals gt-freeze --pack-dir ./packs/my-pack-v1
h2t-evals gt-publish --pack-dir ./packs/my-pack-v1 --dsn "postgresql://..."
```

Case format for RAG evaluation:

```json
{
  "input": {
    "assembled_prompt": "question",
    "model_output": "answer from model",
    "context": ["retrieved chunk 1", "retrieved chunk 2"]
  }
}
```

Case format for Agentic evaluation:

```json
{
  "input": {
    "assembled_prompt": "task description",
    "model_output": "agent response",
    "tools_called": [{"name": "search", "args": {"query": "..."}}],
    "expected_tools": [{"name": "search"}]
  }
}
```

---

## Optimizer workers

Three methods, all of which register candidates at `POST /api/v1/gt/packs/{id}/candidates`:

```bash
# DSPy MIPROv2 (best quality, requires [optimizer] extra)
h2t-evals optimize-judge --method miprov2 --pack-db-id 3 --profile-id my-v1 \
  --dspy-model claude-sonnet-4-6 --baseline-text "..." --dsn "..."

# GEPA (reflection LM mutations, requires [optimizer] extra)
h2t-evals optimize-judge --method gepa --pack-db-id 3 --profile-id my-v1 \
  --dspy-model claude-sonnet-4-6 --reflection-model claude-opus-4-6 \
  --baseline-text "..." --dsn "..."

# auto-research (no DSPy, calls the live service)
h2t-evals auto-research --profile-id my-v1 --pack-db-id 3 \
  --execution-model claude-sonnet-4-6 --baseline-text "..." \
  --service-url https://evals.hou2touch.ai --token "$TOKEN"
```

Approve or reject candidates:

```
POST /api/v1/gt/packs/{id}/candidates/{cid}/approve
POST /api/v1/gt/packs/{id}/candidates/{cid}/reject
{"reason": "improvement too small"}
```

---

## DeepEval metrics (optional)

RAG and Agentic metrics via the DeepEval library. Requires `pip install "h2t-evals[deepeval]"`.

```bash
# --metric-set accepts: rag | agentic | rag,agentic
h2t-evals deepeval-run-metrics \
  --pack-db-id 3 \
  --metric-set rag \
  --model claude-sonnet-4-6 \
  --profile-id my-repo-v1 \
  --dsn "postgresql://..."
```

Results are stored as `judge_runs` with `judge_model=deepeval:MetricClassName` and `input_ref=pack:{id}`.

To use G-Eval as the judge tier in a profile:

```toml
[execution]
default_tier = "deepeval-local"
default_model = "claude-sonnet-4-6"
```

---

## Authentication

Two token types:

- **Wildcard** (`repo = "*"`): full access, admin operations
- **Repo-scoped** (`repo = "my-repo"`): restricted to own repo's sessions and profiles

Create tokens:

```bash
h2t-evals token-create --dsn "postgresql://..." --repo "*"       # admin
h2t-evals token-create --dsn "postgresql://..." --repo my-repo   # scoped
```

---

## Error codes

| Code | HTTP | Description |
|------|------|-------------|
| `E_AUTH_REQUIRED` | 401 | Missing token |
| `E_AUTH_INVALID` | 401 | Invalid token |
| `E_FORBIDDEN` | 403 | Token repo mismatch |
| `E_VALIDATION` | 400 | Invalid parameters |
| `E_JUDGE_PROFILE_UNKNOWN` | 400 | Profile not found |
| `E_BATCH_PARTIAL_FAILURE` | 502 | `on_error=fail` + error in batch |
| `E_ALREADY_REVIEWED` | 409 | Candidate already approved/rejected |
| `E_INTERNAL` | 500 | Internal error |

---

## Health

```
GET /health    → {"ok": true, "status": "healthy"}
GET /ready     → detailed status (DB, recent events)
GET /llms.txt  → this file
```
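
For client code, the Error codes table can be mirrored as a lookup. This is a sketch under stated assumptions: the codes, HTTP statuses, and descriptions are copied from the table, while the names `ApiError`, `ERRORS`, and `should_retry` are hypothetical, and the retry rule (retry only 5xx) is a client-side policy chosen for illustration, not documented service behavior. How the code is carried in an error response is not specified in this file, so extracting it is left out.

```python
from typing import NamedTuple


class ApiError(NamedTuple):
    http_status: int
    description: str


# Copied from the Error codes table in this document.
ERRORS = {
    "E_AUTH_REQUIRED": ApiError(401, "Missing token"),
    "E_AUTH_INVALID": ApiError(401, "Invalid token"),
    "E_FORBIDDEN": ApiError(403, "Token repo mismatch"),
    "E_VALIDATION": ApiError(400, "Invalid parameters"),
    "E_JUDGE_PROFILE_UNKNOWN": ApiError(400, "Profile not found"),
    "E_BATCH_PARTIAL_FAILURE": ApiError(502, "on_error=fail + error in batch"),
    "E_ALREADY_REVIEWED": ApiError(409, "Candidate already approved/rejected"),
    "E_INTERNAL": ApiError(500, "Internal error"),
}


def should_retry(code):
    """Return True if this sketch's policy would retry the failed call.

    ASSUMPTION: only 5xx errors are retried; 4xx errors need a changed
    request or token. Unknown codes are never retried.
    """
    error = ERRORS.get(code)
    return error is not None and error.http_status >= 500
```

Retrying after `E_BATCH_PARTIAL_FAILURE` resubmits the whole batch, which may re-run items that had already succeeded, so a caller might prefer `on_error: "continue"` and retry only the failed items.
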