Production eval infrastructure: judge scoring with custom criteria, smart model routing, ground truth packs, and 3 autonomous optimizer workers. Any AI agent sends content and gets back a structured verdict.
Loggers collect data. Dashboards visualize it. h2t-evals scores, gates, compares, and autonomously improves your prompts. The verdict is a decision, not a number.
3-layer ownership model (ADR-002) with clean boundaries: h2t-evals runs, scores, and optimizes. h2t-graphs stores artifacts, lineage, and versions. Clients instrument their own code.
graph LR
subgraph CLIENTS ["CLIENTS"]
C1["creative-thinking\nframework eval"] & C2["h2t-transcription\nenrichment quality"] & C3["CI pipelines\nregression gates"]
end
subgraph EVALS ["h2t-evals"]
JE["Judge\nEngine"] & SR["Smart\nRouter"] & OPT["Optimizer\nWorkers"] & DE["DeepEval\nMetrics"]
end
subgraph GRAPHS ["h2t-graphs"]
LN["Lineage"] & VR["Versions"] & AR["Artifacts"]
end
CLIENTS -->|"eval request"| EVALS
EVALS -->|"verdict"| CLIENTS
EVALS -->|"lineage on approve"| GRAPHS
style EVALS fill:#0a1a0d,stroke:#00ff88,color:#00ff88
style CLIENTS fill:#0a1a0d,stroke:#00ff88,color:#c0c0d0
style GRAPHS fill:#0a0d1a,stroke:#4a9eff,color:#c0c0d0
Currently serving: creative-thinking (active), h2t-transcription (onboarding M13)
Custom criteria with weights, multi-tier model routing, batch evaluation, and DeepEval metrics — all through the same endpoint.
Custom criteria profiles with per-criterion weights, rationale, and a pass/fail verdict. TOML or JSON config. Configurable score scale and gate mode (advisory or blocking).
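To make the weighted verdict concrete, here is a minimal sketch of how per-criterion weights could roll up into a pass/fail decision. The `Criterion` fields, the `pass_threshold` default, and the result keys are assumptions for illustration, not the h2t-evals schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    # Hypothetical shape of one judged criterion.
    name: str
    weight: float       # relative importance; weights need not sum to 1
    score: float        # judge score on the profile's score scale
    rationale: str = ""


def verdict(criteria, scale_max=10.0, pass_threshold=0.7, gate_mode="advisory"):
    """Combine per-criterion scores into one weighted verdict."""
    if gate_mode not in ("advisory", "blocking"):
        raise ValueError(f"unknown gate mode: {gate_mode}")
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("at least one criterion needs a positive weight")

    # Weighted mean on the original scale, then normalized to 0..1.
    weighted = sum(c.weight * c.score for c in criteria) / total_weight
    normalized = weighted / scale_max
    passed = normalized >= pass_threshold

    return {
        "score": round(weighted, 3),
        "normalized": round(normalized, 3),
        "passed": passed,
        # Only a blocking gate turns a failed verdict into a hard stop.
        "blocks_pipeline": gate_mode == "blocking" and not passed,
        "rationale": {c.name: c.rationale for c in criteria},
    }
```

In this sketch an advisory gate reports a failure without stopping anything, while a blocking gate sets `blocks_pipeline` so a CI step could exit non-zero.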
Auto-select model tier by budget. 3 presets: cost_optimized, balanced, quality_first. Or set max_cost_usd directly.
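A rough sketch of budget-based tier selection follows. The preset names and `max_cost_usd` come from the description above. The tier names, per-call cost estimates, and preset budgets are invented for illustration and do not reflect the router's real tables.

```python
# Hypothetical tiers; costs and quality ranks are placeholders.
TIERS = [
    {"name": "small", "est_cost_usd": 0.001, "quality": 1},
    {"name": "medium", "est_cost_usd": 0.01, "quality": 2},
    {"name": "large", "est_cost_usd": 0.05, "quality": 3},
]

# Assumed mapping from preset to a per-call budget ceiling.
PRESET_BUDGETS = {
    "cost_optimized": 0.002,
    "balanced": 0.02,
    "quality_first": 0.10,
}


def select_tier(preset="balanced", max_cost_usd=None):
    """Return the highest-quality tier whose estimated cost fits the budget."""
    if max_cost_usd is not None:
        budget = max_cost_usd          # an explicit budget overrides the preset
    elif preset in PRESET_BUDGETS:
        budget = PRESET_BUDGETS[preset]
    else:
        raise ValueError(f"unknown preset: {preset}")

    affordable = [t for t in TIERS if t["est_cost_usd"] <= budget]
    if not affordable:
        raise ValueError(f"no tier fits a budget of {budget} USD")
    return max(affordable, key=lambda t: t["quality"])["name"]
```

The design choice in this sketch is that an explicit `max_cost_usd` always wins over a preset, so callers can start with a preset and tighten the budget per request.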
Up to 50 items per request. on_error=continue|fail. Concurrent execution with per-item cost tracking. CI integration ready.
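The batch semantics can be sketched as below. The 50-item cap and the `on_error` values come from the description above. The `judge` callable, its `(score, cost_usd)` return shape, and the result keys are assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_BATCH_ITEMS = 50  # cap stated above


def run_batch(items, judge, on_error="continue", max_workers=8):
    """Judge items concurrently and track cost per item.

    `judge` is a hypothetical callable returning (score, cost_usd).
    """
    if on_error not in ("continue", "fail"):
        raise ValueError(f"unknown on_error mode: {on_error}")
    if len(items) > MAX_BATCH_ITEMS:
        raise ValueError(f"batch exceeds {MAX_BATCH_ITEMS} items")

    def run_one(item):
        try:
            score, cost = judge(item)
            return {"item": item, "score": score, "cost_usd": cost, "error": None}
        except Exception as exc:
            if on_error == "fail":
                raise  # surfaces when results are collected, aborting the batch
            return {"item": item, "score": None, "cost_usd": 0.0, "error": str(exc)}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, not completion order.
        results = list(pool.map(run_one, items))

    return {
        "results": results,
        "total_cost_usd": sum(r["cost_usd"] for r in results),
    }
```

With `on_error="continue"` a failed item becomes a result with an `error` string, so a CI job can still inspect every other score in the batch.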
RAG: Faithfulness, ContextualRelevancy, AnswerRelevancy. Agentic: ToolCorrectness, TaskCompletion. No custom evaluators needed.
Each worker takes a baseline prompt plus a ground truth pack and produces optimized candidates. You approve or reject. No manual A/B testing.
DSPy-based prompt optimization. Uses GT pack as training set. Best quality, highest cost. For critical prompts.
Genetic-Pareto with reflection LM for mutation guidance. Balanced quality/cost tradeoff. Good for iteration.
Zero-DSPy reflect→mutate→eval loop. No heavy deps, calls the live service. Cheapest, fastest iterations.
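The reflect→mutate→eval loop can be sketched as a plain hill-climb. This is an illustration of the idea, not the worker's actual code: `evaluate`, `reflect`, and `mutate` are hypothetical callables. In a real worker, reflection and mutation would presumably be LLM calls and evaluation would call the live scoring service.

```python
def optimize(baseline, evaluate, reflect, mutate, iterations=5):
    """Greedy reflect -> mutate -> eval loop over prompt candidates.

    evaluate(prompt) -> numeric score (higher is better)
    reflect(prompt, score) -> critique text
    mutate(prompt, critique) -> new candidate prompt
    """
    best_prompt, best_score = baseline, evaluate(baseline)
    history = [(best_prompt, best_score)]

    for _ in range(iterations):
        critique = reflect(best_prompt, best_score)    # what is weak right now
        candidate = mutate(best_prompt, critique)      # rewrite using the critique
        score = evaluate(candidate)                    # score the rewrite
        history.append((candidate, score))
        if score > best_score:                         # keep strict improvements only
            best_prompt, best_score = candidate, score

    return best_prompt, best_score, history
```

The returned `history` keeps every candidate with its score, which is the kind of record a human reviewer would need before approving or rejecting the winner.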
Candidate lifecycle — optimizer produces → pending → approve/reject → lineage write to h2t-graphs
Orchestrator — unified dispatch: POST /v1/optimizer/jobs with method + config
PG job queue — persistent queue, SIGTERM-safe workers, retry with backoff
Baseline contract — two modes: inline_text (prompt in request) or external_ref (resolved from h2t-graphs)
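As a sketch of how the orchestrator request and the baseline contract fit together, the helper below builds a job body for `POST /v1/optimizer/jobs`. Only the endpoint, the `method` and `config` fields, and the two baseline mode names come from the text above. The method identifiers, the `gt_pack_id` key, and the layout of the `baseline` object are assumptions.

```python
import json

# Hypothetical method identifiers for the three workers.
METHODS = {"dspy", "gepa", "auto_research"}


def build_job(method, gt_pack_id, *, inline_text=None, external_ref=None, config=None):
    """Build a job body with exactly one baseline mode set."""
    if method not in METHODS:
        raise ValueError(f"unknown method: {method}")
    # The baseline contract has two modes, so exactly one must be supplied.
    if (inline_text is None) == (external_ref is None):
        raise ValueError("provide exactly one of inline_text or external_ref")

    if inline_text is not None:
        baseline = {"mode": "inline_text", "text": inline_text}
    else:
        baseline = {"mode": "external_ref", "ref": external_ref}

    return {
        "method": method,
        "config": {"gt_pack_id": gt_pack_id, **(config or {})},
        "baseline": baseline,
    }


if __name__ == "__main__":
    job = build_job("gepa", "pack-demo", inline_text="You are a careful reviewer.")
    print(json.dumps(job, indent=2))
```

Rejecting a request that sets both modes, or neither, keeps the ambiguity out of the queue, so a worker never has to guess which baseline to resolve.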
GT packs are the foundation for optimizer workers. Strict lifecycle: init → label → freeze → publish. Web UI for labeling. Keyboard shortcuts for speed.
Create pack, add test cases with input, expected output, and optional context (RAG) or tools_called (Agentic).
Web labeler with keyboard shortcuts. Rate each case. Filter unlabeled. Batch operations.
Freeze locks the pack. Publish to DB. Optimizer workers use it as a training set.
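The strict lifecycle can be modeled as a one-way state machine. The sketch below is illustrative only: the class, method names, and case fields are assumptions, and it assumes cases may still be added until the pack is frozen.

```python
LIFECYCLE = ["init", "label", "freeze", "publish"]  # order stated above


class GTPack:
    """Toy ground truth pack that only moves forward through its lifecycle."""

    def __init__(self, name):
        self.name = name
        self.state = "init"
        self.cases = []

    def add_case(self, input_text, expected_output, context=None, tools_called=None):
        # Assumption: cases are editable before freeze and locked afterwards.
        if self.state not in ("init", "label"):
            raise RuntimeError(f"pack is locked in state '{self.state}'")
        self.cases.append({
            "input": input_text,
            "expected_output": expected_output,
            "context": context,            # optional, for RAG cases
            "tools_called": tools_called,  # optional, for agentic cases
        })

    def advance(self):
        """Move to the next lifecycle state; there is no way back."""
        idx = LIFECYCLE.index(self.state)
        if idx == len(LIFECYCLE) - 1:
            raise RuntimeError("pack is already published")
        self.state = LIFECYCLE[idx + 1]
        return self.state
```

Because there is no transition backwards, a published pack in this model is guaranteed to be the same data that was labeled and frozen.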
Passive tools observe. Active platforms improve. h2t-evals closes the loop: score → optimize → approve → deploy.
| Tool | Custom Judge | Optimization | GT Packs | Cost Control | Candidate Lifecycle |
|---|---|---|---|---|---|
| LangSmith | Limited | No | No | No | No |
| Braintrust | Yes | No | Datasets | No | No |
| DeepEval | Yes | No | Test cases | No | No |
| PromptFoo | Yes | Manual | Test suites | No | No |
| h2t-evals | Custom criteria + weights | 3 workers (auto) | Full lifecycle | Smart Router | Approve → lineage |
Differentiator: Autonomous optimization with candidate lifecycle. Not just scoring — continuous improvement with lineage tracking.
h2t-evals is the quality layer in a larger feedback loop. Evaluation feeds optimization, optimization feeds knowledge, knowledge feeds agents.
graph LR
subgraph AGENTS ["AI AGENTS"]
A1["DCC Skill"] & A2["Creative"] & A3["Transcription"]
end
subgraph EVALS ["h2t-evals"]
E1["Score"] & E2["Compare"] & E3["Optimize"]
end
subgraph GRAPHS ["h2t-graphs"]
G1["Knowledge"] & G2["Lineage"] & G3["Versions"]
end
AGENTS -->|"eval request"| EVALS
EVALS -->|"improved prompts"| AGENTS
EVALS -->|"lineage"| GRAPHS
GRAPHS -->|"context"| AGENTS
style EVALS fill:#0a1a0d,stroke:#00ff88,color:#00ff88
style AGENTS fill:#0a1a0d,stroke:#00ff88,color:#c0c0d0
style GRAPHS fill:#0a0d1a,stroke:#4a9eff,color:#c0c0d0
7 milestones delivered in 13 days. From zero to autonomous optimization platform.
Core API, SDK, judge profiles, scorecard, smart router, multi-provider support (local CLI, api-sync, api-batch). 52 issues closed.
GT pack lifecycle, label UI with keyboard shortcuts, benchmark-models worker, optimizer interface, candidate contract.
Smart Router with 3 presets, prompt cache/dedup, cost reporting, judge cost log.
DSPy MIPROv2 worker, benchmark-models worker. Full autonomous optimization pipeline.
Batch endpoint (50 items), judge-runs history API, score progress timeline, candidate approve/reject, context JSONB, repo-scoped tokens.
GEPA optimizer, auto-research (zero-DSPy loop), optimizer orchestrator, snapshot resolver, baseline contract (ADR-0003).
DeepEvalProvider (G-Eval tier), RAG + Agentic metrics runner, scoring_notes field, deepeval-run-metrics CLI, optional dep.
PG job queue for optimizer workers. h2t-transcription onboarding as second production client. Multi-tenant isolation.
Web dashboard for eval trends. Webhook callbacks on score regression. Alerting integration.
Lightweight. No Kubernetes. No managed services. One VPS, single-process Flask. 250 tests. CI/CD: push to main → deploy.
All endpoints require an X-H2T-Token header. Two token types: wildcard (admin) and repo-scoped (restricted to own profiles).
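The two token types can be pictured with a small access check. The `X-H2T-Token` header name is from the text above. The `Token` class, the `repo` field, and the rule that a repo-scoped token matches only profiles owned by its repo are assumptions for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    # Hypothetical token record; repo=None stands for a wildcard (admin) token.
    value: str
    repo: Optional[str] = None


def can_access(token: Token, profile_repo: str) -> bool:
    """Wildcard tokens see every profile; scoped tokens only their own repo's."""
    return token.repo is None or token.repo == profile_repo


def auth_headers(token: Token) -> dict:
    """Headers a client would attach to every request."""
    return {"X-H2T-Token": token.value}
```

A scoped token for `creative-thinking` would pass the check for its own profiles and fail it for profiles owned by `h2t-transcription`.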
Judge a response:
Autonomous optimization (CLI):
Score timeline:
For API access, contact @prcdrl on Telegram.