LLM Evaluation & Optimization Platform

Your agents need quality gates,
not just logging

Production eval infrastructure: judge scoring with custom criteria, smart model routing, ground truth packs, and 3 autonomous optimizer workers. Any AI agent sends content and gets back a structured verdict.

21 Capabilities
3 Optimizer Workers
250 Tests Passed
38 CLI Commands
13d M1 to DeepEval

positioning

Not a logger. An eval platform.

Loggers collect data. Dashboards visualize it. h2t-evals scores, gates, compares, and autonomously improves your prompts. The verdict is a decision, not a number.

Passive Eval Tools

  • Log LLM inputs/outputs
  • Single-model scoring
  • No optimization loop
  • Manual prompt iteration
  • No ground truth calibration
  • Human reviews everything

h2t-evals

  • Custom criteria profiles with weighted scoring
  • Smart Router: auto-select by cost/quality/latency
  • 3 optimizer workers improve prompts autonomously
  • Ground truth packs as training signal
  • DeepEval RAG + Agentic metrics built-in
  • Candidate lifecycle: approve → lineage → deploy

architecture

Evals owns execution. Graphs owns knowledge.

3-layer ownership model (ADR-002). Clean boundary — h2t-evals runs, scores, and optimizes. h2t-graphs stores artifacts, lineage, and versions. Clients instrument their own code.

graph LR
    subgraph CLIENTS ["CLIENTS"]
        C1["creative-thinking\nframework eval"] & C2["h2t-transcription\nenrichment quality"] & C3["CI pipelines\nregression gates"]
    end
    subgraph EVALS ["h2t-evals"]
        JE["Judge\nEngine"] & SR["Smart\nRouter"] & OPT["Optimizer\nWorkers"] & DE["DeepEval\nMetrics"]
    end
    subgraph GRAPHS ["h2t-graphs"]
        LN["Lineage"] & VR["Versions"] & AR["Artifacts"]
    end
    CLIENTS -->|"eval request"| EVALS
    EVALS -->|"verdict"| CLIENTS
    EVALS -->|"lineage on approve"| GRAPHS
    style EVALS fill:#0a1a0d,stroke:#00ff88,color:#00ff88
    style CLIENTS fill:#0a1a0d,stroke:#00ff88,color:#c0c0d0
    style GRAPHS fill:#0a0d1a,stroke:#4a9eff,color:#c0c0d0

Currently serving: creative-thinking (active), h2t-transcription (onboarding M13)

judge engine

Four scoring modes. One API call.

Custom criteria with weights, multi-tier model routing, batch evaluation, and DeepEval metrics — all through the same endpoint.

LLM-as-Judge

Custom criteria profiles with per-criterion weights, a rationale, and a pass/fail verdict. Profiles are configured in TOML or JSON, including the score scale and the gate mode (advisory or blocking).
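
To make the weighted scoring concrete, here is a minimal sketch of how per-criterion weights could roll up into a verdict. The criterion names, the 0-10 scale, the 7.0 threshold, and every function name are assumptions for illustration; the actual h2t-evals profile schema is not shown on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str      # hypothetical criterion name
    weight: float  # relative importance; weights need not sum to 1


def weighted_score(criteria: list[Criterion], scores: dict[str, float]) -> float:
    """Weighted mean of the per-criterion scores."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(c.weight * scores[c.name] for c in criteria) / total_weight


def verdict(score: float, threshold: float, gate_mode: str) -> dict:
    """Advisory gates only report; blocking gates also tell the caller to stop."""
    passed = score >= threshold
    return {
        "verdict": "pass" if passed else "fail",
        "overall_score": round(score, 2),
        "block": gate_mode == "blocking" and not passed,
    }


# Assumed profile and judge scores, for illustration only.
profile = [Criterion("accuracy", 0.5), Criterion("clarity", 0.3), Criterion("tone", 0.2)]
result = verdict(
    weighted_score(profile, {"accuracy": 9.0, "clarity": 8.0, "tone": 6.0}),
    threshold=7.0,
    gate_mode="blocking",
)
```

In this reading, the only difference between advisory and blocking is whether a failing verdict sets `block`; the score itself is computed the same way.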

Smart Router

Auto-select model tier by budget. 3 presets: cost_optimized, balanced, quality_first. Or set max_cost_usd directly.
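
One plausible reading of preset-based routing, sketched below: filter tiers by the cost ceiling, then pick according to the preset. The tier names, prices, quality numbers, and the "middle tier" interpretation of balanced are all assumptions; only the three preset names and max_cost_usd come from this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str        # hypothetical tier label
    cost_usd: float  # assumed cost per judge call
    quality: float   # assumed relative quality, 0..1


# Illustrative numbers only; real tiers and prices are not documented here.
TIERS = [
    Tier("small", 0.001, 0.70),
    Tier("medium", 0.004, 0.85),
    Tier("large", 0.020, 0.95),
]


def pick_tier(policy: str = "balanced", max_cost_usd: Optional[float] = None) -> Tier:
    """Drop tiers above the cost ceiling, then choose by preset."""
    candidates = [t for t in TIERS if max_cost_usd is None or t.cost_usd <= max_cost_usd]
    if not candidates:
        raise ValueError("no tier fits within max_cost_usd")
    if policy == "cost_optimized":
        return min(candidates, key=lambda t: t.cost_usd)
    if policy == "quality_first":
        return max(candidates, key=lambda t: t.quality)
    if policy == "balanced":
        # Assumption: "balanced" means the middle tier by cost.
        by_cost = sorted(candidates, key=lambda t: t.cost_usd)
        return by_cost[len(by_cost) // 2]
    raise ValueError(f"unknown policy: {policy}")
```

For example, `pick_tier("quality_first", max_cost_usd=0.005)` returns the medium tier here, because the large tier is filtered out before the preset is applied.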

Batch Evaluation

Up to 50 items per request. on_error=continue|fail. Concurrent execution with per-item cost tracking. CI integration ready.
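
The batch semantics above can be sketched client-side as follows. This is an illustration under assumptions, not the service's implementation: `run_batch`, `fake_judge`, the worker count, and the result shape are hypothetical, and this version applies on_error=fail only after all items finish.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

MAX_BATCH = 50  # per-request limit stated above


def run_batch(items: list[str], judge: Callable[[str], dict], on_error: str = "continue") -> dict:
    """Judge items concurrently and track cost per item."""
    if len(items) > MAX_BATCH:
        raise ValueError(f"batch exceeds {MAX_BATCH} items")
    if on_error not in ("continue", "fail"):
        raise ValueError("on_error must be 'continue' or 'fail'")

    def safe(item: str) -> dict:
        # Capture failures per item so one bad input does not lose the rest.
        try:
            return {"ok": True, **judge(item)}
        except Exception as exc:
            return {"ok": False, "error": str(exc), "cost_usd": 0.0}

    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(safe, items))  # map preserves input order

    if on_error == "fail" and any(not r["ok"] for r in results):
        raise RuntimeError("batch aborted: at least one item failed")
    return {"results": results, "total_cost_usd": sum(r["cost_usd"] for r in results)}


def fake_judge(text: str) -> dict:
    """Stand-in for a real judge call; rejects empty input."""
    if not text:
        raise ValueError("empty input")
    return {"verdict": "pass", "cost_usd": 0.003}


report = run_batch(["a", "", "b"], fake_judge, on_error="continue")
```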

DeepEval Metrics

RAG: Faithfulness, ContextualRelevancy, AnswerRelevancy. Agentic: ToolCorrectness, TaskCompletion. No custom evaluators needed.

autonomous optimization

Three optimizer workers. Zero manual prompt tuning.

Each worker takes a baseline prompt and a ground truth pack and produces optimized candidates. You approve or reject. No manual A/B testing.

01

MIPROv2 (DSPy)

DSPy-based prompt optimization. Uses the ground truth (GT) pack as its training set. Best quality, highest cost. For critical prompts.

02

GEPA

Genetic-Pareto with reflection LM for mutation guidance. Balanced quality/cost tradeoff. Good for iteration.

03

auto-research

Zero-DSPy reflect→mutate→eval loop. No heavy deps, calls the live service. Cheapest, fastest iterations.
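
A reflect→mutate→eval loop can be sketched as plain hill climbing. The version below is an assumption about the general shape, not the worker's actual algorithm: in a real run the three callables would call an LLM and the live eval service, whereas here they are deterministic stubs so the loop can be followed by hand.

```python
from typing import Callable


def auto_research(
    baseline: str,
    evaluate: Callable[[str], float],
    reflect: Callable[[str, float], str],
    mutate: Callable[[str, str], str],
    max_iterations: int = 30,
) -> tuple[str, float]:
    """Keep a mutated prompt only when it scores higher than the current best."""
    best, best_score = baseline, evaluate(baseline)
    baseline_score = best_score
    for _ in range(max_iterations):
        critique = reflect(best, best_score)  # what is wrong with the current best?
        candidate = mutate(best, critique)    # rewrite the prompt using the critique
        score = evaluate(candidate)           # score the rewrite
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score - baseline_score


# Deterministic stubs (hypothetical): the "judge" counts wanted instructions.
WANTED = ["cite sources", "be concise", "state uncertainty"]


def evaluate(prompt: str) -> float:
    return float(sum(phrase in prompt for phrase in WANTED))


def reflect(prompt: str, score: float) -> str:
    missing = [phrase for phrase in WANTED if phrase not in prompt]
    return missing[0] if missing else ""


def mutate(prompt: str, critique: str) -> str:
    return f"{prompt} Always {critique}." if critique else prompt


best, delta = auto_research("You are an expert.", evaluate, reflect, mutate, max_iterations=5)
```

The returned delta is the improvement over the baseline score, which is the kind of number a candidate would carry into review.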

Candidate lifecycle — optimizer produces → pending → approve/reject → lineage write to h2t-graphs
Orchestrator — unified dispatch: POST /v1/optimizer/jobs with method + config
PG job queue — persistent queue, SIGTERM-safe workers, retry with backoff
Baseline contract — two modes: inline_text (prompt in request) or external_ref (resolved from h2t-graphs)
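
The candidate lifecycle above can be read as a small state machine. The sketch below assumes status strings and a lineage callback that are not documented here; `decide`, `Candidate`, and the example prompts are hypothetical, and the list append stands in for the write to h2t-graphs.

```python
from dataclasses import dataclass
from typing import Callable

# Only pending candidates can be decided; approved and rejected are terminal.
TRANSITIONS = {"pending": {"approved", "rejected"}}


@dataclass
class Candidate:
    prompt: str
    delta: float             # score change versus the baseline
    status: str = "pending"  # optimizer output starts as pending


def decide(candidate: Candidate, decision: str, write_lineage: Callable[[Candidate], None]) -> Candidate:
    """Apply an approve/reject decision; lineage is written only on approval."""
    if decision not in TRANSITIONS.get(candidate.status, set()):
        raise ValueError(f"cannot move from {candidate.status} to {decision}")
    candidate.status = decision
    if decision == "approved":
        write_lineage(candidate)  # stand-in for the lineage write to h2t-graphs
    return candidate


lineage_log: list[Candidate] = []
winner = decide(Candidate("You are a careful expert...", delta=1.6), "approved", lineage_log.append)
loser = decide(Candidate("You are an expert!!", delta=-0.4), "rejected", lineage_log.append)
```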

ground truth

Labeled test packs. Strong training signal.

GT packs are the foundation for optimizer workers. Strict lifecycle: init → label → freeze → publish. Web UI for labeling. Keyboard shortcuts for speed.

01

Init + Add Cases

Create pack, add test cases with input, expected output, and optional context (RAG) or tools_called (Agentic).

02

Label (Web UI)

Web labeler with keyboard shortcuts. Rate each case. Filter unlabeled. Batch operations.

03

Freeze + Publish

Freeze locks the pack. Publish to DB. Optimizer workers use it as training set.
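
The init → label → freeze → publish order can be sketched as a strictly forward-moving pack object. Everything here is illustrative: the class, its method names, the integer rating, and the rule that every case must be rated before freezing are assumptions, not the h2t-evals data model.

```python
from typing import Optional

STAGES = ("init", "label", "freeze", "publish")


class GroundTruthPack:
    def __init__(self, name: str) -> None:
        self.name = name
        self.stage = "init"
        self.cases: list[dict] = []

    def add_case(self, input_text: str, expected: str,
                 context: Optional[list[str]] = None,
                 tools_called: Optional[list[str]] = None) -> None:
        """context serves RAG cases; tools_called serves agentic cases."""
        if self.stage in ("freeze", "publish"):
            raise RuntimeError("pack is frozen; cases can no longer change")
        self.cases.append({"input": input_text, "expected": expected,
                           "context": context, "tools_called": tools_called,
                           "rating": None})

    def rate(self, index: int, rating: int) -> None:
        if self.stage != "label":
            raise RuntimeError("ratings are only accepted in the label stage")
        self.cases[index]["rating"] = rating

    def unlabeled(self) -> list[int]:
        return [i for i, case in enumerate(self.cases) if case["rating"] is None]

    def advance(self) -> str:
        """Move to the next stage; stages cannot be skipped or reversed."""
        position = STAGES.index(self.stage)
        if position == len(STAGES) - 1:
            raise RuntimeError("pack is already published")
        next_stage = STAGES[position + 1]
        if next_stage == "freeze" and self.unlabeled():
            raise RuntimeError("label every case before freezing")  # assumed rule
        self.stage = next_stage
        return self.stage


pack = GroundTruthPack("demo")
pack.add_case("What is 2+2?", "4")
pack.add_case("Summarize the doc", "A short summary", context=["doc text"])
pack.advance()   # init -> label
pack.rate(0, 5)
pack.rate(1, 4)
pack.advance()   # label -> freeze
pack.advance()   # freeze -> publish
```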

competitive landscape

Why an active eval platform matters

Passive tools observe. Active platforms improve. h2t-evals closes the loop: score → optimize → approve → deploy.

Tool       | Custom Judge              | Optimization     | GT Packs       | Cost Control | Candidate Lifecycle
LangSmith  | Limited                   | No               | No             | No           | No
Braintrust | Yes                       | No               | Datasets       | No           | No
DeepEval   | Yes                       | No               | Test cases     | No           | No
PromptFoo  | Yes                       | Manual           | Test suites    | No           | No
h2t-evals  | Custom criteria + weights | 3 workers (auto) | Full lifecycle | Smart Router | Approve → lineage

Differentiator: Autonomous optimization with candidate lifecycle. Not just scoring — continuous improvement with lineage tracking.

ecosystem

Part of a self-improving platform

h2t-evals is the quality layer in a larger feedback loop. Evaluation feeds optimization, optimization feeds knowledge, knowledge feeds agents.

graph LR
    subgraph AGENTS ["AI AGENTS"]
        A1["DCC Skill"] & A2["Creative"] & A3["Transcription"]
    end
    subgraph EVALS ["h2t-evals"]
        E1["Score"] & E2["Compare"] & E3["Optimize"]
    end
    subgraph GRAPHS ["h2t-graphs"]
        G1["Knowledge"] & G2["Lineage"] & G3["Versions"]
    end
    AGENTS -->|"eval request"| EVALS
    EVALS -->|"improved prompts"| AGENTS
    EVALS -->|"lineage"| GRAPHS
    GRAPHS -->|"context"| AGENTS
    style EVALS fill:#0a1a0d,stroke:#00ff88,color:#00ff88
    style AGENTS fill:#0a1a0d,stroke:#00ff88,color:#c0c0d0
    style GRAPHS fill:#0a0d1a,stroke:#4a9eff,color:#c0c0d0

h2t-evals

  • Judge scoring with custom criteria
  • 3 autonomous optimizer workers
  • Ground truth packs + labeler UI
  • Cost-optimized smart routing

h2t-graphs

  • Schema-guided knowledge engine
  • Provenance & confidence tracking
  • 6-phase enrichment pipeline
  • Semantic + keyword search

telemetry

By the Numbers


250 Tests Passed
21 Capabilities
3 Optimizer Workers
38 CLI Subcommands
<$0.01 Judge Cost / Run
13d M1 to DeepEval
3 ADRs
99%+ CI Green Rate

roadmap

Full Journey

7 milestones delivered in 13 days. From zero to autonomous optimization platform.

M1-M7

Foundation (closed Apr 1)

Core API, SDK, judge profiles, scorecard, smart router, multi-provider support (local CLI, api-sync, api-batch). 52 issues closed.

M8

Ground Truth & Improvement Loop (closed Apr 8)

GT pack lifecycle, label UI with keyboard shortcuts, benchmark-models worker, optimizer interface, candidate contract.

M9

Scale & Cost Optimization (closed Apr 10)

Smart Router with 3 presets, prompt cache/dedup, cost reporting, judge cost log.

M10

Optimizer Workers (closed Apr 10)

DSPy MIPROv2 worker, benchmark-models worker. Full autonomous optimization pipeline.

M11

Consumer Integration (closed Apr 12)

Batch endpoint (50 items), judge-runs history API, score progress timeline, candidate approve/reject, context JSONB, repo-scoped tokens.

M12

Autonomous Loop (closed Apr 12)

GEPA optimizer, auto-research (zero-DSPy loop), optimizer orchestrator, snapshot resolver, baseline contract (ADR-0003).

DE

DeepEval Integration (closed Apr 13)

DeepEvalProvider (G-Eval tier), RAG + Agentic metrics runner, scoring_notes field, deepeval-run-metrics CLI, optional dep.

M13

Scale & Multi-Tenant (current)

PG job queue for optimizer workers. h2t-transcription onboarding as second production client. Multi-tenant isolation.

Dashboard & Webhooks (planned)

Web dashboard for eval trends. Webhook callbacks on score regression. Alerting integration.

stack

Built for solo + AI teams

Lightweight. No Kubernetes. No managed services. One VPS, single-process Flask. 250 tests. CI/CD: push main → deploy.

Python 3.11
Flask
PostgreSQL
psycopg3
DSPy (MIPROv2)
DeepEval
Smart Router
Claude Code
GitHub Actions
Hetzner VPS
Token Auth (RO/RW)
pytest

Get API Access

All endpoints require an X-H2T-Token header. Two token types: wildcard (admin) and repo-scoped (restricted to own profiles).
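
The two token types suggest a simple access rule, sketched below. The `Token` shape, the repo names, and the idea that each profile has an owning repo are assumptions for illustration; the real authorization logic is not described on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    repo: Optional[str]  # None models a wildcard (admin) token


def can_access(token: Token, profile_owner_repo: str) -> bool:
    """Wildcard tokens see every profile; repo-scoped tokens see only their own."""
    return token.repo is None or token.repo == profile_owner_repo


admin = Token(repo=None)
scoped = Token(repo="creative-thinking")  # hypothetical repo scope
```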

Judge a response:

curl -X POST https://evals.hou2touch.ai/v1/judge/execute \
  -H "X-H2T-Token: $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"profile_id": "my-v1", "assembled_prompt": "...", "router_policy": "balanced"}'

→ {"verdict": "pass", "overall_score": 8.2, "cost_usd": 0.003, "latency_ms": 1240}
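
The same call from Python, as a sketch using only the standard library. The endpoint, header, and JSON fields mirror the curl example above; `build_judge_request`, `gate`, and the 7.0 minimum score are hypothetical helpers for a CI-style check, and the request is built but not sent.

```python
import json
import urllib.request

BASE_URL = "https://evals.hou2touch.ai"


def build_judge_request(token: str, profile_id: str, assembled_prompt: str,
                        router_policy: str = "balanced") -> urllib.request.Request:
    """Build the POST request for /v1/judge/execute without sending it."""
    body = json.dumps({
        "profile_id": profile_id,
        "assembled_prompt": assembled_prompt,
        "router_policy": router_policy,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/judge/execute",
        data=body,
        headers={"X-H2T-Token": token, "Content-Type": "application/json"},
        method="POST",
    )


def gate(response: dict, min_score: float = 7.0) -> bool:
    """Assumed CI rule: require a pass verdict and a minimum overall score."""
    return response.get("verdict") == "pass" and response.get("overall_score", 0.0) >= min_score


req = build_judge_request("example-token", "my-v1", "...")
# To send it: with urllib.request.urlopen(req, timeout=30) as resp: data = json.load(resp)

# Response copied from the example above, used to exercise the gate offline.
sample = {"verdict": "pass", "overall_score": 8.2, "cost_usd": 0.003, "latency_ms": 1240}
ok = gate(sample)
```

A CI job could exit non-zero when `gate` returns False, which turns the verdict into a regression gate.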

Autonomous optimization (CLI):

h2t-evals auto-research \
  --profile-id my-v1 \
  --baseline-text "You are an expert..." \
  --pack-db-id 3 \
  --max-iterations 30 \
  --service-url https://evals.hou2touch.ai \
  --token "$TOKEN"

→ Best candidate: delta +1.6, registered as pending

Score timeline:

curl "https://evals.hou2touch.ai/v1/progress?profile_id=my-v1&window=30d" \
  -H "X-H2T-Token: $TOKEN"

For API access, contact @prcdrl on Telegram.