LLM Evaluation & Optimization Platform

Your agents need quality gates, not just logging

Production eval infrastructure: judge scoring with custom criteria, smart model routing, ground truth packs, and 3 autonomous optimizer workers. Any AI agent sends content and gets back a structured verdict.

Capabilities
3 Optimizer Workers
250 Tests Passed
CLI Commands
13d M1 to DeepEval

positioning

Not a logger. An eval platform.

Loggers collect data. Dashboards visualize it. h2t-evals scores, gates, compares, and autonomously improves your prompts. The verdict is a decision, not a number.

Passive Eval Tools

Log LLM inputs/outputs
Single-model scoring
No optimization loop
Manual prompt iteration
No ground truth calibration
Human reviews everything

h2t-evals

Custom criteria profiles with weighted scoring
Smart Router: auto-select by cost/quality/latency
3 optimizer workers improve prompts autonomously
Ground truth packs as training signal
DeepEval RAG + Agentic metrics built-in
Candidate lifecycle: approve → lineage → deploy

architecture

Evals owns execution. Graphs owns knowledge.

3-layer ownership model (ADR-002). Clean boundary — h2t-evals runs, scores, and optimizes. h2t-graphs stores artifacts, lineage, and versions. Clients instrument their own code.

graph LR
    subgraph CLIENTS ["CLIENTS"]
        C1["creative-thinking\nframework eval"] & C2["h2t-transcription\nenrichment quality"] & C3["CI pipelines\nregression gates"]
    end
    subgraph EVALS ["h2t-evals"]
        JE["Judge\nEngine"] & SR["Smart\nRouter"] & OPT["Optimizer\nWorkers"] & DE["DeepEval\nMetrics"]
    end
    subgraph GRAPHS ["h2t-graphs"]
        LN["Lineage"] & VR["Versions"] & AR["Artifacts"]
    end
    CLIENTS -->|"eval request"| EVALS
    EVALS -->|"verdict"| CLIENTS
    EVALS -->|"lineage on approve"| GRAPHS
    style EVALS fill:#0a1a0d,stroke:#00ff88,color:#00ff88
    style CLIENTS fill:#0a1a0d,stroke:#00ff88,color:#c0c0d0
    style GRAPHS fill:#0a0d1a,stroke:#4a9eff,color:#c0c0d0

Currently serving: creative-thinking (active), h2t-transcription (onboarding M13)

judge engine

Four scoring modes. One API call.

Custom criteria with weights, multi-tier model routing, batch evaluation, and DeepEval metrics — all through the same endpoint.

LLM-as-Judge

Custom criteria profiles with per-criterion weights, rationale, and a pass/fail verdict. Configured in TOML or JSON, with a configurable score scale and gate mode (advisory or blocking).
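To make the weighting idea concrete, here is a minimal sketch of how per-criterion scores could roll up into one verdict. It is an illustration only, not the h2t-evals implementation: the `Criterion` class, the `weighted_verdict` function, the `pass_threshold` parameter, and the `blocked` field are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float  # relative importance; weights need not sum to 1


def weighted_verdict(criteria, scores, pass_threshold, gate_mode="advisory"):
    """Combine per-criterion scores into a weighted average and a verdict.

    Hypothetical helper: the real profile schema and gate logic may differ.
    """
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("criteria weights must sum to a positive number")
    overall = sum(c.weight * scores[c.name] for c in criteria) / total_weight
    verdict = "pass" if overall >= pass_threshold else "fail"
    # Only a blocking gate turns a failing verdict into a hard stop.
    blocked = gate_mode == "blocking" and verdict == "fail"
    return {"overall_score": round(overall, 2), "verdict": verdict, "blocked": blocked}


profile = [Criterion("accuracy", 3.0), Criterion("clarity", 1.0)]
good = weighted_verdict(profile, {"accuracy": 9.0, "clarity": 5.0},
                        pass_threshold=7.0, gate_mode="blocking")
bad = weighted_verdict(profile, {"accuracy": 6.0, "clarity": 5.0},
                       pass_threshold=7.0, gate_mode="blocking")
```

In advisory mode the same failing score would still report `"fail"` but leave `blocked` false, so the caller decides what to do with it.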

Smart Router

Auto-select model tier by budget. 3 presets: cost_optimized, balanced, quality_first. Or set max_cost_usd directly.
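As a rough sketch of what budget-based tier selection can look like, the snippet below picks a tier from a small table. The three preset names and `max_cost_usd` come from the description above. The tier names, cost estimates, quality ranks, and the selection rules themselves are assumptions made up for illustration, not the router's actual behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    est_cost_usd: float  # hypothetical estimated cost per judge call
    quality: int         # hypothetical relative quality rank, higher is better


# Hypothetical tier table; real tiers and prices are not documented here.
TIERS = [Tier("small", 0.001, 1), Tier("medium", 0.003, 2), Tier("large", 0.02, 3)]


def pick_tier(policy="balanced", max_cost_usd: Optional[float] = None, tiers=TIERS):
    """Pick a model tier from a preset, or from an explicit per-call budget."""
    candidates = list(tiers)
    if max_cost_usd is not None:
        # An explicit budget overrides the preset: best quality that fits.
        affordable = [t for t in candidates if t.est_cost_usd <= max_cost_usd]
        if not affordable:
            raise ValueError("no tier fits the given budget")
        return max(affordable, key=lambda t: t.quality)
    if policy == "cost_optimized":
        return min(candidates, key=lambda t: t.est_cost_usd)
    if policy == "quality_first":
        return max(candidates, key=lambda t: t.quality)
    if policy == "balanced":
        by_cost = sorted(candidates, key=lambda t: t.est_cost_usd)
        return by_cost[len(by_cost) // 2]  # middle of the cost range
    raise ValueError(f"unknown policy: {policy}")
```

With this table, `pick_tier("balanced")` lands on the middle tier, and `pick_tier(max_cost_usd=0.005)` returns the best tier that stays under half a cent per call.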

Batch Evaluation

Up to 50 items per request. on_error=continue|fail. Concurrent execution with per-item cost tracking. CI integration ready.
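The `on_error` switch is easiest to see in code. The sketch below mimics the described behavior locally: a 50-item cap, concurrent execution, and a cost recorded per item. The `run_batch` function, the `judge` callable, and the result fields are hypothetical stand-ins, and the real endpoint may abort earlier in `fail` mode than this simplified version does.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_BATCH_ITEMS = 50  # limit stated in the description above


def run_batch(items, judge, on_error="continue", max_workers=8):
    """Judge items concurrently and track cost per item.

    `judge` is a hypothetical callable returning a dict with a "cost_usd" key.
    """
    if len(items) > MAX_BATCH_ITEMS:
        raise ValueError(f"batch is limited to {MAX_BATCH_ITEMS} items")
    if on_error not in ("continue", "fail"):
        raise ValueError("on_error must be 'continue' or 'fail'")

    def safe_judge(item):
        try:
            return {"ok": True, **judge(item)}
        except Exception as exc:  # capture the error so other items still run
            return {"ok": False, "error": str(exc), "cost_usd": 0.0}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(safe_judge, items))  # map preserves input order

    failed = [r for r in results if not r["ok"]]
    if on_error == "fail" and failed:
        raise RuntimeError(f"{len(failed)} item(s) failed")
    total = sum(r["cost_usd"] for r in results)
    return {"results": results, "total_cost_usd": round(total, 6)}


def fake_judge(item):
    """Stand-in judge: rejects empty content, charges a flat fee otherwise."""
    if not item:
        raise ValueError("empty content")
    return {"verdict": "pass", "cost_usd": 0.003}


batch = run_batch(["draft A", "", "draft C"], fake_judge)
```

With `on_error="continue"` the empty item is reported as a failed entry while the other two are scored and billed; with `on_error="fail"` the same input raises instead.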

DeepEval Metrics

RAG: Faithfulness, ContextualRelevancy, AnswerRelevancy. Agentic: ToolCorrectness, TaskCompletion. No custom evaluators needed.

autonomous optimization

Three optimizer workers. Zero manual prompt tuning.

Each worker takes a baseline prompt and a ground truth pack and produces optimized candidates. You approve or reject them. No manual A/B testing.

MIPROv2 (DSPy)

DSPy-based prompt optimization. Uses GT pack as training set. Best quality, highest cost. For critical prompts.

GEPA

Genetic-Pareto with reflection LM for mutation guidance. Balanced quality/cost tradeoff. Good for iteration.

auto-research

Zero-DSPy reflect→mutate→eval loop. No heavy deps, calls the live service. Cheapest, fastest iterations.
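The reflect→mutate→eval loop can be sketched as a simple hill climb. This is a toy version under stated assumptions: `auto_research` here is a local function, not the CLI command, and `evaluate`, `reflect`, and `mutate` are hypothetical callables. In the real worker the evaluation step would call the live judge service and the reflection step would use an LLM. The toy versions below just check for keywords.

```python
def auto_research(baseline, evaluate, reflect, mutate, max_iterations=30):
    """Keep the best-scoring prompt found by repeated reflect -> mutate -> eval."""
    best_prompt = baseline
    baseline_score = best_score = evaluate(baseline)
    for _ in range(max_iterations):
        feedback = reflect(best_prompt, best_score)  # what should improve?
        candidate = mutate(best_prompt, feedback)    # apply the suggestion
        score = evaluate(candidate)                  # re-score the candidate
        if score > best_score:                       # keep only improvements
            best_prompt, best_score = candidate, score
    return {"prompt": best_prompt, "score": best_score,
            "delta": best_score - baseline_score}


# Toy stand-ins so the loop runs without any model or network access.
REQUIRED = ("concise", "cite sources", "step by step")


def toy_evaluate(prompt):
    return float(sum(keyword in prompt for keyword in REQUIRED))


def toy_reflect(prompt, score):
    missing = [keyword for keyword in REQUIRED if keyword not in prompt]
    return missing[0] if missing else ""


def toy_mutate(prompt, feedback):
    return f"{prompt} Requirement: {feedback}." if feedback else prompt


outcome = auto_research("You are an expert.", toy_evaluate, toy_reflect,
                        toy_mutate, max_iterations=5)
```

The returned `delta` plays the same role as the score delta reported by the CLI: how far the best candidate moved from the baseline.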

Candidate lifecycle — optimizer produces → pending → approve/reject → lineage write to h2t-graphs
Orchestrator — unified dispatch: POST /v1/optimizer/jobs with method + config
PG job queue — persistent queue, SIGTERM-safe workers, retry with backoff
Baseline contract — two modes: inline_text (prompt in request) or external_ref (resolved from h2t-graphs)
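A job request for the orchestrator might be assembled as shown below. Only the endpoint path, the `method` + `config` pairing, and the two baseline modes (`inline_text`, `external_ref`) come from the list above. The method identifiers, the `text` and `ref` field names, and the overall payload shape are assumptions for illustration. Check the actual API schema before relying on them.

```python
import json

# Hypothetical method identifiers; the real API may spell these differently.
KNOWN_METHODS = {"miprov2", "gepa", "auto-research"}


def build_job_request(method, profile_id, baseline, config=None):
    """Validate a baseline and build a payload for POST /v1/optimizer/jobs."""
    if method not in KNOWN_METHODS:
        raise ValueError(f"unknown optimizer method: {method}")
    mode = baseline.get("mode")
    if mode == "inline_text":
        # The prompt text travels in the request itself.
        if not baseline.get("text"):
            raise ValueError("inline_text baseline needs a non-empty 'text'")
    elif mode == "external_ref":
        # The prompt is resolved from h2t-graphs by reference.
        if not baseline.get("ref"):
            raise ValueError("external_ref baseline needs a non-empty 'ref'")
    else:
        raise ValueError("baseline mode must be 'inline_text' or 'external_ref'")
    return {"method": method, "profile_id": profile_id,
            "baseline": baseline, "config": config or {}}


payload = build_job_request(
    "gepa",
    "my-v1",
    {"mode": "inline_text", "text": "You are an expert..."},
    config={"max_iterations": 30},
)
body = json.dumps(payload)  # ready to send as the request body
```

Validating the baseline mode on the client side keeps malformed jobs out of the persistent queue, where a bad payload would otherwise only fail after a worker picks it up.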

ground truth

Labeled test packs. Strong training signal.

GT packs are the foundation for optimizer workers. Strict lifecycle: init → label → freeze → publish. Web UI for labeling. Keyboard shortcuts for speed.

Init + Add Cases

Create pack, add test cases with input, expected output, and optional context (RAG) or tools_called (Agentic).

Label (Web UI)

Web labeler with keyboard shortcuts. Rate each case. Filter unlabeled. Batch operations.

Freeze + Publish

Freeze locks the pack. Publish to DB. Optimizer workers use it as training set.
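The strict lifecycle amounts to a small state machine. The sketch below models it with a toy `GTPack` class. The class, its method names, and the rule that every case must be labeled before freezing are assumptions for illustration. The stage names follow the init → label → freeze → publish sequence described above.

```python
class GTPack:
    """Toy ground truth pack that enforces init -> label -> freeze -> publish."""

    def __init__(self, name):
        self.name = name
        self.stage = "init"
        self.cases = []

    def _require_editable(self):
        if self.stage in ("freeze", "publish"):
            raise RuntimeError(f"pack '{self.name}' is frozen and cannot be edited")

    def add_case(self, input_text, expected_output, context=None, tools_called=None):
        self._require_editable()
        self.cases.append({
            "input": input_text,
            "expected_output": expected_output,
            "context": context,            # optional, for RAG cases
            "tools_called": tools_called,  # optional, for agentic cases
            "label": None,
        })

    def label(self, index, rating):
        self._require_editable()
        self.cases[index]["label"] = rating
        self.stage = "label"

    def freeze(self):
        if self.stage != "label":
            raise RuntimeError("only a pack in the label stage can be frozen")
        # Assumption: every case must carry a label before the pack locks.
        if any(case["label"] is None for case in self.cases):
            raise RuntimeError("every case needs a label before freezing")
        self.stage = "freeze"

    def publish(self):
        if self.stage != "freeze":
            raise RuntimeError("only a frozen pack can be published")
        self.stage = "publish"


pack = GTPack("support-v1")
pack.add_case("What is the refund window?", "30 days",
              context=["Refunds are accepted within 30 days."])
pack.label(0, "good")
pack.freeze()
pack.publish()
```

Locking edits at freeze time is what lets an optimizer treat the pack as a stable training set: scores from different runs stay comparable because the cases cannot change underneath them.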

competitive landscape

Why an active eval platform matters

Passive tools observe. Active platforms improve. h2t-evals closes the loop: score → optimize → approve → deploy.

Tool	Custom Judge	Optimization	GT Packs	Cost Control	Candidate Lifecycle
LangSmith	Limited	No	No	No	No
Braintrust	Yes	No	Datasets	No	No
DeepEval	Yes	No	Test cases	No	No
PromptFoo	Yes	Manual	Test suites	No	No
h2t-evals	Custom criteria + weights	3 workers (auto)	Full lifecycle	Smart Router	Approve → lineage

Differentiator: Autonomous optimization with candidate lifecycle. Not just scoring — continuous improvement with lineage tracking.

ecosystem

Part of a self-improving platform

h2t-evals is the quality layer in a larger feedback loop. Evaluation feeds optimization, optimization feeds knowledge, knowledge feeds agents.

graph LR
    subgraph AGENTS ["AI AGENTS"]
        A1["DCC Skill"] & A2["Creative"] & A3["Transcription"]
    end
    subgraph EVALS ["h2t-evals"]
        E1["Score"] & E2["Compare"] & E3["Optimize"]
    end
    subgraph GRAPHS ["h2t-graphs"]
        G1["Knowledge"] & G2["Lineage"] & G3["Versions"]
    end
    AGENTS -->|"eval request"| EVALS
    EVALS -->|"improved prompts"| AGENTS
    EVALS -->|"lineage"| GRAPHS
    GRAPHS -->|"context"| AGENTS
    style EVALS fill:#0a1a0d,stroke:#00ff88,color:#00ff88
    style AGENTS fill:#0a1a0d,stroke:#00ff88,color:#c0c0d0
    style GRAPHS fill:#0a0d1a,stroke:#4a9eff,color:#c0c0d0

h2t-evals

Judge scoring with custom criteria
3 autonomous optimizer workers
Ground truth packs + labeler UI
Cost-optimized smart routing

h2t-graphs

Schema-guided knowledge engine
Provenance & confidence tracking
6-phase enrichment pipeline
Semantic + keyword search

roadmap

Full Journey

7 milestones delivered in 13 days. From zero to autonomous optimization platform.

M1-M7

Foundation (closed Apr 1)

Core API, SDK, judge profiles, scorecard, smart router, multi-provider support (local CLI, api-sync, api-batch). 52 issues closed.

Ground Truth & Improvement Loop (closed Apr 8)

GT pack lifecycle, label UI with keyboard shortcuts, benchmark-models worker, optimizer interface, candidate contract.

Scale & Cost Optimization (closed Apr 10)

Smart Router with 3 presets, prompt cache/dedup, cost reporting, judge cost log.

M10

Optimizer Workers (closed Apr 10)

DSPy MIPROv2 worker, benchmark-models worker. Full autonomous optimization pipeline.

M11

Consumer Integration (closed Apr 12)

Batch endpoint (50 items), judge-runs history API, score progress timeline, candidate approve/reject, context JSONB, repo-scoped tokens.

M12

Autonomous Loop (closed Apr 12)

GEPA optimizer, auto-research (zero-DSPy loop), optimizer orchestrator, snapshot resolver, baseline contract (ADR-0003).

DeepEval Integration (closed Apr 13)

DeepEvalProvider (G-Eval tier), RAG + Agentic metrics runner, scoring_notes field, deepeval-run-metrics CLI, optional dep.

M13

Scale & Multi-Tenant (current)

PG job queue for optimizer workers. h2t-transcription onboarding as second production client. Multi-tenant isolation.

M14

Dashboard & Webhooks (planned)

Web dashboard for eval trends. Webhook callbacks on score regression. Alerting integration.

access

Get API Access

All endpoints require an X-H2T-Token header. Two token types: wildcard (admin) and repo-scoped (restricted to own profiles).

Judge a response:

curl -X POST https://evals.hou2touch.ai/v1/judge/execute \
  -H "X-H2T-Token: $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"profile_id": "my-v1", "assembled_prompt": "...", "router_policy": "balanced"}'

→ {"verdict": "pass", "overall_score": 8.2, "cost_usd": 0.003, "latency_ms": 1240}
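To use a verdict as a CI quality gate, the response can be mapped to a process exit code. The sketch below works on the response shape shown above (`verdict`, `overall_score`). The `gate_exit_code` function and its optional `min_score` parameter are hypothetical additions for this example, not part of the API.

```python
import json


def gate_exit_code(response_body, min_score=None):
    """Return 0 if the verdict passes the gate, 1 otherwise.

    Hypothetical helper for CI scripts; `min_score` adds an optional
    stricter floor on top of the service's own pass/fail verdict.
    """
    result = json.loads(response_body)
    if result.get("verdict") != "pass":
        return 1
    if min_score is not None and result.get("overall_score", 0.0) < min_score:
        return 1
    return 0


sample = ('{"verdict": "pass", "overall_score": 8.2, '
          '"cost_usd": 0.003, "latency_ms": 1240}')
exit_code = gate_exit_code(sample)
```

A CI step could pass the result to `sys.exit`, so a failing verdict stops the pipeline the same way a failing test would.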

Autonomous optimization (CLI):

h2t-evals auto-research \
  --profile-id my-v1 \
  --baseline-text "You are an expert..." \
  --pack-db-id 3 \
  --max-iterations 30 \
  --service-url https://evals.hou2touch.ai \
  --token "$TOKEN"

→ Best candidate: delta +1.6, registered as pending

Score timeline:

curl "https://evals.hou2touch.ai/v1/progress?profile_id=my-v1&window=30d" \
  -H "X-H2T-Token: $TOKEN"

For API access, contact @prcdrl on Telegram.
