TrustedRouter Evals Guide

Run model, provider, privacy, latency, and cost evals through one OpenAI-compatible API.

1 URL: base_url migration
100s: models and routes
0: prompt logs by default

Run evals through TrustedRouter

model switching

Use TrustedRouter when an agent or eval runner needs to compare models, providers, privacy tiers, and cost without rewriting client code.

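import os

from openai import OpenAI

# Only the API key and base_url differ from a stock OpenAI client.
client = OpenAI(
    api_key=os.environ["TRUSTEDROUTER_API_KEY"],
    base_url="https://api.trustedrouter.com/v1",
)

def ask(model: str, prompt: str) -> str:
    """Send one user prompt to the given route alias and return the reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=300,  # hard cap on output tokens per call
    )
    # message.content can be None, so fall back to an empty string.
    return response.choices[0].message.content or ""

# Same prompt and client for every run; only the model alias changes.
ROUTES = ["trustedrouter/zdr", "trustedrouter/eu", "trustedrouter/auto", "trustedrouter/cheap"]
for model in ROUTES:
    print(model, ask(model, "Recommend 5 real albums for fans of Talk Talk."))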
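
For latency comparisons across the same aliases, the call can be streamed and timed. The sketch below is one possible approach, not TrustedRouter's own tooling: `time_stream` and `stream_text` are hypothetical helper names, and it assumes the gateway supports the OpenAI SDK's `stream=True` option. The timer is kept separate from the network call so it can be checked offline.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple

def time_stream(
    pieces: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[str, float, float]:
    """Consume text pieces; return (text, seconds_to_first_piece, total_seconds)."""
    start = clock()
    first = None
    parts = []
    for piece in pieces:
        # Record the moment the first non-empty piece arrives.
        if piece and first is None:
            first = clock()
        parts.append(piece)
    end = clock()
    ttft = (first if first is not None else end) - start
    return "".join(parts), ttft, end - start

def stream_text(client, model: str, prompt: str) -> Iterator[str]:
    """Yield text deltas from a streamed chat completion (hypothetical helper).

    Because this is a generator, the request is only sent once iteration
    starts, so time_stream's start time precedes the network call.
    """
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=300,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content
```

Usage with the client defined above would look like `text, ttft, total = time_stream(stream_text(client, "trustedrouter/auto", "Hello"))`. Only the timings and the model alias need to be kept; the text can be discarded after scoring.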

What to record

metadata only
  • Model and provider route selected.
  • Latency, time to first token, and token counts.
  • Cost in integer microdollars, not floats.
  • Evaluator score and blind judge notes.
  • Request ID for debugging. Do not store prompt/output unless the eval owner explicitly opts in.
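
The list above maps naturally onto a small record type. The following is a minimal sketch under stated assumptions: the field names, `EvalRecord`, `usd_to_microdollars`, and `to_jsonl_line` are all illustrative and not part of any TrustedRouter schema. It shows cost held as an integer number of microdollars and prompt/output dropped unless storage is explicitly enabled.

```python
import json
from dataclasses import asdict, dataclass
from decimal import Decimal
from typing import Optional

MICRODOLLARS_PER_USD = 1_000_000

def usd_to_microdollars(usd: str) -> int:
    """Convert a decimal USD string to integer microdollars without using floats."""
    return int((Decimal(usd) * MICRODOLLARS_PER_USD).to_integral_value())

@dataclass
class EvalRecord:
    # All field names here are assumptions made for this example.
    request_id: str
    model: str
    provider_route: str
    latency_ms: int
    ttft_ms: int
    prompt_tokens: int
    completion_tokens: int
    cost_microdollars: int
    score: float
    judge_notes: str = ""
    prompt: Optional[str] = None   # kept only when the eval owner opts in
    output: Optional[str] = None   # kept only when the eval owner opts in

def to_jsonl_line(record: EvalRecord, store_content: bool = False) -> str:
    """Serialize one record; strip prompt and output unless store_content is True."""
    row = asdict(record)
    if not store_content:
        row.pop("prompt", None)
        row.pop("output", None)
    return json.dumps(row, sort_keys=True)
```

Integer microdollars sum exactly across thousands of calls, whereas float dollars accumulate rounding error that makes per-run totals hard to reconcile.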

Default test pools

safe starts
  • trustedrouter/zdr for legal, medical, financial, and private customer data.
  • trustedrouter/e2e when the eval needs an end-to-end encrypted provider route.
  • trustedrouter/eu for Europe-focused model and provider selection.
  • trustedrouter/auto for availability and broad fallback testing.
  • trustedrouter/cheap for wide sweeps where cost matters more than top quality.
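
An eval runner could encode these defaults as a lookup so that sensitive datasets never land on a broad route by accident. This is only a sketch: the tag names (`"legal"`, `"needs-e2e"`, `"eu"`, `"cost-sweep"`, and so on) and the priority order are assumptions made for illustration, while the aliases are the ones listed above.

```python
from typing import Iterable

# Hypothetical dataset tags that should force the zero-data-retention pool.
SENSITIVE_TAGS = frozenset({"legal", "medical", "financial", "private-customer"})

def pick_pool(tags: Iterable[str]) -> str:
    """Return a default route alias for a dataset, checking strictest needs first."""
    tag_set = set(tags)
    if tag_set & SENSITIVE_TAGS:
        return "trustedrouter/zdr"
    if "needs-e2e" in tag_set:
        return "trustedrouter/e2e"
    if "eu" in tag_set:
        return "trustedrouter/eu"
    if "cost-sweep" in tag_set:
        return "trustedrouter/cheap"
    # No special requirement: use the broad fallback pool.
    return "trustedrouter/auto"
```

Checking the sensitive tags first means a dataset tagged both `"medical"` and `"cost-sweep"` still resolves to `trustedrouter/zdr`.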

Cost discipline

before large runs

Start each eval with a small model set and a hard token cap. Check the public leaderboard and model pages before widening the run.

uv run python scripts/fusion_micro_eval.py \
  --mode micro-hybrid \
  --max-cost-usd 1.00

The Fusion micro runner estimates the cost of a deterministic 20-task tuning slice, keeps live search on a tiny smoke subset, and refuses any run whose estimate would exceed the hard cap.
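
The refuse-before-running idea can be sketched as a worst-case estimate compared against a cap. This is not the logic of `fusion_micro_eval.py`, whose internals are not shown here; the function names, the price table format, and the prices themselves are hypothetical. Prices are expressed in microdollars per million tokens so that all arithmetic stays in integers.

```python
from typing import Dict, Tuple

TOKENS_PER_MTOK = 1_000_000

class BudgetExceeded(RuntimeError):
    """Raised when a planned run could cost more than the hard cap."""

def worst_case_microdollars(
    n_tasks: int,
    prompt_tokens: int,
    max_tokens: int,
    prices: Dict[str, Tuple[int, int]],
) -> int:
    """Upper-bound cost if every call uses its full max_tokens.

    prices maps model alias -> (input, output) price in microdollars
    per million tokens. These values are placeholders, not real rates.
    """
    total = 0
    for in_price, out_price in prices.values():
        per_call = prompt_tokens * in_price + max_tokens * out_price
        total += n_tasks * per_call
    # Ceiling division, so the estimate never rounds down.
    return -(-total // TOKENS_PER_MTOK)

def check_budget(estimate_microdollars: int, cap_microdollars: int) -> None:
    """Refuse the run before any provider call if the estimate exceeds the cap."""
    if estimate_microdollars > cap_microdollars:
        raise BudgetExceeded(
            f"estimated {estimate_microdollars} > cap {cap_microdollars} microdollars"
        )
```

For example, 20 tasks with 200 prompt tokens and a 300-token output cap, at placeholder prices of 500,000 and 1,500,000 microdollars per million tokens, give a worst case of 11,000 microdollars, which passes a 1,000,000-microdollar (1.00 USD) cap.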

uv run python scripts/fusion_full_eval.py --pilot --fetch-draco
uv run python scripts/fusion_live_eval.py \
  --task-count 3 \
  --config fusion_tr_budget \
  --budget-usd 5.00 \
  --execute

The full DRACO reproduction estimator budgets 100 tasks, live search on generation calls, and three judge passes before any provider call runs. The live pilot fetches the public DRACO split, uses Exa with benchmark/rubric hostnames excluded, and writes local JSONL results. The hosted trustedrouter/fusion API alias runs panel, judge, and final synthesis calls inside the attested gateway.
