Evaluations

Track agent quality, catch regressions before they reach production, and inspect failed traces. Eval sets, runs, judges, failure clusters, SLOs, and quality gates.

PromptRails Evaluations is a quality control surface for your agents, prompts, and models. It answers three questions that a trace dashboard can't:

Is my agent or model quality good or bad right now?
What broke, and where?
What should I do to fix it?

The page lives at /w/<workspace>/traces/evaluations and is organized around six tabs: Overview, Failures, Runs, Eval sets, Configs, and Raw scores.

If you're new to scoring itself, start with Scoring and Evaluation -- that doc covers the underlying primitives (scores, score configs, sources). This doc covers the higher-level constructs built on top of them.

Concepts at a glance

Concept	What it is
EvalSet	A curated collection of test inputs (with optional expected outputs) used as ground truth
EvalSetItem	A single test case inside an EvalSet -- the input + the expected output + tags + provenance
EvalRun	One execution of an EvalSet against a specific versioned target (agent / prompt / model)
EvalRunItem	A per-test-case result inside an EvalRun -- actual output, scores, cost, duration, trace link
JudgeConfig	An LLM-as-judge evaluator: judge model, prompt template, sampling policy, optional cost cap
Annotation	A human reviewer's scoring task. Drives the reviewer queue and feeds judge calibration
FailureCluster	An embedding-based grouping of failing scores per (agent, metric) -- review K clusters, not N raw failures
QualitySLO	An SRE-style quality target: agent + metric + target value + rolling window
QualityGate	A pre-flight check that can block prompt or agent activation when a quality condition is not met

EvalSet was chosen over "Dataset" to avoid collision with the MCP datasource tool type. The naming is intentional and consistent across the API, SDK, and UI.

Workflow

The expected flow once a workspace has some traces:

Add a judge. Open Configs and start from one of the five built-in templates (Accuracy, Safety, Format Compliance, Tone, Tool Selection). Pick a judge model, a sampling policy, and an optional cost cap. The judge's scores contribute to your quality metrics. Automatic sampling does not fire yet (see Known limits), so today a judge runs from an eval run or a direct API call.
Promote real traces into an eval set. From the Failures tab, every row has an "Add to eval set…" dropdown that copies the trace's root input + output into a new EvalSetItem with the trace ID stamped as provenance. This is the cheapest way to build a regression test set -- your worst real production responses become your golden examples.
Trigger a run. When you ship a prompt or model change, dispatch an EvalRun against the new version with the same EvalSet + judges. Items are pre-allocated as pending, the worker executes them with bounded concurrency, and the dashboard streams progress.
Compare against a baseline. In the Runs tab, every row has a Compare button that opens a side panel with wins / losses / ties per metric, cost delta, and duration delta. Comparing two runs of the same eval set is meaningful (the same test cases line up across both runs); comparing runs of different sets is not, and the picker hides those.
Gate the deploy. Configure a Quality Gate on prompt or agent activation. From that point on, promoting a new version is blocked when the gate's condition fails.

Overview tab

The Overview tab is the first thing you see when you open the Evaluations page. Three blocks are worth knowing:

Quality Health -- a verdict cell (Improving / Degrading / Stable / Not enough data) computed from a 7-day window vs the previous 7. Below 10 scored items in the current window the verdict deliberately reads "Not enough data" -- noisy samples don't get a confident answer.
Regressing metrics -- only renders when at least one metric dropped by >=3 percentage points. Each row shows the previous → current pass rate, color-coded.
Tool quality -- per-MCP-tool success rate and average duration. Coloring flags tools below 80% in red.
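
The verdict and regression rules above can be sketched in a few lines. This is a minimal illustration, not the product's implementation: the 10-item floor and the 3-point regression threshold come from the text, while the `Window` type and the threshold used to separate Improving / Degrading from Stable (`VERDICT_POINTS`) are assumptions made for the example.

```python
from dataclasses import dataclass

MIN_SCORED_ITEMS = 10     # from the text: below this the verdict is withheld
REGRESSION_POINTS = 3.0   # from the text: drop that makes a metric "regressing"
VERDICT_POINTS = 3.0      # assumption: reuse the same threshold for the verdict


@dataclass
class Window:
    """Scored items in one 7-day window (hypothetical helper type)."""
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        # Pass rate in percent; an empty window reads as 0.
        return 100.0 * self.passed / self.total if self.total else 0.0


def quality_verdict(current: Window, previous: Window) -> str:
    """Compare the current window against the previous one."""
    if current.total < MIN_SCORED_ITEMS:
        return "Not enough data"
    if previous.total == 0:
        # Assumption: with nothing to compare against, stay non-committal.
        return "Not enough data"
    delta = current.pass_rate - previous.pass_rate
    if delta >= VERDICT_POINTS:
        return "Improving"
    if delta <= -VERDICT_POINTS:
        return "Degrading"
    return "Stable"


def is_regressing(current: Window, previous: Window) -> bool:
    """True when the pass rate dropped by at least REGRESSION_POINTS."""
    return previous.pass_rate - current.pass_rate >= REGRESSION_POINTS
```

For example, `quality_verdict(Window(5, 9), Window(80, 100))` yields "Not enough data" regardless of how the two pass rates compare, because the current window holds fewer than 10 scored items.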

Failures tab

A triage surface for failed scores. A score counts as a failure when its boolean value is false or its numeric value is below 0.5. Three sections are stacked top to bottom:

Failure clusters -- failures grouped by embedding similarity per (agent, metric). Each expandable row shows the auto-label, member count, last-seen-ago, and root-cause hints. A hint reads like "76% of failures here share tool_call = search_orders, 3.2× above the workspace baseline (52 samples)".
Triage table -- the raw failure list with filters for metric, agent, source. Each row has an "Add to eval set" dropdown.
Annotation queue -- reviewer-facing task list with status (pending / in_progress / completed / skipped). The sidebar shows Cohen's kappa across reviewers on the most-annotated metric, with a banding hint underneath ("Substantial agreement", "Moderate -- rubric could be tighter", etc.).
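
Two pieces of this tab are easy to reproduce offline: the failure predicate and the agreement statistic. The sketch below assumes two reviewers labelling the same items in the same order, and uses the standard Cohen's kappa formula. The banding thresholds follow the commonly cited Landis and Koch scale; whether the product uses exactly these cut-offs is an assumption.

```python
from collections import Counter


def is_failure(value) -> bool:
    """Failure rule from the text: bool false, or numeric below 0.5."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return value is False
    return float(value) < 0.5


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa for two raters over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: share of items where the two labels match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def kappa_band(kappa: float) -> str:
    """Landis-Koch style bands (assumed, not confirmed by the product docs)."""
    if kappa > 0.80:
        return "Almost perfect"
    if kappa > 0.60:
        return "Substantial"
    if kappa > 0.40:
        return "Moderate"
    if kappa > 0.20:
        return "Fair"
    return "Slight or none"
```

With reviewer A labelling `[True, True, False, False]` and reviewer B labelling `[True, False, False, False]`, observed agreement is 0.75 and chance agreement is 0.5, so kappa is 0.5, which lands in the "Moderate" band.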

Runs tab

Eval run history with the A/B Compare drawer. Click Compare on any row and pick a baseline from the same EvalSet; the drawer shows:

Wins / Ties / Baseline wins (per-item, per-metric counted independently)
Aggregate cost delta and duration delta
A per-metric table with current vs baseline average and a Δ column
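
The drawer's numbers can be approximated from two runs' per-item scores. The sketch below is illustrative only: it assumes each run is available as a mapping of item ID to metric scores, that higher scores are better for every metric, and that only (item, metric) pairs present in both runs are counted. The function name and data shape are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def compare_runs(current: dict, baseline: dict, tolerance: float = 1e-9):
    """Tally wins / ties / baseline wins per (item, metric) pair.

    Both arguments map item_id -> {metric_name: numeric_score}.
    Returns (tally, per_metric_table).
    """
    tally = {"wins": 0, "ties": 0, "baseline_wins": 0}
    per_metric = defaultdict(lambda: ([], []))  # metric -> (current, baseline)

    for item_id in sorted(current.keys() & baseline.keys()):
        cur_scores, base_scores = current[item_id], baseline[item_id]
        for metric in sorted(cur_scores.keys() & base_scores.keys()):
            cur, base = cur_scores[metric], base_scores[metric]
            if abs(cur - base) <= tolerance:
                tally["ties"] += 1
            elif cur > base:
                tally["wins"] += 1
            else:
                tally["baseline_wins"] += 1
            per_metric[metric][0].append(cur)
            per_metric[metric][1].append(base)

    table = {
        metric: {
            "current": mean(cur_values),
            "baseline": mean(base_values),
            "delta": mean(cur_values) - mean(base_values),
        }
        for metric, (cur_values, base_values) in per_metric.items()
    }
    return tally, table
```

Counting each metric independently means a single item can contribute a win on one metric and a baseline win on another, which matches the "per-item, per-metric" wording above.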

Cancelled runs render with a strikethrough so they don't blend into pending runs at a glance.

Eval sets tab

The list view for golden test sets. The most common entry point is the "Promote a failing trace" button on the empty state -- it deep-links to the Failures tab so you build the set from real production failures rather than imagining test inputs.

Items inside an EvalSet carry:

input (JSON, whatever your agent expects)
expected_output (JSON, optional)
expected_tools (string array, e.g. ["search_orders", "format_response"])
metadata (JSON, free-form -- useful for tagging)
source_trace_id (for items promoted from a trace)
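
As a rough picture of what one item looks like in code, here is a sketch using a dataclass. The field names come from the list above; the class itself, the defaults, and the sample values (question text, tag, trace ID) are made up for illustration and are not taken from the SDK.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class EvalSetItem:
    """Illustrative container mirroring the documented item fields."""
    input: Any                                   # whatever the agent expects
    expected_output: Optional[Any] = None        # optional ground truth
    expected_tools: List[str] = field(default_factory=list)
    metadata: Dict[str, Any] = field(default_factory=dict)
    source_trace_id: Optional[str] = None        # set when promoted from a trace


# Hypothetical item promoted from a failing production trace.
item = EvalSetItem(
    input={"question": "Where is my order?"},
    expected_output={"answer_contains": "tracking"},
    expected_tools=["search_orders", "format_response"],
    metadata={"tags": ["billing"]},
    source_trace_id="trace-example-id",
)

payload = json.dumps(asdict(item), indent=2)
```

Keeping `expected_output` optional reflects that some items only assert on tool choice or are scored purely by a judge.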

Configs tab

Four sections stacked:

Judge wizard

"Start from template → pick a score config → pick a judge model → choose a sampling policy → optional daily cost cap → create".

Built-in templates (GET /api/v1/judge-configs/templates):

Key	Default sampling	Description
`accuracy`	`percent`	Does the output factually answer the input?
`safety`	`all`	Does the output avoid unsafe / harmful content?
`format_compliance`	`percent`	Did the output follow the requested format?
`tone`	`percent`	Does the output match the expected tone?
`tool_selection`	`all`	Did the agent pick the right tools?

Templates are cloned at create time -- editing a template later does not affect existing judges (so we never silently mutate live evaluators).
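
For scripting, the templates endpoint can be called with any HTTP client. The sketch below only builds the request with the standard library; the path is the one given above, while the base URL and the authentication headers are placeholders you would need to take from the REST API Reference.

```python
from urllib.request import Request

TEMPLATES_PATH = "/api/v1/judge-configs/templates"  # path given in the text


def build_templates_request(base_url: str, headers: dict) -> Request:
    """Build (but do not send) a GET request for the built-in templates.

    base_url and headers are supplied by the caller; this sketch makes no
    assumption about the host or the authentication scheme.
    """
    url = base_url.rstrip("/") + TEMPLATES_PATH
    return Request(url, headers=headers, method="GET")


# Placeholder host, for illustration only.
request = build_templates_request("https://example.invalid/", headers={})
```

Passing the result to `urllib.request.urlopen` would send it; the response shape is not described here, so consult the API reference before parsing it.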

Sampling types:

Type	Behavior
`all`	Score every eligible trace
`percent`	Random N% sample (cost-aware)
`tag`	Only traces matching a tag
`feedback`	Triggered by negative user feedback
`on_demand`	Manual / eval-run only -- never auto-fires

Caveat: sampling_type and cost_cap_daily are stored and validated, but auto-triggering and cap enforcement are scheduled for a follow-up release. Today the meaningful sampling mode is on_demand (triggered from an eval run or by direct API call).
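
To make the intended semantics of the five sampling types concrete, here is a sketch of the decision each one describes. It models the table above, not current product behavior (per the caveat, nothing auto-fires today). The trace shape (`tags`, `feedback`) and the `percent` / `tag` parameters are hypothetical names chosen for the example.

```python
import random


def should_auto_score(sampling_type: str, trace: dict, *,
                      percent: float = 0.0, tag: str = None,
                      rng=random) -> bool:
    """Decide whether a judge would auto-score a trace under a policy."""
    if sampling_type == "all":
        return True
    if sampling_type == "percent":
        # rng.random() is in [0, 1), so percent=100 always samples.
        return rng.random() * 100.0 < percent
    if sampling_type == "tag":
        return tag is not None and tag in trace.get("tags", [])
    if sampling_type == "feedback":
        return trace.get("feedback") == "negative"
    if sampling_type == "on_demand":
        return False  # manual / eval-run only, never auto-fires
    raise ValueError(f"unknown sampling type: {sampling_type!r}")
```

Raising on unknown types rather than defaulting to "score nothing" keeps a typo in the policy from silently disabling a judge.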

Quality SLOs

SRE-style quality targets bound to (agent, metric, target_value, window_seconds). Example: "billing-agent accuracy ≥ 92% over 7 days".

Each row expands into a burndown bar: actual pass rate, target, sample count, and the budget remaining. When the actual pass rate falls below target the row shows a "breached" badge.
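
The burndown numbers follow the usual SRE error-budget arithmetic. The sketch below assumes the budget is the number of failures the target allows within the window, and that "budget remaining" is the unused fraction of that allowance; the product may compute it differently. Names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class SloStatus:
    pass_rate: float         # actual pass rate, 0..1
    target: float            # target pass rate, 0..1
    samples: int             # scored items in the window
    budget_remaining: float  # unused share of the error budget, 0..1
    breached: bool


def slo_status(passed: int, total: int, target: float) -> SloStatus:
    """Evaluate one SLO over the scores collected in its rolling window."""
    if total <= 0:
        raise ValueError("no scored items in the window")
    pass_rate = passed / total
    allowed_failures = (1.0 - target) * total   # the error budget
    actual_failures = total - passed
    if allowed_failures > 0:
        remaining = max(0.0, 1.0 - actual_failures / allowed_failures)
    else:
        # A 100% target leaves no budget: any failure exhausts it.
        remaining = 1.0 if actual_failures == 0 else 0.0
    return SloStatus(pass_rate, target, total, remaining,
                     breached=pass_rate < target)
```

Taking the "billing-agent accuracy ≥ 92%" example: with 100 scored items and 96 passes, the target allows 8 failures and 4 have occurred, so about half the budget remains and the SLO is not breached.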

Quality Gates

Quality gates block privileged actions when conditions aren't met. See the dedicated section below.

Scheduled runs

Cron-driven EvalRun definitions. Cron expressions are parsed at create time (5-field standard plus @hourly / @daily / @weekly descriptors). Common cron forms are humanized next to the raw expression so non-cron-natives can read them at a glance.

Caveat: Schedule rows can be created and next_run_at is computed, but the periodic dispatcher that actually fires them ships in a follow-up. Until then schedules are a recipe, not an alarm.
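
A rough sketch of the create-time handling described above: expand the three descriptors to their standard 5-field equivalents, reject anything that is not 5 fields, and humanize a few common shapes. It does not validate field ranges or compute next_run_at, and the humanized wording is invented for the example.

```python
# Standard cron equivalents of the supported descriptors.
DESCRIPTORS = {
    "@hourly": "0 * * * *",
    "@daily": "0 0 * * *",
    "@weekly": "0 0 * * 0",
}
DAYS = ["Sunday", "Monday", "Tuesday", "Wednesday",
        "Thursday", "Friday", "Saturday"]


def normalize_cron(expr: str) -> str:
    """Return a 5-field expression or raise ValueError."""
    expr = expr.strip()
    if expr.startswith("@"):
        if expr not in DESCRIPTORS:
            raise ValueError(f"unsupported descriptor: {expr}")
        return DESCRIPTORS[expr]
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    return " ".join(fields)


def humanize(expr: str) -> str:
    """Readable form for common shapes; falls back to the raw expression."""
    normalized = normalize_cron(expr)
    minute, hour, dom, month, dow = normalized.split()
    any_date = dom == "*" and month == "*"
    if minute.isdigit() and any_date:
        if hour == "*" and dow == "*":
            return f"Every hour at minute {int(minute)}"
        if hour.isdigit():
            at = f"{int(hour):02d}:{int(minute):02d}"
            if dow == "*":
                return f"Every day at {at}"
            if dow.isdigit():
                return f"Every {DAYS[int(dow) % 7]} at {at}"
    return normalized
```

Falling back to the raw expression for anything unrecognized keeps the humanizer from mislabelling schedules it does not understand.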

Quality Gates

A quality gate is an opt-in pre-flight check that can refuse an activation when quality isn't where you want it. Today you can attach a gate to two actions:

Prompt activation -- promoting a prompt version to current
Agent activation -- promoting an agent version to current

Each gate evaluates one of three conditions:

Run pass rate -- the latest eval run for this target must have a pass rate at or above your threshold
No regression -- the latest eval run must not have lost ground against a baseline run you nominated
SLO not breached -- a Quality SLO you tie the gate to must currently be inside its error budget

Hard gates vs soft gates

When you create a gate, you decide how strict it is:

Hard gate. A failing gate always blocks the activation. The only way through is to fix the underlying problem.
Soft gate. A failing gate blocks by default, but the caller can override with a reason. The override is recorded in the audit trail so reviewers can see who shipped past the warning and why.

Failure response

A blocked promote returns 412 Precondition Failed with a structured body the SDKs and UI render automatically: a list of the gates that failed, each with its name and the reason, plus whether an override is allowed and the header to use when you decide to use it.
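
If you call the activation endpoints from your own tooling, you will want to handle the 412 explicitly. The sketch below shows one way to turn such a response into a readable message. The status code is from the text; every field name in the body (`failed_gates`, `name`, `reason`, `override_allowed`, `override_header`) and the sample header value are hypothetical stand-ins, so check the REST API Reference for the real shape.

```python
def summarize_blocked_promote(status_code: int, body: dict) -> str:
    """Render a gate failure for logs or CI output.

    The body keys used here are assumptions for illustration only.
    """
    if status_code != 412:
        return "not blocked by a quality gate"
    lines = [
        f"{gate['name']}: {gate['reason']}"
        for gate in body.get("failed_gates", [])
    ]
    if body.get("override_allowed"):
        lines.append(f"override possible via header {body.get('override_header')}")
    else:
        lines.append("no override allowed")
    return "\n".join(lines)


# Hypothetical response body for a failing soft gate.
example_body = {
    "failed_gates": [{"name": "accuracy-floor", "reason": "pass rate below threshold"}],
    "override_allowed": True,
    "override_header": "X-Example-Override",  # placeholder, not the real header
}
message = summarize_blocked_promote(412, example_body)
```

In a CI pipeline, printing this summary and exiting non-zero keeps the gate's reason visible next to the failed deploy step.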

Defensive behavior

Two operator-friendly defaults you should know:

If the evaluation system itself is unavailable, activations are not blocked. A database outage or a transient error in the gate service won't lock you out of shipping. The rule is "evaluate when possible, don't break the deploy path when the eval system is down."
Gates referencing missing data degrade per their override setting. If a gate points at an SLO you've since deleted, a hard gate still blocks (it errs safe), and a soft gate lets the action through.
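
The hard / soft rules and the two defensive defaults combine into a small decision table. The sketch below encodes that table as described in this section; the type names and the idea of a single `Outcome` enum are illustrative, not the service's actual model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    PASS = "pass"                  # condition evaluated and met
    FAIL = "fail"                  # condition evaluated and not met
    MISSING_DATA = "missing_data"  # e.g. the referenced SLO was deleted
    UNAVAILABLE = "unavailable"    # evaluation system could not be reached


@dataclass
class Gate:
    name: str
    hard: bool  # True = hard gate, False = soft gate


def activation_allowed(gate: Gate, outcome: Outcome,
                       override_reason: Optional[str] = None) -> bool:
    """Apply the rules from this section to one gate."""
    if outcome is Outcome.UNAVAILABLE:
        return True  # fail open: never break the deploy path
    if outcome is Outcome.PASS:
        return True
    if outcome is Outcome.MISSING_DATA:
        return not gate.hard  # hard gate errs safe, soft gate lets it through
    # Outcome.FAIL: a hard gate always blocks; a soft gate needs a reason.
    if gate.hard:
        return False
    return bool(override_reason and override_reason.strip())
```

Note the asymmetry: an unavailable evaluation system lets even a hard gate through, while missing data does not. The first protects the deploy path from outages, the second protects it from misconfiguration.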

Multi-judge consensus

For high-stakes metrics, you can score one trace with multiple judges and take a consensus.

Consensus runs every configured judge on the same trace in parallel and writes a single score (source = consensus) based on the median of the individual judge values -- resilient to one outlier judge. When the judges disagree by more than 25 percentage points, the result is flagged as a disagreement so a human reviewer can take a second look.

The median (not the mean) is deliberate. With a mean, one rogue judge shifts the consensus in proportion to how far off it is; with a median, the result only moves substantially when roughly half of the judges move with it.
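
A minimal sketch of the consensus step, assuming judge scores on a 0 to 1 scale and measuring disagreement as the spread between the highest and lowest judge (the text gives the 25-point threshold but not the exact spread definition, so that part is an assumption). The parallel execution of judges is not modelled here.

```python
from statistics import median

DISAGREEMENT_THRESHOLD = 0.25  # 25 percentage points on a 0..1 scale


def consensus(judge_scores: dict) -> dict:
    """Combine per-judge scores for one trace into a consensus score.

    judge_scores maps a judge name to its numeric score.
    """
    if not judge_scores:
        raise ValueError("at least one judge score is required")
    values = list(judge_scores.values())
    spread = max(values) - min(values)  # assumed disagreement measure
    return {
        "value": median(values),
        "source": "consensus",
        "disagreement": spread > DISAGREEMENT_THRESHOLD,
    }


# One outlier judge barely moves the median but trips the disagreement flag.
result = consensus({"judge_a": 0.9, "judge_b": 0.85, "judge_c": 0.1})
```

Here the median is 0.85, whereas the mean of the same three scores would be about 0.62, which illustrates why a single outlier is less harmful under the median.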

Programmatic access

Everything you see in the UI is also available over the API. Eval sets, runs, judge configs, annotations, failure clusters, SLOs, scheduled runs, and quality gates all have full CRUD endpoints with read and write scopes for API keys. See the REST API Reference for request and response shapes.

Known limits

A few capabilities are visible in the UI but not yet fully active. They are listed here so you know what to expect:

Automatic judge sampling. Judges run on-demand (from an eval run or via the API). Auto-sampling modes -- all, percent, tag, and on negative feedback -- can be configured today but do not auto-fire yet.
Daily cost caps for judges. The cap value is stored on each judge, but it is not yet enforced; budget yourself accordingly.
Scheduled runs. You can create a cron schedule and the next-run time is calculated, but the dispatcher that actually fires those runs is on the way.
Root-cause hints. The clustering worker groups failures correctly today; the statistical pattern miner that fills the "76% of failures here share X" hints ships in a follow-up.
Consensus disagreement queue. When judges disagree by more than 25 percentage points, the disagreement is logged. Automatic reviewer assignment for those cases is on the roadmap.

Related topics

Scoring and Evaluation -- The underlying score / score-config primitives this surface is built on
Tracing -- The traces evaluations score against
Executions -- What an eval run replays
REST API Reference -- Full request / response shapes
Agent Versioning, Prompt Versioning -- The activation surfaces that consult quality gates
