Scoring and Evaluation
Score and evaluate agent executions with numeric, categorical, and boolean metrics using manual, API, or LLM judge scoring.
PromptRails includes a scoring system for evaluating agent performance. You can score executions and individual spans with numeric, categorical, or boolean metrics using manual annotation, API-based automation, or LLM judge scoring.
Overview
Scores provide structured feedback on agent outputs. They enable:
- Quality monitoring -- Track how well agents perform over time
- A/B testing -- Compare agent versions quantitatively
- Regression detection -- Identify when changes degrade performance
- Human evaluation -- Collect manual ratings from reviewers
- Automated evaluation -- Use LLM judges to score outputs at scale
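As an illustration of the A/B-testing use case, the sketch below averages a shared numeric metric across two agent versions. It operates on plain score dicts for clarity; in practice each list would come from `client.scores.list(agent_id=...)`:

```python
# Sketch: compare two agent versions on the same numeric metric.
# In practice, each list would come from client.scores.list(agent_id=...).
def mean_score(scores, name):
    """Average the numeric values of all scores with the given name."""
    values = [s["value"] for s in scores if s["name"] == name]
    return sum(values) / len(values) if values else None

version_a = [{"name": "relevance", "value": 4.0}, {"name": "relevance", "value": 4.5}]
version_b = [{"name": "relevance", "value": 3.0}, {"name": "relevance", "value": 3.5}]

print(mean_score(version_a, "relevance"))  # 4.25
print(mean_score(version_b, "relevance"))  # 3.25
```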
Score Types
Numeric
A score with a numeric value, optionally bounded by min/max values.
client.scores.create(
    trace_id="trace-id",
    name="relevance",
    data_type="numeric",
    value=4.5,
    comment="Response was highly relevant to the query"
)
Categorical
A score from a predefined set of categories.
client.scores.create(
    trace_id="trace-id",
    name="quality",
    data_type="categorical",
    string_value="good",
    comment="Clear and helpful response"
)
Boolean
A binary pass/fail score.
client.scores.create(
    trace_id="trace-id",
    name="factually_correct",
    data_type="boolean",
    bool_value=True,
    comment="All facts were verified"
)
Score Configurations
Score configurations (templates) define reusable scoring schemas that ensure consistency across evaluations.
Creating a Score Config
# Numeric config with range
config = client.scores.create_config(
    name="Response Quality",
    data_type="numeric",
    min_value=1.0,
    max_value=5.0,
    description="Rate the overall quality of the agent's response"
)

# Categorical config
config = client.scores.create_config(
    name="Sentiment Accuracy",
    data_type="categorical",
    categories=["correct", "partially_correct", "incorrect"],
    description="How accurately the agent identified the sentiment"
)

# Boolean config
config = client.scores.create_config(
    name="Contains Hallucination",
    data_type="boolean",
    description="Whether the response contains fabricated information"
)
Using a Config When Scoring
client.scores.create(
    trace_id="trace-id",
    config_id=config["data"]["id"],
    name="Response Quality",
    data_type="numeric",
    value=4.0
)
Score Sources
Each score records how it was generated:
| Source | Identifier | Description |
|---|---|---|
| Manual | manual | Scored by a human reviewer through the UI or API |
| API | api | Scored programmatically via the API |
| LLM Judge | llm_judge | Scored automatically by an LLM evaluator |
LLM Judge Scoring
PromptRails supports automated scoring using LLM judges. An LLM judge evaluates agent outputs against criteria you define:
# Automated scoring is configured through score configs
# and triggered during or after execution
config = client.scores.create_config(
    name="Helpfulness",
    data_type="numeric",
    min_value=1.0,
    max_value=5.0,
    description="How helpful was the response to the user's question?"
)
LLM judges evaluate outputs based on the score config's description and data type, producing consistent, scalable evaluations.
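The general pattern behind an LLM judge can be sketched as follows. This is not PromptRails' internal implementation: the `call_model` stub stands in for whatever model the judge calls, and the prompt shape is purely illustrative. The key ideas are that the config's description becomes the evaluation criterion and the parsed rating is clamped to the config's declared range:

```python
# Sketch of the LLM-judge pattern: build a prompt from the score config,
# ask a model for a rating, and clamp the parsed result into range.
def call_model(prompt):
    """Stand-in for a real LLM call; a real judge returns the model's reply."""
    return "4"

def judge(config, agent_output):
    prompt = (
        f"{config['description']}\n"
        f"Rate the following response from {config['min_value']} "
        f"to {config['max_value']}. Reply with a number only.\n\n{agent_output}"
    )
    raw = float(call_model(prompt))
    # Clamp to the config's declared range in case the model strays outside it.
    return max(config["min_value"], min(config["max_value"], raw))

helpfulness = {
    "description": "How helpful was the response to the user's question?",
    "min_value": 1.0,
    "max_value": 5.0,
}
print(judge(helpfulness, "You can reset your password from the settings page."))  # 4.0
```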
Execution-Level Scoring
Score an entire execution:
client.scores.create(
    trace_id="trace-id",
    execution_id="execution-id",
    agent_id="agent-id",
    name="overall_quality",
    data_type="numeric",
    value=4.0,
    source="manual",
    comment="Good response with minor formatting issues"
)
Span-Level Scoring
Score an individual span within an execution (e.g., a specific LLM call or tool result):
client.scores.create(
    trace_id="trace-id",
    span_id="span-id",
    name="tool_accuracy",
    data_type="boolean",
    bool_value=True,
    source="api",
    comment="Tool returned correct data"
)
Listing Scores
# List scores for a trace
scores = client.scores.list(trace_id="trace-id")

# List scores for an agent
scores = client.scores.list(agent_id="agent-id", page=1, limit=50)

for score in scores["data"]:
    # Pick the value field that matches the score's data type; chaining `or`
    # would misreport falsy values such as 0.0 or False.
    key = {"numeric": "value", "categorical": "string_value", "boolean": "bool_value"}[score["data_type"]]
    print(f"{score['name']}: {score[key]}")
Score Fields
| Field | Type | Description |
|---|---|---|
| id | KSUID | Unique score identifier |
| workspace_id | KSUID | Workspace scope |
| trace_id | string | Associated trace |
| span_id | string | Optional specific span |
| name | string | Score name |
| value | float | Numeric value (for numeric type) |
| string_value | string | Category value (for categorical type) |
| bool_value | boolean | Boolean value (for boolean type) |
| data_type | string | numeric, categorical, or boolean |
| comment | string | Optional comment or explanation |
| source | string | manual, api, or llm_judge |
| config_id | KSUID | Optional score config reference |
| execution_id | KSUID | Optional execution reference |
| agent_id | KSUID | Optional agent reference |
| created_by_id | KSUID | User who created the score |
| created_at | timestamp | Creation time |
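Putting the fields together, a returned score record might look like the dict below. All values are made-up placeholders (in the same "trace-id" style the examples above use), and only the value field matching data_type is populated:

```python
# Illustrative score record; every value is a made-up placeholder.
score = {
    "id": "score-id",
    "workspace_id": "workspace-id",
    "trace_id": "trace-id",
    "span_id": None,              # not a span-level score
    "name": "overall_quality",
    "value": 4.0,                 # numeric type: only `value` is set
    "string_value": None,         # unused for numeric scores
    "bool_value": None,           # unused for numeric scores
    "data_type": "numeric",
    "comment": "Good response with minor formatting issues",
    "source": "manual",
    "config_id": None,
    "execution_id": "execution-id",
    "agent_id": "agent-id",
    "created_by_id": "user-id",
    "created_at": "2024-01-01T00:00:00Z",
}
```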
Best Practices
- Define score configs early -- Create standardized scoring templates before starting evaluation
- Score consistently -- Use the same configs across agent versions for meaningful comparisons
- Combine sources -- Use manual scoring for calibration and LLM judges for scale
- Score at the right level -- Use execution-level scores for overall quality and span-level scores for component accuracy
- Track over time -- Monitor score trends to detect regressions
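As a sketch of the "combine sources" idea, calibration can be checked by measuring how closely LLM-judge scores track manual scores on the same traces. The sketch below computes the mean absolute difference over plain dicts; real data would come from `client.scores.list`:

```python
# Sketch: check how closely LLM-judge scores track manual scores
# on the same traces. Real data would come from client.scores.list().
def agreement(scores, name):
    """Mean absolute difference between manual and llm_judge values per trace."""
    by_trace = {}
    for s in scores:
        if s["name"] == name:
            by_trace.setdefault(s["trace_id"], {})[s["source"]] = s["value"]
    diffs = [
        abs(pair["manual"] - pair["llm_judge"])
        for pair in by_trace.values()
        if "manual" in pair and "llm_judge" in pair
    ]
    return sum(diffs) / len(diffs) if diffs else None

scores = [
    {"trace_id": "t1", "name": "helpfulness", "source": "manual", "value": 4.0},
    {"trace_id": "t1", "name": "helpfulness", "source": "llm_judge", "value": 4.5},
    {"trace_id": "t2", "name": "helpfulness", "source": "manual", "value": 3.0},
    {"trace_id": "t2", "name": "helpfulness", "source": "llm_judge", "value": 3.0},
]
print(agreement(scores, "helpfulness"))  # 0.25
```

A low mean difference suggests the judge can be trusted at scale; a high one suggests the config's description needs tightening before relying on automated scores.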
Related Topics
- Executions -- What gets scored
- Tracing -- Spans that can be individually scored
- Cost Tracking -- Correlate quality with cost