Scoring and Evaluation

PromptRails includes a scoring system for evaluating agent performance. You can score executions and individual spans with numeric, categorical, or boolean metrics using manual annotation, API-based automation, or LLM judge scoring.

Overview

Scores provide structured feedback on agent outputs. They enable:

  • Quality monitoring -- Track how well agents perform over time
  • A/B testing -- Compare agent versions quantitatively
  • Regression detection -- Identify when changes degrade performance
  • Human evaluation -- Collect manual ratings from reviewers
  • Automated evaluation -- Use LLM judges to score outputs at scale

Score Types

Numeric

A score with a numeric value, optionally bounded by min/max values.

client.scores.create(
    trace_id="trace-id",
    name="relevance",
    data_type="numeric",
    value=4.5,
    comment="Response was highly relevant to the query"
)

Categorical

A score from a predefined set of categories.

client.scores.create(
    trace_id="trace-id",
    name="quality",
    data_type="categorical",
    string_value="good",
    comment="Clear and helpful response"
)

Boolean

A binary pass/fail score.

client.scores.create(
    trace_id="trace-id",
    name="factually_correct",
    data_type="boolean",
    bool_value=True,
    comment="All facts were verified"
)
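Each score type carries its value in a different field: `value` for numeric, `string_value` for categorical, and `bool_value` for boolean. The helper below builds the matching keyword arguments for `client.scores.create`; it is an illustrative sketch, not part of the PromptRails SDK.

```python
def score_payload(name, data_type, value, comment=None):
    """Map a score value onto the field that matches its data type.

    Hypothetical convenience helper -- not part of the PromptRails SDK.
    """
    fields = {"numeric": "value", "categorical": "string_value", "boolean": "bool_value"}
    if data_type not in fields:
        raise ValueError(f"unknown data_type: {data_type!r}")
    payload = {"name": name, "data_type": data_type, fields[data_type]: value}
    if comment is not None:
        payload["comment"] = comment
    return payload

# Usage: client.scores.create(trace_id="trace-id", **score_payload("relevance", "numeric", 4.5))
```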

Score Configurations

Score configurations (templates) define reusable scoring schemas that ensure consistency across evaluations.

Creating a Score Config

# Numeric config with range
config = client.scores.create_config(
    name="Response Quality",
    data_type="numeric",
    min_value=1.0,
    max_value=5.0,
    description="Rate the overall quality of the agent's response"
)
 
# Categorical config
config = client.scores.create_config(
    name="Sentiment Accuracy",
    data_type="categorical",
    categories=["correct", "partially_correct", "incorrect"],
    description="How accurately the agent identified the sentiment"
)
 
# Boolean config
config = client.scores.create_config(
    name="Contains Hallucination",
    data_type="boolean",
    description="Whether the response contains fabricated information"
)

Using a Config When Scoring

client.scores.create(
    trace_id="trace-id",
    config_id=config["data"]["id"],
    name="Response Quality",
    data_type="numeric",
    value=4.0
)
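Because configs define the allowed range or categories, it can be useful to validate a value client-side before submitting it. The sketch below assumes the config dict exposes the same fields used when creating it (`data_type`, `min_value`, `max_value`, `categories`); adjust to the actual API response shape.

```python
def validate_against_config(config_data, value):
    """Check a candidate score value against a score config's constraints.

    The config dict shape is an assumption based on the fields passed to
    create_config; this is not an official SDK helper.
    """
    dt = config_data["data_type"]
    if dt == "numeric":
        lo, hi = config_data.get("min_value"), config_data.get("max_value")
        if lo is not None and value < lo:
            raise ValueError(f"{value} is below min_value {lo}")
        if hi is not None and value > hi:
            raise ValueError(f"{value} is above max_value {hi}")
    elif dt == "categorical":
        cats = config_data.get("categories", [])
        if value not in cats:
            raise ValueError(f"{value!r} is not one of {cats}")
    elif dt == "boolean" and not isinstance(value, bool):
        raise ValueError("boolean scores require True or False")
    return value
```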

Score Sources

Each score records how it was generated:

Source       Identifier   Description
Manual       manual       Scored by a human reviewer through the UI or API
API          api          Scored programmatically via the API
LLM Judge    llm_judge    Scored automatically by an LLM evaluator

LLM Judge Scoring

PromptRails supports automated scoring using LLM judges. An LLM judge evaluates agent outputs against criteria you define:

# Automated scoring is configured through score configs
# and triggered during or after execution
config = client.scores.create_config(
    name="Helpfulness",
    data_type="numeric",
    min_value=1.0,
    max_value=5.0,
    description="How helpful was the response to the user's question?"
)

LLM judges evaluate outputs based on the score config's description and data type, producing consistent, scalable evaluations.
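PromptRails runs the judge internally, but to make the mechanism concrete, here is a sketch of one step such a judge needs: extracting a numeric score from the model's free-form reply and clamping it to the config's range. The function name and parsing strategy are illustrative assumptions, not the actual PromptRails implementation.

```python
import re

def extract_judge_score(judge_reply, min_value, max_value):
    """Pull the first number from an LLM judge's reply and clamp it to
    the score config's [min_value, max_value] range.

    Illustrative sketch of judge-output post-processing.
    """
    match = re.search(r"-?\d+(?:\.\d+)?", judge_reply)
    if match is None:
        raise ValueError(f"no numeric score found in judge reply: {judge_reply!r}")
    score = float(match.group())
    return max(min_value, min(score, max_value))
```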

Execution-Level Scoring

Score an entire execution:

client.scores.create(
    trace_id="trace-id",
    execution_id="execution-id",
    agent_id="agent-id",
    name="overall_quality",
    data_type="numeric",
    value=4.0,
    source="manual",
    comment="Good response with minor formatting issues"
)

Span-Level Scoring

Score an individual span within an execution (e.g., a specific LLM call or tool result):

client.scores.create(
    trace_id="trace-id",
    span_id="span-id",
    name="tool_accuracy",
    data_type="boolean",
    bool_value=True,
    source="api",
    comment="Tool returned correct data"
)

Listing Scores

# List scores for a trace
scores = client.scores.list(trace_id="trace-id")
 
# List scores for an agent
scores = client.scores.list(agent_id="agent-id", page=1, limit=50)
 
for score in scores["data"]:
    # Check each value field explicitly so 0 and False are not skipped
    value = next(
        (score[k] for k in ("value", "string_value", "bool_value") if score.get(k) is not None),
        None,
    )
    print(f"{score['name']}: {value}")
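Listed scores can be aggregated for monitoring. The sketch below averages numeric scores by name; it assumes the list response contains dicts with the `name`, `data_type`, and `value` fields described in this page.

```python
from collections import defaultdict

def mean_numeric_scores(scores):
    """Average numeric score values grouped by score name.

    Assumes each score is a dict shaped like the list response
    (name, data_type, value); non-numeric scores are ignored.
    """
    grouped = defaultdict(list)
    for s in scores:
        if s.get("data_type") == "numeric" and s.get("value") is not None:
            grouped[s["name"]].append(s["value"])
    return {name: sum(vs) / len(vs) for name, vs in grouped.items()}

# Usage: mean_numeric_scores(scores["data"])
```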

Score Fields

Field            Type         Description
id               KSUID        Unique score identifier
workspace_id     KSUID        Workspace scope
trace_id         string       Associated trace
span_id          string       Optional specific span
name             string       Score name
value            float        Numeric value (for numeric type)
string_value     string       Category value (for categorical type)
bool_value       boolean      Boolean value (for boolean type)
data_type        string       numeric, categorical, or boolean
comment          string       Optional comment or explanation
source           string       manual, api, or llm_judge
config_id        KSUID        Optional score config reference
execution_id     KSUID        Optional execution reference
agent_id         KSUID        Optional agent reference
created_by_id    KSUID        User who created the score
created_at       timestamp    Creation time

Best Practices

  • Define score configs early -- Create standardized scoring templates before starting evaluation
  • Score consistently -- Use the same configs across agent versions for meaningful comparisons
  • Combine sources -- Use manual scoring for calibration and LLM judges for scale
  • Score at the right level -- Use execution-level scores for overall quality and span-level scores for component accuracy
  • Track over time -- Monitor score trends to detect regressions
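The regression-detection practice above can be sketched as a simple comparison of mean scores between a baseline and a candidate agent version. The threshold logic here is a deliberately minimal illustration; a production check would also account for sample size and statistical significance.

```python
def detect_regression(baseline_values, candidate_values, tolerance=0.25):
    """Flag a regression when the candidate's mean score drops more than
    `tolerance` below the baseline's mean.

    Minimal sketch: no significance testing or sample-size checks.
    """
    if not baseline_values or not candidate_values:
        raise ValueError("both score lists must be non-empty")
    baseline_mean = sum(baseline_values) / len(baseline_values)
    candidate_mean = sum(candidate_values) / len(candidate_values)
    return (baseline_mean - candidate_mean) > tolerance
```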