Scoring and Evaluation

PromptRails includes a scoring system for evaluating agent performance. You can score executions and individual spans with numeric, categorical, or boolean metrics using manual annotation, API-based automation, or LLM judge scoring.

Overview

Scores provide structured feedback on agent outputs. They enable:

  • Quality monitoring -- Track how well agents perform over time
  • A/B testing -- Compare agent versions quantitatively
  • Regression detection -- Identify when changes degrade performance
  • Human evaluation -- Collect manual ratings from reviewers
  • Automated evaluation -- Use LLM judges to score outputs at scale

Score Types

Numeric

A score with a numeric value, optionally bounded by min/max values.

client.scores.create(
    trace_id="trace-id",
    name="relevance",
    data_type="numeric",
    value=4.5,
    comment="Response was highly relevant to the query"
)

Categorical

A score from a predefined set of categories.

client.scores.create(
    trace_id="trace-id",
    name="quality",
    data_type="categorical",
    string_value="good",
    comment="Clear and helpful response"
)

Boolean

A binary pass/fail score.

client.scores.create(
    trace_id="trace-id",
    name="factually_correct",
    data_type="boolean",
    bool_value=True,
    comment="All facts were verified"
)
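Each score type carries its value in a different field: `value` for numeric, `string_value` for categorical, and `bool_value` for boolean. The helper below builds the matching keyword arguments for `client.scores.create`; it is an illustrative sketch, not part of the PromptRails SDK.

```python
def score_payload(name, data_type, value, comment=None):
    """Map a score value onto the field that matches its data type.

    Hypothetical convenience helper -- not part of the PromptRails SDK.
    """
    fields = {"numeric": "value", "categorical": "string_value", "boolean": "bool_value"}
    if data_type not in fields:
        raise ValueError(f"unknown data_type: {data_type!r}")
    payload = {"name": name, "data_type": data_type, fields[data_type]: value}
    if comment is not None:
        payload["comment"] = comment
    return payload

# Usage: client.scores.create(trace_id="trace-id", **score_payload("relevance", "numeric", 4.5))
```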

Score Configurations

Score configurations (templates) define reusable scoring schemas that ensure consistency across evaluations.

Creating a Score Config

# Numeric config with range
config = client.scores.create_config(
    name="Response Quality",
    data_type="numeric",
    min_value=1.0,
    max_value=5.0,
    description="Rate the overall quality of the agent's response"
)
 
# Categorical config
config = client.scores.create_config(
    name="Sentiment Accuracy",
    data_type="categorical",
    categories=["correct", "partially_correct", "incorrect"],
    description="How accurately the agent identified the sentiment"
)
 
# Boolean config
config = client.scores.create_config(
    name="Contains Hallucination",
    data_type="boolean",
    description="Whether the response contains fabricated information"
)

Using a Config When Scoring

client.scores.create(
    trace_id="trace-id",
    config_id=config["data"]["id"],
    name="Response Quality",
    data_type="numeric",
    value=4.0
)
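Because configs define the allowed range or categories, it can be useful to validate a value client-side before submitting it. The sketch below assumes the config dict exposes the same fields used when creating it (`data_type`, `min_value`, `max_value`, `categories`); adjust to the actual API response shape.

```python
def validate_against_config(config_data, value):
    """Check a candidate score value against a score config's constraints.

    The config dict shape is an assumption based on the fields passed to
    create_config; this is not an official SDK helper.
    """
    dt = config_data["data_type"]
    if dt == "numeric":
        lo, hi = config_data.get("min_value"), config_data.get("max_value")
        if lo is not None and value < lo:
            raise ValueError(f"{value} is below min_value {lo}")
        if hi is not None and value > hi:
            raise ValueError(f"{value} is above max_value {hi}")
    elif dt == "categorical":
        cats = config_data.get("categories", [])
        if value not in cats:
            raise ValueError(f"{value!r} is not one of {cats}")
    elif dt == "boolean" and not isinstance(value, bool):
        raise ValueError("boolean scores require True or False")
    return value
```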

Score Sources

Each score records how it was generated:

Source       Identifier   Description
Manual       manual       Scored by a human reviewer through the UI or API
API          api          Scored programmatically via the API
LLM Judge    llm_judge    Scored automatically by an LLM evaluator

LLM Judge Scoring

PromptRails supports automated scoring using LLM judges. An LLM judge evaluates agent outputs against criteria you define:

# Automated scoring is configured through score configs
# and triggered during or after execution
config = client.scores.create_config(
    name="Helpfulness",
    data_type="numeric",
    min_value=1.0,
    max_value=5.0,
    description="How helpful was the response to the user's question?"
)

LLM judges evaluate outputs based on the score config's description and data type, producing consistent, scalable evaluations.
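PromptRails runs the judge internally, but to make the mechanism concrete, here is a sketch of one step such a judge needs: extracting a numeric score from the model's free-form reply and clamping it to the config's range. The function name and parsing strategy are illustrative assumptions, not the actual PromptRails implementation.

```python
import re

def extract_judge_score(judge_reply, min_value, max_value):
    """Pull the first number from an LLM judge's reply and clamp it to
    the score config's [min_value, max_value] range.

    Illustrative sketch of judge-output post-processing.
    """
    match = re.search(r"-?\d+(?:\.\d+)?", judge_reply)
    if match is None:
        raise ValueError(f"no numeric score found in judge reply: {judge_reply!r}")
    score = float(match.group())
    return max(min_value, min(score, max_value))
```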

Execution-Level Scoring

Score an entire execution:

client.scores.create(
    trace_id="trace-id",
    execution_id="execution-id",
    agent_id="agent-id",
    name="overall_quality",
    data_type="numeric",
    value=4.0,
    source="manual",
    comment="Good response with minor formatting issues"
)

Span-Level Scoring

Score an individual span within an execution (e.g., a specific LLM call or tool result):

client.scores.create(
    trace_id="trace-id",
    span_id="span-id",
    name="tool_accuracy",
    data_type="boolean",
    bool_value=True,
    source="api",
    comment="Tool returned correct data"
)

Listing Scores

# List scores for a trace
scores = client.scores.list(trace_id="trace-id")
 
# List scores for an agent
scores = client.scores.list(agent_id="agent-id", page=1, limit=50)
 
for score in scores["data"]:
    # Check each value field explicitly so 0 and False are not skipped
    value = next(
        (score[k] for k in ("value", "string_value", "bool_value") if score.get(k) is not None),
        None,
    )
    print(f"{score['name']}: {value}")
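Listed scores can be aggregated for monitoring. The sketch below averages numeric scores by name; it assumes the list response contains dicts with the `name`, `data_type`, and `value` fields described in this page.

```python
from collections import defaultdict

def mean_numeric_scores(scores):
    """Average numeric score values grouped by score name.

    Assumes each score is a dict shaped like the list response
    (name, data_type, value); non-numeric scores are ignored.
    """
    grouped = defaultdict(list)
    for s in scores:
        if s.get("data_type") == "numeric" and s.get("value") is not None:
            grouped[s["name"]].append(s["value"])
    return {name: sum(vs) / len(vs) for name, vs in grouped.items()}

# Usage: mean_numeric_scores(scores["data"])
```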

Score Fields

Field            Type         Description
id               KSUID        Unique score identifier
workspace_id     KSUID        Workspace scope
trace_id         string       Associated trace
span_id          string       Optional specific span
name             string       Score name
value            float        Numeric value (for numeric type)
string_value     string       Category value (for categorical type)
bool_value       boolean      Boolean value (for boolean type)
data_type        string       numeric, categorical, or boolean
comment          string       Optional comment or explanation
source           string       manual, api, or llm_judge
config_id        KSUID        Optional score config reference
execution_id     KSUID        Optional execution reference
agent_id         KSUID        Optional agent reference
created_by_id    KSUID        User who created the score
created_at       timestamp    Creation time

Best Practices

  • Define score configs early -- Create standardized scoring templates before starting evaluation
  • Score consistently -- Use the same configs across agent versions for meaningful comparisons
  • Combine sources -- Use manual scoring for calibration and LLM judges for scale
  • Score at the right level -- Use execution-level scores for overall quality and span-level scores for component accuracy
  • Track over time -- Monitor score trends to detect regressions
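The regression-detection practice above can be sketched as a simple comparison of mean scores between a baseline and a candidate agent version. The threshold logic here is a deliberately minimal illustration; a production check would also account for sample size and statistical significance.

```python
def detect_regression(baseline_values, candidate_values, tolerance=0.25):
    """Flag a regression when the candidate's mean score drops more than
    `tolerance` below the baseline's mean.

    Minimal sketch: no significance testing or sample-size checks.
    """
    if not baseline_values or not candidate_values:
        raise ValueError("both score lists must be non-empty")
    baseline_mean = sum(baseline_values) / len(baseline_values)
    candidate_mean = sum(candidate_values) / len(candidate_values)
    return (baseline_mean - candidate_mean) > tolerance
```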