# Scoring and Evaluation

> Score and evaluate agent executions with numeric, categorical, and boolean metrics using manual, API, or LLM judge scoring.

PromptRails includes a scoring system for evaluating agent performance. You can score executions and individual spans with numeric, categorical, or boolean metrics using manual annotation, API-based automation, or LLM judge scoring.

## Overview

Scores provide structured feedback on agent outputs. They enable:

- **Quality monitoring** -- Track how well agents perform over time
- **A/B testing** -- Compare agent versions quantitatively
- **Regression detection** -- Identify when changes degrade performance
- **Human evaluation** -- Collect manual ratings from reviewers
- **Automated evaluation** -- Use LLM judges to score outputs at scale
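The A/B-testing use case above boils down to comparing mean scores for the same metric across two agent versions. A minimal sketch, assuming score dicts shaped like the fields returned by `client.scores.list` (the `compare_versions` helper and the sample data are illustrative, not part of the PromptRails API):

```python
from statistics import mean

def compare_versions(scores: list[dict], agent_a: str, agent_b: str,
                     name: str) -> dict:
    """Compare the mean numeric score for two agents on the same metric."""
    def avg(agent_id: str) -> float:
        values = [s["value"] for s in scores
                  if s["agent_id"] == agent_id and s["name"] == name]
        return mean(values) if values else float("nan")
    return {agent_a: avg(agent_a), agent_b: avg(agent_b)}

scores = [
    {"agent_id": "agent-v1", "name": "relevance", "value": 3.5},
    {"agent_id": "agent-v1", "name": "relevance", "value": 4.0},
    {"agent_id": "agent-v2", "name": "relevance", "value": 4.5},
]
print(compare_versions(scores, "agent-v1", "agent-v2", "relevance"))
# {'agent-v1': 3.75, 'agent-v2': 4.5}
```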

## Score Types

### Numeric

A score with a numeric value, optionally bounded by min/max values.

```python
client.scores.create(
    trace_id="trace-id",
    name="relevance",
    data_type="numeric",
    value=4.5,
    comment="Response was highly relevant to the query"
)
```

### Categorical

A score from a predefined set of categories.

```python
client.scores.create(
    trace_id="trace-id",
    name="quality",
    data_type="categorical",
    string_value="good",
    comment="Clear and helpful response"
)
```

### Boolean

A binary pass/fail score.

```python
client.scores.create(
    trace_id="trace-id",
    name="factually_correct",
    data_type="boolean",
    bool_value=True,
    comment="All facts were verified"
)
```

## Score Configurations

Score configurations (templates) define reusable scoring schemas that ensure consistency across evaluations.

### Creating a Score Config

```python
# Numeric config with range
config = client.scores.create_config(
    name="Response Quality",
    data_type="numeric",
    min_value=1.0,
    max_value=5.0,
    description="Rate the overall quality of the agent's response"
)

# Categorical config
config = client.scores.create_config(
    name="Sentiment Accuracy",
    data_type="categorical",
    categories=["correct", "partially_correct", "incorrect"],
    description="How accurately the agent identified the sentiment"
)

# Boolean config
config = client.scores.create_config(
    name="Contains Hallucination",
    data_type="boolean",
    description="Whether the response contains fabricated information"
)
```

### Using a Config When Scoring

```python
client.scores.create(
    trace_id="trace-id",
    config_id=config["data"]["id"],
    name="Response Quality",
    data_type="numeric",
    value=4.0
)
```

## Score Sources

Each score records how it was generated:

| Source        | Identifier  | Description                                      |
| ------------- | ----------- | ------------------------------------------------ |
| **Manual**    | `manual`    | Scored by a human reviewer through the UI or API |
| **API**       | `api`       | Scored programmatically via the API              |
| **LLM Judge** | `llm_judge` | Scored automatically by an LLM evaluator         |
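Mixing sources is useful for calibration: manual scores can serve as a reference for an LLM judge on the same traces. A hedged sketch of one way to measure agreement, assuming score dicts shaped like the fields table below (the `judge_disagreement` helper and the sample data are illustrative):

```python
from statistics import mean

def judge_disagreement(scores: list[dict], name: str) -> float:
    """Mean absolute difference between manual and llm_judge numeric
    scores for the same metric, matched by trace_id."""
    manual = {s["trace_id"]: s["value"] for s in scores
              if s["name"] == name and s["source"] == "manual"}
    judged = {s["trace_id"]: s["value"] for s in scores
              if s["name"] == name and s["source"] == "llm_judge"}
    common = manual.keys() & judged.keys()
    return mean(abs(manual[t] - judged[t]) for t in common)

scores = [
    {"trace_id": "t1", "name": "relevance", "source": "manual", "value": 4.0},
    {"trace_id": "t1", "name": "relevance", "source": "llm_judge", "value": 3.5},
    {"trace_id": "t2", "name": "relevance", "source": "manual", "value": 5.0},
    {"trace_id": "t2", "name": "relevance", "source": "llm_judge", "value": 4.0},
]
print(judge_disagreement(scores, "relevance"))  # 0.75
```

A low disagreement suggests the judge can be trusted at scale; a rising one is a cue to revisit the config's description.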

## LLM Judge Scoring

PromptRails supports automated scoring using LLM judges. An LLM judge evaluates agent outputs against criteria you define:

```python
# Automated scoring is configured through score configs
# and triggered during or after execution
config = client.scores.create_config(
    name="Helpfulness",
    data_type="numeric",
    min_value=1.0,
    max_value=5.0,
    description="How helpful was the response to the user's question?"
)
```

LLM judges evaluate outputs based on the score config's description and data type, producing consistent, scalable evaluations.
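PromptRails runs judges for you, so you never build the prompt yourself; purely as an illustration of the idea, here is a sketch of how a config's description, data type, and bounds could constrain a judge's answer format (the `build_judge_prompt` helper is hypothetical):

```python
def build_judge_prompt(config: dict, agent_output: str) -> str:
    """Assemble an evaluation prompt from a score config's fields."""
    if config["data_type"] == "numeric":
        answer = (f"Reply with a number between {config['min_value']} "
                  f"and {config['max_value']}.")
    elif config["data_type"] == "categorical":
        answer = "Reply with one of: " + ", ".join(config["categories"]) + "."
    else:  # boolean
        answer = "Reply with true or false."
    return (f"Criterion: {config['description']}\n"
            f"Agent output:\n{agent_output}\n\n{answer}")

prompt = build_judge_prompt(
    {"data_type": "numeric", "min_value": 1.0, "max_value": 5.0,
     "description": "How helpful was the response to the user's question?"},
    "Paris is the capital of France.",
)
```

This is why a precise, answerable `description` matters: it is the judge's entire rubric.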

## Execution-Level Scoring

Score an entire execution:

```python
client.scores.create(
    trace_id="trace-id",
    execution_id="execution-id",
    agent_id="agent-id",
    name="overall_quality",
    data_type="numeric",
    value=4.0,
    source="manual",
    comment="Good response with minor formatting issues"
)
```

## Span-Level Scoring

Score an individual span within an execution (e.g., a specific LLM call or tool result):

```python
client.scores.create(
    trace_id="trace-id",
    span_id="span-id",
    name="tool_accuracy",
    data_type="boolean",
    bool_value=True,
    source="api",
    comment="Tool returned correct data"
)
```

## Listing Scores

```python
# List scores for a trace
scores = client.scores.list(trace_id="trace-id")

# List scores for an agent
scores = client.scores.list(agent_id="agent-id", page=1, limit=50)

for score in scores["data"]:
    # Read the field that matches the data type; chaining `or` here would
    # silently skip legitimate falsy values such as 0.0 or False.
    value_field = {"numeric": "value",
                   "categorical": "string_value",
                   "boolean": "bool_value"}[score["data_type"]]
    print(f"{score['name']}: {score.get(value_field)}")
```

## Score Fields

| Field           | Type      | Description                            |
| --------------- | --------- | -------------------------------------- |
| `id`            | KSUID     | Unique score identifier                |
| `workspace_id`  | KSUID     | Workspace scope                        |
| `trace_id`      | string    | Associated trace                       |
| `span_id`       | string    | Optional specific span                 |
| `name`          | string    | Score name                             |
| `value`         | float     | Numeric value (for numeric type)       |
| `string_value`  | string    | Category value (for categorical type)  |
| `bool_value`    | boolean   | Boolean value (for boolean type)       |
| `data_type`     | string    | `numeric`, `categorical`, or `boolean` |
| `comment`       | string    | Optional comment or explanation        |
| `source`        | string    | `manual`, `api`, or `llm_judge`        |
| `config_id`     | KSUID     | Optional score config reference        |
| `execution_id`  | KSUID     | Optional execution reference           |
| `agent_id`      | KSUID     | Optional agent reference               |
| `created_by_id` | KSUID     | User who created the score             |
| `created_at`    | timestamp | Creation time                          |

## Best Practices

- **Define score configs early** -- Create standardized scoring templates before starting evaluation
- **Score consistently** -- Use the same configs across agent versions for meaningful comparisons
- **Combine sources** -- Use manual scoring for calibration and LLM judges for scale
- **Score at the right level** -- Use execution-level scores for overall quality and span-level scores for component accuracy
- **Track over time** -- Monitor score trends to detect regressions
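The last practice can be mechanized with a small aggregation over listed scores. A sketch, assuming score dicts shaped like the fields table above with ISO-8601 `created_at` timestamps (the `daily_means` helper and the sample data are illustrative):

```python
from collections import defaultdict
from statistics import mean

def daily_means(scores: list[dict], name: str) -> dict[str, float]:
    """Average a numeric score per calendar day to spot regressions."""
    by_day: dict[str, list[float]] = defaultdict(list)
    for s in scores:
        if s["name"] == name and s["data_type"] == "numeric":
            by_day[s["created_at"][:10]].append(s["value"])  # ISO date prefix
    return {day: mean(vals) for day, vals in sorted(by_day.items())}

scores = [
    {"name": "overall_quality", "data_type": "numeric",
     "value": 4.0, "created_at": "2024-05-01T09:00:00Z"},
    {"name": "overall_quality", "data_type": "numeric",
     "value": 3.0, "created_at": "2024-05-01T17:00:00Z"},
    {"name": "overall_quality", "data_type": "numeric",
     "value": 4.5, "created_at": "2024-05-02T10:00:00Z"},
]
print(daily_means(scores, "overall_quality"))
# {'2024-05-01': 3.5, '2024-05-02': 4.5}
```

A sudden drop in the daily mean after a deploy is the regression signal this practice is after.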

## Related Topics

- [Executions](/docs/executions) -- What gets scored
- [Tracing](/docs/tracing) -- Spans that can be individually scored
- [Cost Tracking](/docs/cost-tracking) -- Correlate quality with cost
