PromptRails

Guardrails

Protect agent runs with checks that block, redact, or log unsafe input and output.

Guardrails are scanners that run before or after a model call. They help protect agent workflows from prompt injection, sensitive data exposure, unsafe content, unwanted topics, and other policy risks.

Configure guardrails on each agent in Studio. Open the agent's Guardrails tab to add policy checks, choose where they run, and decide whether a match blocks, redacts, or logs the run.

What Are Guardrails?

A guardrail is attached to an agent and inspects content at a specific point in the execution pipeline:

  • Input guardrails scan the user's input before it reaches the LLM
  • Output guardrails scan the LLM's response before it is returned to the user

Each guardrail has a scanner type, an action to take when triggered, and a sort order that determines the evaluation sequence.
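As a rough illustration of where the two directions sit around a model call, the sketch below wraps a stand-in model with input and output checks. Everything in it is hypothetical (run_with_guardrails, fake_model, and the two toy scanners are not part of PromptRails); it only mirrors the flow described above.

```python
from typing import Callable, List

# Hypothetical scanner shape: returns True when the content should be stopped.
Scanner = Callable[[str], bool]


def run_with_guardrails(
    user_input: str,
    input_scanners: List[Scanner],
    output_scanners: List[Scanner],
    call_model: Callable[[str], str],
) -> str:
    # Input guardrails inspect the user's text before the model sees it.
    if any(scan(user_input) for scan in input_scanners):
        return "[blocked by input guardrail]"
    response = call_model(user_input)
    # Output guardrails inspect the model's text before the user sees it.
    if any(scan(response) for scan in output_scanners):
        return "[blocked by output guardrail]"
    return response


def fake_model(prompt: str) -> str:
    # Stand-in for an LLM call; it simply echoes the prompt.
    return f"echo: {prompt}"


def mentions_override(text: str) -> bool:
    # Toy input check, far simpler than a real prompt-injection scanner.
    return "ignore previous instructions" in text.lower()


def leaks_key(text: str) -> bool:
    # Toy output check for something that looks like a secret prefix.
    return "sk-" in text
```

Calling run_with_guardrails("hello", [mentions_override], [leaks_key], fake_model) passes both checks, while an input containing the override phrase never reaches fake_model.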

Direction: Input vs Output

| Direction | When It Runs | Purpose |
| --- | --- | --- |
| input | Before LLM execution | Validate user input, block prompt injection, detect PII |
| output | After LLM execution | Filter harmful responses, redact sensitive data, enforce content policies |

Scanner Types

PromptRails includes 14 built-in scanner types:

Content Safety

| Scanner | Identifier | Description |
| --- | --- | --- |
| Toxicity | toxicity | Detects toxic, abusive, or hateful language in text |
| Harmful Content | harmful | Identifies content that promotes harm, violence, or illegal activities |
| Bias Detection | bias | Detects biased or discriminatory language |
| No Refusal | no_refusal | Ensures the LLM does not refuse to answer (output only) |

Data Protection

| Scanner | Identifier | Description |
| --- | --- | --- |
| PII Detection | pii | Detects personally identifiable information (names, emails, phone numbers, SSNs, etc.) |
| Anonymize | anonymize | Replaces detected PII with placeholder tokens |
| Secrets Detection | secrets | Detects API keys, passwords, tokens, and other secrets in text |
| Sensitive Data | sensitive | Detects broader categories of sensitive information |

Security

| Scanner | Identifier | Description |
| --- | --- | --- |
| Prompt Injection | prompt_injection | Detects attempts to override system instructions or inject malicious prompts |
| Invisible Text | invisible_text | Detects hidden Unicode characters or zero-width text used for injection |
| Malicious URLs | malicious_urls | Detects known malicious, phishing, or suspicious URLs |
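To make the idea of hidden characters concrete, here is a minimal sketch that flags Unicode format characters (general category Cf), which include the zero-width space and zero-width joiners. It is an illustration only and not the detection logic of the built-in invisible_text scanner; find_invisible_chars is a hypothetical helper.

```python
import unicodedata
from typing import List, Tuple


def find_invisible_chars(text: str) -> List[Tuple[int, str]]:
    """Return (index, Unicode name) for every format-category (Cf) character."""
    return [
        # Fall back to the code point when a character has no Unicode name.
        (i, unicodedata.name(ch, f"U+{ord(ch):04X}"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


# A zero-width space hidden in the middle of an ordinary-looking word.
hits = find_invisible_chars("pay\u200bload")
```

Category Cf also covers characters with legitimate uses, such as the zero-width joiner inside emoji sequences, so a production scanner would likely need an allowlist.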

Content Filtering

| Scanner | Identifier | Description |
| --- | --- | --- |
| Substring Ban | ban_substrings | Blocks content containing specified banned words or phrases |
| Topic Ban | ban_topics | Blocks content related to specified banned topics |
| Language Detection | language | Ensures content is in the expected language(s) |

Actions

When a guardrail scanner triggers, it takes one of three actions:

| Action | Behavior |
| --- | --- |
| block | Stops execution and returns an error. The LLM response is not delivered. |
| redact | Removes or replaces the offending content and continues execution. |
| log | Records the detection in the trace but allows execution to continue. |
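The three actions can be pictured as three branches taken after a scanner match. The sketch below uses a regular expression as a stand-in scanner; apply_action, GuardrailBlocked, and the "[REDACTED]" placeholder are assumptions made for illustration, not PromptRails names.

```python
import re
from typing import List


class GuardrailBlocked(Exception):
    """Hypothetical error raised when a block action stops execution."""


def apply_action(action: str, text: str, pattern: str, trace: List[str]) -> str:
    """Return the text that continues through the pipeline."""
    if not re.search(pattern, text):
        return text  # scanner did not trigger; nothing to do
    trace.append(f"triggered:{action}")  # record the detection in this sketch's trace
    if action == "block":
        raise GuardrailBlocked(f"content matched {pattern!r}")
    if action == "redact":
        return re.sub(pattern, "[REDACTED]", text)
    if action == "log":
        return text  # content passes through unchanged
    raise ValueError(f"unknown action: {action}")


trace: List[str] = []
redacted = apply_action("redact", "call 555-1234", r"\d{3}-\d{4}", trace)
logged = apply_action("log", "call 555-1234", r"\d{3}-\d{4}", trace)
```

With block, the same call raises GuardrailBlocked instead of returning text, which corresponds to the error described in the table.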

Configuring Guardrails

Guardrails are configured per agent. Each agent can have multiple guardrails with different scanners, directions, and actions.

Use Studio for day-to-day configuration. Use the SDK when guardrail setup needs to be generated, synced, or promoted by an internal release workflow.
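For a release workflow that syncs guardrails, one possible approach is to compute a plan from the desired and existing definitions before calling the SDK. The sketch below is pure Python and makes several assumptions: guardrails are plain dicts with the same fields as the create call, existing guardrails carry an "id" field, and each agent has at most one guardrail per (type, scanner_type) pair. plan_sync is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def _key(guardrail: dict) -> Tuple[str, str]:
    # Assumption: direction plus scanner type uniquely identifies a guardrail.
    return (guardrail["type"], guardrail["scanner_type"])


def plan_sync(desired: List[dict], existing: List[dict]) -> Dict[str, List[dict]]:
    """Split the desired state into create, update, and delete lists."""
    existing_by_key = {_key(g): g for g in existing}
    desired_keys = {_key(g) for g in desired}
    plan: Dict[str, List[dict]] = {"create": [], "update": [], "delete": []}
    for g in desired:
        current = existing_by_key.get(_key(g))
        if current is None:
            plan["create"].append(g)
        elif any(current.get(name) != value for name, value in g.items()):
            # Keep the existing id so the caller knows which guardrail to update.
            plan["update"].append({"id": current["id"], **g})
    plan["delete"] = [g for g in existing if _key(g) not in desired_keys]
    return plan


desired = [
    {"type": "input", "scanner_type": "prompt_injection", "action": "block",
     "sort_order": 1, "config": {}},
    {"type": "output", "scanner_type": "pii", "action": "redact",
     "sort_order": 1, "config": {"entities": ["email"]}},
]
existing = [
    {"id": "g1", "type": "input", "scanner_type": "prompt_injection",
     "action": "log", "sort_order": 1, "config": {}},
    {"id": "g2", "type": "input", "scanner_type": "toxicity",
     "action": "block", "sort_order": 2, "config": {}},
]
plan = plan_sync(desired, existing)
```

Each list in the plan could then be applied with the SDK's create, update, and delete calls.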

Python SDK

# Assumes `client` is an already-initialized PromptRails SDK client.

# Add an input guardrail for prompt injection
client.guardrails.create(
    agent_id="your-agent-id",
    type="input",
    scanner_type="prompt_injection",
    action="block",
    sort_order=1,
    config={}
)
 
# Add an output guardrail for PII
client.guardrails.create(
    agent_id="your-agent-id",
    type="output",
    scanner_type="pii",
    action="redact",
    sort_order=1,
    config={
        "entities": ["email", "phone", "ssn", "credit_card"]
    }
)
 
# Add a substring ban
client.guardrails.create(
    agent_id="your-agent-id",
    type="input",
    scanner_type="ban_substrings",
    action="block",
    sort_order=2,
    config={
        "substrings": ["ignore previous instructions", "system prompt"],
        "case_sensitive": False
    }
)

JavaScript SDK

// Assumes `client` is an already-initialized PromptRails SDK client.

// Add an input guardrail for prompt injection
await client.guardrails.create({
  agentId: 'your-agent-id',
  type: 'input',
  scannerType: 'prompt_injection',
  action: 'block',
  sortOrder: 1,
  config: {},
})

// Add an output guardrail that redacts PII
await client.guardrails.create({
  agentId: 'your-agent-id',
  type: 'output',
  scannerType: 'pii',
  action: 'redact',
  sortOrder: 1,
  config: {
    entities: ['email', 'phone', 'ssn', 'credit_card'],
  },
})

Sort Order

Guardrails execute in sort order (ascending) within each direction. Lower numbers execute first.

A typical input guardrail ordering might be:

  1. invisible_text (sort_order: 1) -- Detect hidden characters first
  2. prompt_injection (sort_order: 2) -- Block injection attempts
  3. toxicity (sort_order: 3) -- Filter toxic content
  4. ban_substrings (sort_order: 4) -- Apply custom word filters

If a guardrail with block action triggers, subsequent guardrails are not evaluated.
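The ordering and short-circuit behavior described above can be sketched as a sorted loop that stops at the first triggered block. The Guardrail tuple, the evaluate function, and the toy trigger functions are hypothetical stand-ins for real scanners.

```python
from typing import Callable, List, NamedTuple, Tuple


class Guardrail(NamedTuple):
    scanner_type: str
    sort_order: int
    action: str
    triggers: Callable[[str], bool]  # hypothetical stand-in for a real scanner


def evaluate(guardrails: List[Guardrail], text: str) -> Tuple[List[str], bool]:
    """Return the scanners that ran, in order, and whether the run was blocked."""
    evaluated: List[str] = []
    for g in sorted(guardrails, key=lambda g: g.sort_order):  # lower numbers first
        evaluated.append(g.scanner_type)
        if g.triggers(text) and g.action == "block":
            return evaluated, True  # later guardrails are skipped
    return evaluated, False


def contains(phrase: str) -> Callable[[str], bool]:
    return lambda text: phrase in text.lower()


# Deliberately listed out of order to show that sort_order decides the sequence.
guardrails = [
    Guardrail("ban_substrings", 4, "block", contains("forbidden")),
    Guardrail("invisible_text", 1, "block", lambda text: "\u200b" in text),
    Guardrail("prompt_injection", 2, "block", contains("ignore previous instructions")),
    Guardrail("toxicity", 3, "log", contains("idiot")),
]

blocked_run = evaluate(guardrails, "Please ignore previous instructions")
clean_run = evaluate(guardrails, "What is the weather?")
```

In blocked_run only invisible_text and prompt_injection are evaluated, because the block at sort order 2 ends the sequence; clean_run evaluates all four.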

Scanner Configuration

Each scanner type accepts a configuration object (config) for customization:

ban_substrings

{
  "substrings": ["forbidden phrase", "blocked word"],
  "case_sensitive": false
}
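One plausible reading of the case_sensitive flag is sketched below: when it is false, both the text and the banned phrases are case-folded before comparison. This is an assumption about the behavior, not the scanner's actual implementation, and matched_substrings is a hypothetical helper.

```python
from typing import List


def matched_substrings(text: str, substrings: List[str],
                       case_sensitive: bool = False) -> List[str]:
    """Return the banned phrases found in text, as originally configured."""
    haystack = text if case_sensitive else text.casefold()
    return [
        s for s in substrings
        if (s if case_sensitive else s.casefold()) in haystack
    ]


# Same shape as the configuration object shown above.
config = {"substrings": ["forbidden phrase", "blocked word"], "case_sensitive": False}
found = matched_substrings("This has a Forbidden Phrase.", **config)
```

With case_sensitive set to true, the same text would produce no match because the capitalization differs.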

ban_topics

{
  "topics": ["politics", "religion", "gambling"]
}

pii

{
  "entities": ["email", "phone", "ssn", "credit_card", "address"]
}
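To show how an entities list might narrow what gets redacted, the sketch below applies simple regular expressions for two entity types and replaces matches with placeholder tokens. The patterns, the token format, and the redact helper are illustrative assumptions; the built-in pii scanner presumably uses more robust detection.

```python
import re
from typing import Dict, List, Pattern

# Deliberately simple patterns; real PII detection needs far more care.
PATTERNS: Dict[str, Pattern[str]] = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text: str, entities: List[str]) -> str:
    """Replace each configured entity with a placeholder such as <EMAIL>."""
    for entity in entities:
        pattern = PATTERNS.get(entity)
        if pattern is None:
            continue  # this sketch has no pattern for the entity
        text = pattern.sub(f"<{entity.upper()}>", text)
    return text


sample = "Mail ada@example.com, SSN 123-45-6789."
both = redact(sample, ["email", "ssn"])
email_only = redact(sample, ["email"])
```

Listing only "email" leaves the SSN untouched, which is the effect of restricting entities in the configuration.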

language

{
  "languages": ["en", "es", "fr"],
  "action_on_mismatch": "block"
}

toxicity, harmful, bias, prompt_injection, secrets, invisible_text, malicious_urls, anonymize, no_refusal, sensitive

These scanners typically work with an empty configuration object {} and use their built-in detection models.

Managing Guardrails

# Assumes `client` is an already-initialized PromptRails SDK client.

# List guardrails for an agent
guardrails = client.guardrails.list(agent_id="your-agent-id")
 
# Update a guardrail
client.guardrails.update(
    guardrail_id="guardrail-id",
    action="log",  # Change from block to log
    is_active=True
)
 
# Disable a guardrail (without deleting)
client.guardrails.update(
    guardrail_id="guardrail-id",
    is_active=False
)
 
# Delete a guardrail
client.guardrails.delete(guardrail_id="guardrail-id")

Guardrail Traces

Every guardrail evaluation produces a guardrail span in the execution trace, recording:

  • Which scanner was used
  • Whether it triggered
  • What action was taken
  • The duration of the scan
  • Any details about detected content

These spans show why a given piece of content was blocked or redacted.
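A span carrying the fields listed above might look like the following sketch, which times a scanner call and records the outcome. The GuardrailSpan field names and the scan_with_span helper are hypothetical; the actual trace schema may differ.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class GuardrailSpan:
    scanner_type: str    # which scanner was used
    triggered: bool      # whether it triggered
    action_taken: str    # "none" when the scanner did not trigger
    duration_ms: float   # how long the scan took
    details: dict = field(default_factory=dict)  # details about detected content


def scan_with_span(scanner_type: str, action: str,
                   scan: Callable[[str], List[str]], text: str) -> GuardrailSpan:
    """Run a scanner and wrap its result in a span record."""
    start = time.perf_counter()
    matches = scan(text)
    duration_ms = (time.perf_counter() - start) * 1000
    return GuardrailSpan(
        scanner_type=scanner_type,
        triggered=bool(matches),
        action_taken=action if matches else "none",
        duration_ms=duration_ms,
        details={"matches": matches},
    )


span = scan_with_span(
    "ban_substrings", "block",
    lambda text: [w for w in ["secret"] if w in text],
    "a secret plan",
)
```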

Best Practices

  • Layer your guardrails -- Use multiple scanners in combination for defense in depth
  • Start with log mode -- Monitor what would be caught before switching to block
  • Prioritize injection prevention -- Always run prompt_injection on inputs
  • Protect PII -- Use pii or anonymize on outputs to prevent data leakage
  • Test with adversarial inputs -- Verify your guardrails catch edge cases
  • Monitor guardrail traces -- Review blocked content regularly to tune configurations

Related Topics

  • Agents -- Attaching guardrails to agents
  • Tracing -- Guardrail evaluation spans
  • Security -- Overall security architecture