Guardrails

Protect your agents with 14 built-in scanner types for input and output validation, including toxicity detection, PII filtering, and prompt injection prevention.

Guardrails are safety scanners that validate agent inputs and outputs before and after LLM execution. They protect against harmful content, data leakage, prompt injection, and other risks.

What Are Guardrails?

A guardrail is a scanner attached to an agent that inspects content at a specific point in the execution pipeline:

  • Input guardrails scan the user's input before it reaches the LLM
  • Output guardrails scan the LLM's response before it is returned to the user

Each guardrail has a scanner type, an action to take when triggered, and a sort order that determines the evaluation sequence.
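These three attributes could be modeled as a simple record. The sketch below is illustrative only, not the SDK's actual data model, and exists just to make the evaluation-order rule concrete:

```python
from dataclasses import dataclass, field

@dataclass
class Guardrail:
    # Field names are assumptions for illustration, not the real schema.
    scanner_type: str          # e.g. "pii", "prompt_injection"
    action: str                # "block", "redact", or "log"
    sort_order: int            # lower numbers run first
    direction: str = "input"   # "input" or "output"
    config: dict = field(default_factory=dict)

rails = [
    Guardrail("toxicity", "block", 2),
    Guardrail("prompt_injection", "block", 1),
]

# Evaluation follows ascending sort_order within a direction.
ordered = sorted(rails, key=lambda g: g.sort_order)
```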

Direction: Input vs Output

| Direction | When It Runs | Purpose |
|-----------|--------------|---------|
| input | Before LLM execution | Validate user input, block prompt injection, detect PII |
| output | After LLM execution | Filter harmful responses, redact sensitive data, enforce content policies |

Scanner Types

PromptRails includes 14 built-in scanner types:

Content Safety

| Scanner | Identifier | Description |
|---------|------------|-------------|
| Toxicity | toxicity | Detects toxic, abusive, or hateful language in text |
| Harmful Content | harmful | Identifies content that promotes harm, violence, or illegal activities |
| Bias Detection | bias | Detects biased or discriminatory language |
| No Refusal | no_refusal | Ensures the LLM does not refuse to answer (output only) |

Data Protection

| Scanner | Identifier | Description |
|---------|------------|-------------|
| PII Detection | pii | Detects personally identifiable information (names, emails, phone numbers, SSNs, etc.) |
| Anonymize | anonymize | Replaces detected PII with placeholder tokens |
| Secrets Detection | secrets | Detects API keys, passwords, tokens, and other secrets in text |
| Sensitive Data | sensitive | Detects broader categories of sensitive information |

Security

| Scanner | Identifier | Description |
|---------|------------|-------------|
| Prompt Injection | prompt_injection | Detects attempts to override system instructions or inject malicious prompts |
| Invisible Text | invisible_text | Detects hidden Unicode characters or zero-width text used for injection |
| Malicious URLs | malicious_urls | Detects known malicious, phishing, or suspicious URLs |

Content Filtering

| Scanner | Identifier | Description |
|---------|------------|-------------|
| Substring Ban | ban_substrings | Blocks content containing specified banned words or phrases |
| Topic Ban | ban_topics | Blocks content related to specified banned topics |
| Language Detection | language | Ensures content is in the expected language(s) |

Actions

When a guardrail scanner triggers, it takes one of three actions:

| Action | Behavior |
|--------|----------|
| block | Stops execution and returns an error. The LLM response is not delivered. |
| redact | Removes or replaces the offending content and continues execution. |
| log | Records the detection in the trace but allows execution to continue. |
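The three behaviors amount to a dispatch over the scan result. The following is an illustrative sketch of that control flow; the function and exception names are hypothetical, not part of the PromptRails SDK:

```python
class GuardrailBlocked(Exception):
    """Raised when a block-action guardrail triggers (illustrative name)."""

def apply_action(action: str, triggered: bool, content: str, redacted: str) -> str:
    """Return the content that continues down the pipeline."""
    if not triggered:
        return content                      # scanner found nothing; pass through
    if action == "block":
        raise GuardrailBlocked("content rejected by guardrail")
    if action == "redact":
        return redacted                     # offending spans replaced; execution continues
    if action == "log":
        print("guardrail triggered (logged only)")  # stand-in for trace logging
        return content                      # original content continues unchanged
    raise ValueError(f"unknown action: {action}")
```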

Configuring Guardrails

Guardrails are configured per agent. Each agent can have multiple guardrails with different scanners, directions, and actions.

Python SDK

# Add an input guardrail for prompt injection
client.guardrails.create(
    agent_id="your-agent-id",
    type="input",
    scanner_type="prompt_injection",
    action="block",
    sort_order=1,
    config={}
)
 
# Add an output guardrail for PII
client.guardrails.create(
    agent_id="your-agent-id",
    type="output",
    scanner_type="pii",
    action="redact",
    sort_order=1,
    config={
        "entities": ["email", "phone", "ssn", "credit_card"]
    }
)
 
# Add a substring ban
client.guardrails.create(
    agent_id="your-agent-id",
    type="input",
    scanner_type="ban_substrings",
    action="block",
    sort_order=2,
    config={
        "substrings": ["ignore previous instructions", "system prompt"],
        "case_sensitive": False
    }
)

JavaScript SDK

await client.guardrails.create({
  agentId: 'your-agent-id',
  type: 'input',
  scannerType: 'prompt_injection',
  action: 'block',
  sortOrder: 1,
  config: {},
})
 
await client.guardrails.create({
  agentId: 'your-agent-id',
  type: 'output',
  scannerType: 'pii',
  action: 'redact',
  sortOrder: 1,
  config: {
    entities: ['email', 'phone', 'ssn', 'credit_card'],
  },
})

Sort Order

Guardrails execute in sort order (ascending) within each direction. Lower numbers execute first.

A typical input guardrail ordering might be:

  1. invisible_text (sort_order: 1) -- Detect hidden characters first
  2. prompt_injection (sort_order: 2) -- Block injection attempts
  3. toxicity (sort_order: 3) -- Filter toxic content
  4. ban_substrings (sort_order: 4) -- Apply custom word filters

If a guardrail with a block action triggers, subsequent guardrails are not evaluated.
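Putting ordering and short-circuiting together, an input pipeline can be sketched as follows. The scanner callables here are toy stand-ins, not real SDK calls:

```python
def run_input_guardrails(guardrails, text):
    """Evaluate guardrails in ascending sort_order; stop at the first block."""
    for g in sorted(guardrails, key=lambda g: g["sort_order"]):
        triggered = g["scan"](text)          # hypothetical scanner callable
        if triggered and g["action"] == "block":
            return None, g["scanner_type"]   # blocked: later guardrails never run
    return text, None

# Toy scanners standing in for the real detection models.
rails = [
    {"scanner_type": "toxicity", "action": "block", "sort_order": 3,
     "scan": lambda t: "hate" in t.lower()},
    {"scanner_type": "prompt_injection", "action": "block", "sort_order": 2,
     "scan": lambda t: "ignore previous instructions" in t.lower()},
]
```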

Scanner Configuration

Each scanner type accepts a configuration object (config) for customization:

ban_substrings

{
  "substrings": ["forbidden phrase", "blocked word"],
  "case_sensitive": false
}
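The matching this config implies is plain substring search, lowercasing both sides when case_sensitive is false. A minimal sketch of that check (not the scanner's actual implementation):

```python
def ban_substrings(text: str, substrings: list[str], case_sensitive: bool = False) -> bool:
    """Return True if any banned substring appears in text."""
    haystack = text if case_sensitive else text.lower()
    for s in substrings:
        needle = s if case_sensitive else s.lower()
        if needle in haystack:
            return True
    return False
```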

ban_topics

{
  "topics": ["politics", "religion", "gambling"]
}

pii

{
  "entities": ["email", "phone", "ssn", "credit_card", "address"]
}
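To illustrate what entity-based detection and redaction does, here is a toy regex sketch for two of the entity types above. The real scanner almost certainly uses trained models rather than patterns this naive:

```python
import re

# Toy patterns for illustration only; real PII detection is far more robust.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact_pii(text: str, entities: list[str]) -> str:
    """Replace each detected entity with a placeholder token like [EMAIL]."""
    for entity in entities:
        pattern = PATTERNS.get(entity)
        if pattern:
            text = pattern.sub(f"[{entity.upper()}]", text)
    return text
```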

language

{
  "languages": ["en", "es", "fr"],
  "action_on_mismatch": "block"
}

toxicity, harmful, bias, prompt_injection, secrets, invisible_text, malicious_urls, anonymize, no_refusal, sensitive

These scanners typically work with an empty configuration object {} and use their built-in detection models.

Managing Guardrails

# List guardrails for an agent
guardrails = client.guardrails.list(agent_id="your-agent-id")
 
# Update a guardrail
client.guardrails.update(
    guardrail_id="guardrail-id",
    action="log",  # Change from block to log
    is_active=True
)
 
# Disable a guardrail (without deleting)
client.guardrails.update(
    guardrail_id="guardrail-id",
    is_active=False
)
 
# Delete a guardrail
client.guardrails.delete(guardrail_id="guardrail-id")

Guardrail Traces

Every guardrail evaluation produces a guardrail span in the execution trace, recording:

  • Which scanner was used
  • Whether it triggered
  • What action was taken
  • The duration of the scan
  • Any details about detected content

This provides full visibility into why content was blocked or redacted.
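A guardrail span might carry fields like the following. The field names are assumptions shaped by the bullet list above, not the documented trace schema:

```python
# Hypothetical shape of a guardrail span; all field names are illustrative.
span = {
    "scanner_type": "pii",          # which scanner was used
    "direction": "output",
    "triggered": True,              # whether it triggered
    "action_taken": "redact",       # what action was taken
    "duration_ms": 12,              # duration of the scan
    "details": {"entities_found": ["email"]},  # details about detected content
}
```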

Best Practices

  • Layer your guardrails -- Use multiple scanners in combination for defense in depth
  • Start with log mode -- Monitor what would be caught before switching to block
  • Prioritize injection prevention -- Always run prompt_injection on inputs
  • Protect PII -- Use pii or anonymize on outputs to prevent data leakage
  • Test with adversarial inputs -- Verify your guardrails catch edge cases
  • Monitor guardrail traces -- Review blocked content regularly to tune configurations

See Also

  • Agents -- Attaching guardrails to agents
  • Tracing -- Guardrail evaluation spans
  • Security -- Overall security architecture