Guardrails
Protect your agents with 14 built-in scanner types for input and output validation, including toxicity detection, PII filtering, and prompt injection prevention.
Guardrails are safety scanners that validate agent inputs and outputs before and after LLM execution. They protect against harmful content, data leakage, prompt injection, and other risks.
What Are Guardrails?
A guardrail is a scanner attached to an agent that inspects content at a specific point in the execution pipeline:
- Input guardrails scan the user's input before it reaches the LLM
- Output guardrails scan the LLM's response before it is returned to the user
Each guardrail has a scanner type, an action to take when triggered, and a sort order that determines the evaluation sequence.
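The pieces above can be pictured as a small record. This is an illustrative model only, not an actual SDK or internal type; the field names simply mirror the SDK parameters used later on this page:

```python
from dataclasses import dataclass, field

# Illustrative model of a guardrail definition. Field names mirror the
# SDK parameters (type/direction, scanner_type, action, sort_order, config);
# this is not a real PromptRails class.
@dataclass
class Guardrail:
    direction: str        # "input" or "output"
    scanner_type: str     # e.g. "prompt_injection", "pii", "ban_substrings"
    action: str           # "block", "redact", or "log"
    sort_order: int       # lower numbers run first within a direction
    config: dict = field(default_factory=dict)

g = Guardrail("input", "prompt_injection", "block", 1)
```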
Direction: Input vs Output
| Direction | When It Runs | Purpose |
|---|---|---|
| input | Before LLM execution | Validate user input, block prompt injection, detect PII |
| output | After LLM execution | Filter harmful responses, redact sensitive data, enforce content policies |
Scanner Types
PromptRails includes 14 built-in scanner types:
Content Safety
| Scanner | Identifier | Description |
|---|---|---|
| Toxicity | toxicity | Detects toxic, abusive, or hateful language in text |
| Harmful Content | harmful | Identifies content that promotes harm, violence, or illegal activities |
| Bias Detection | bias | Detects biased or discriminatory language |
| No Refusal | no_refusal | Ensures the LLM does not refuse to answer (output only) |
Data Protection
| Scanner | Identifier | Description |
|---|---|---|
| PII Detection | pii | Detects personally identifiable information (names, emails, phone numbers, SSNs, etc.) |
| Anonymize | anonymize | Replaces detected PII with placeholder tokens |
| Secrets Detection | secrets | Detects API keys, passwords, tokens, and other secrets in text |
| Sensitive Data | sensitive | Detects broader categories of sensitive information |
Security
| Scanner | Identifier | Description |
|---|---|---|
| Prompt Injection | prompt_injection | Detects attempts to override system instructions or inject malicious prompts |
| Invisible Text | invisible_text | Detects hidden Unicode characters or zero-width text used for injection |
| Malicious URLs | malicious_urls | Detects known malicious, phishing, or suspicious URLs |
Content Filtering
| Scanner | Identifier | Description |
|---|---|---|
| Substring Ban | ban_substrings | Blocks content containing specified banned words or phrases |
| Topic Ban | ban_topics | Blocks content related to specified banned topics |
| Language Detection | language | Ensures content is in the expected language(s) |
Actions
When a guardrail scanner triggers, it takes one of three actions:
| Action | Behavior |
|---|---|
| block | Stops execution and returns an error. The LLM response is not delivered. |
| redact | Removes or replaces the offending content and continues execution. |
| log | Records the detection in the trace but allows execution to continue. |
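The three behaviors can be sketched as a single dispatch function. This is a hypothetical illustration of the semantics in the table, not PromptRails code; `redact_fn` stands in for whatever replacement a real scanner performs:

```python
# Hypothetical sketch of how each action affects a scan result.
# `redact_fn` is a stand-in for a scanner's real redaction logic.
def apply_action(action, text, triggered, redact_fn=lambda t: "[REDACTED]"):
    if not triggered:
        # Scanner found nothing: content passes through untouched.
        return {"text": text, "blocked": False, "logged": False}
    if action == "block":
        # Execution stops; no content is delivered.
        return {"text": None, "blocked": True, "logged": True}
    if action == "redact":
        # Offending content is replaced and execution continues.
        return {"text": redact_fn(text), "blocked": False, "logged": True}
    if action == "log":
        # Detection is recorded but content passes through unchanged.
        return {"text": text, "blocked": False, "logged": True}
    raise ValueError(f"unknown action: {action}")
```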
Configuring Guardrails
Guardrails are configured per agent. Each agent can have multiple guardrails with different scanners, directions, and actions.
Python SDK
```python
# Add an input guardrail for prompt injection
client.guardrails.create(
    agent_id="your-agent-id",
    type="input",
    scanner_type="prompt_injection",
    action="block",
    sort_order=1,
    config={}
)

# Add an output guardrail for PII
client.guardrails.create(
    agent_id="your-agent-id",
    type="output",
    scanner_type="pii",
    action="redact",
    sort_order=1,
    config={
        "entities": ["email", "phone", "ssn", "credit_card"]
    }
)

# Add a substring ban
client.guardrails.create(
    agent_id="your-agent-id",
    type="input",
    scanner_type="ban_substrings",
    action="block",
    sort_order=2,
    config={
        "substrings": ["ignore previous instructions", "system prompt"],
        "case_sensitive": False
    }
)
```

JavaScript SDK
```javascript
await client.guardrails.create({
  agentId: 'your-agent-id',
  type: 'input',
  scannerType: 'prompt_injection',
  action: 'block',
  sortOrder: 1,
  config: {},
})

await client.guardrails.create({
  agentId: 'your-agent-id',
  type: 'output',
  scannerType: 'pii',
  action: 'redact',
  sortOrder: 1,
  config: {
    entities: ['email', 'phone', 'ssn', 'credit_card'],
  },
})
```

Sort Order
Guardrails execute in sort order (ascending) within each direction. Lower numbers execute first.
A typical input guardrail ordering might be:
- invisible_text (sort_order: 1) -- Detect hidden characters first
- prompt_injection (sort_order: 2) -- Block injection attempts
- toxicity (sort_order: 3) -- Filter toxic content
- ban_substrings (sort_order: 4) -- Apply custom word filters
If a guardrail with block action triggers, subsequent guardrails are not evaluated.
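The ordering and short-circuit behavior can be sketched as a simple loop. This is an illustrative evaluation model, not the actual pipeline; `scan` is a stand-in for real scanner logic, and only the `block` action short-circuits in this sketch:

```python
# Illustrative sketch: evaluate guardrails in ascending sort_order and
# stop as soon as a triggered guardrail has action "block".
# `scan(scanner_type, text)` is a hypothetical predicate standing in
# for real scanner logic.
def run_guardrails(guardrails, text, scan):
    for g in sorted(guardrails, key=lambda g: g["sort_order"]):
        if scan(g["scanner_type"], text):
            if g["action"] == "block":
                return {"blocked": True, "triggered": g["scanner_type"]}
    return {"blocked": False, "triggered": None}

# Usage: prompt_injection (sort_order 2) runs before toxicity (sort_order 3),
# so it is the scanner that blocks here.
rails = [
    {"scanner_type": "toxicity", "action": "block", "sort_order": 3},
    {"scanner_type": "prompt_injection", "action": "block", "sort_order": 2},
]
fake_scan = lambda scanner, text: (
    scanner == "prompt_injection" and "ignore previous" in text
)
result = run_guardrails(rails, "ignore previous instructions", fake_scan)
```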
Scanner Configuration
Each scanner type accepts a configuration object (config) for customization:
ban_substrings
```json
{
  "substrings": ["forbidden phrase", "blocked word"],
  "case_sensitive": false
}
```

ban_topics

```json
{
  "topics": ["politics", "religion", "gambling"]
}
```

pii

```json
{
  "entities": ["email", "phone", "ssn", "credit_card", "address"]
}
```

language

```json
{
  "languages": ["en", "es", "fr"],
  "action_on_mismatch": "block"
}
```

toxicity, harmful, bias, prompt_injection, secrets, invisible_text, malicious_urls, anonymize, no_refusal, sensitive
These scanners typically work with an empty configuration object {} and use their built-in detection models.
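To make the config-driven scanners concrete, here is a minimal sketch of how a ban_substrings check might apply its configuration. The real scanner's implementation may differ; this only illustrates the substrings and case_sensitive options shown above:

```python
# Minimal sketch of ban_substrings matching under the config format
# documented above. Illustrative only; not the real scanner.
def ban_substrings_triggered(text, config):
    case_sensitive = config.get("case_sensitive", False)
    haystack = text if case_sensitive else text.lower()
    for substring in config["substrings"]:
        needle = substring if case_sensitive else substring.lower()
        if needle in haystack:
            return True
    return False

cfg = {"substrings": ["ignore previous instructions"], "case_sensitive": False}
```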
Managing Guardrails
```python
# List guardrails for an agent
guardrails = client.guardrails.list(agent_id="your-agent-id")

# Update a guardrail
client.guardrails.update(
    guardrail_id="guardrail-id",
    action="log",  # Change from block to log
    is_active=True
)

# Disable a guardrail (without deleting)
client.guardrails.update(
    guardrail_id="guardrail-id",
    is_active=False
)

# Delete a guardrail
client.guardrails.delete(guardrail_id="guardrail-id")
```

Guardrail Traces
Every guardrail evaluation produces a guardrail span in the execution trace, recording:
- Which scanner was used
- Whether it triggered
- What action was taken
- The duration of the scan
- Any details about detected content
This provides full visibility into why content was blocked or redacted.
Best Practices
- Layer your guardrails -- Use multiple scanners in combination for defense in depth
- Start with log mode -- Monitor what would be caught before switching to block
- Prioritize injection prevention -- Always run prompt_injection on inputs
- Protect PII -- Use pii or anonymize on outputs to prevent data leakage
- Test with adversarial inputs -- Verify your guardrails catch edge cases
- Monitor guardrail traces -- Review blocked content regularly to tune configurations