Guardrails
Protect agent runs with checks that block, redact, or log unsafe input and output.
Guardrails are scanners that run before or after a model call. They help protect agent workflows from prompt injection, sensitive data exposure, unsafe content, unwanted topics, and other policy risks.
What Are Guardrails?
A guardrail is attached to an agent and inspects content at a specific point in the execution pipeline:
- Input guardrails scan the user's input before it reaches the LLM
- Output guardrails scan the LLM's response before it is returned to the user
Each guardrail has a scanner type, an action to take when triggered, and a sort order that determines the evaluation sequence.
Direction: Input vs Output
| Direction | When It Runs | Purpose |
|---|---|---|
input | Before LLM execution | Validate user input, block prompt injection, detect PII |
output | After LLM execution | Filter harmful responses, redact sensitive data, enforce content policies |
Technical detailsScanner reference
Scanner Types
PromptRails includes 14 built-in scanner types:
Content Safety
| Scanner | Identifier | Description |
|---|---|---|
| Toxicity | toxicity | Detects toxic, abusive, or hateful language in text |
| Harmful Content | harmful | Identifies content that promotes harm, violence, or illegal activities |
| Bias Detection | bias | Detects biased or discriminatory language |
| No Refusal | no_refusal | Ensures the LLM does not refuse to answer (output only) |
Data Protection
| Scanner | Identifier | Description |
|---|---|---|
| PII Detection | pii | Detects personally identifiable information (names, emails, phone numbers, SSNs, etc.) |
| Anonymize | anonymize | Replaces detected PII with placeholder tokens |
| Secrets Detection | secrets | Detects API keys, passwords, tokens, and other secrets in text |
| Sensitive Data | sensitive | Detects broader categories of sensitive information |
Security
| Scanner | Identifier | Description |
|---|---|---|
| Prompt Injection | prompt_injection | Detects attempts to override system instructions or inject malicious prompts |
| Invisible Text | invisible_text | Detects hidden Unicode characters or zero-width text used for injection |
| Malicious URLs | malicious_urls | Detects known malicious, phishing, or suspicious URLs |
Content Filtering
| Scanner | Identifier | Description |
|---|---|---|
| Substring Ban | ban_substrings | Blocks content containing specified banned words or phrases |
| Topic Ban | ban_topics | Blocks content related to specified banned topics |
| Language Detection | language | Ensures content is in the expected language(s) |
Actions
When a guardrail scanner triggers, it takes one of three actions:
| Action | Behavior |
|---|---|
block | Stops execution and returns an error. The LLM response is not delivered. |
redact | Removes or replaces the offending content and continues execution. |
log | Records the detection in the trace but allows execution to continue. |
Configuring Guardrails
Guardrails are configured per agent. Each agent can have multiple guardrails with different scanners, directions, and actions.
Use Studio for day-to-day configuration. Use the SDK when guardrail setup needs to be generated, synced, or promoted by an internal release workflow.
Technical detailsConfigure guardrails with SDKs
Python SDK
# Add an input guardrail for prompt injection
client.guardrails.create(
agent_id="your-agent-id",
type="input",
scanner_type="prompt_injection",
action="block",
sort_order=1,
config={}
)
# Add an output guardrail for PII
client.guardrails.create(
agent_id="your-agent-id",
type="output",
scanner_type="pii",
action="redact",
sort_order=1,
config={
"entities": ["email", "phone", "ssn", "credit_card"]
}
)
# Add a substring ban
client.guardrails.create(
agent_id="your-agent-id",
type="input",
scanner_type="ban_substrings",
action="block",
sort_order=2,
config={
"substrings": ["ignore previous instructions", "system prompt"],
"case_sensitive": False
}
)JavaScript SDK
await client.guardrails.create({
agentId: 'your-agent-id',
type: 'input',
scannerType: 'prompt_injection',
action: 'block',
sortOrder: 1,
config: {},
})
await client.guardrails.create({
agentId: 'your-agent-id',
type: 'output',
scannerType: 'pii',
action: 'redact',
sortOrder: 1,
config: {
entities: ['email', 'phone', 'ssn', 'credit_card'],
},
})Sort Order
Guardrails execute in sort order (ascending) within each direction. Lower numbers execute first.
A typical input guardrail ordering might be:
invisible_text(sort_order: 1) -- Detect hidden characters firstprompt_injection(sort_order: 2) -- Block injection attemptstoxicity(sort_order: 3) -- Filter toxic contentban_substrings(sort_order: 4) -- Apply custom word filters
If a guardrail with block action triggers, subsequent guardrails are not evaluated.
Technical detailsScanner configuration details
Scanner Configuration
Each scanner type accepts a configuration object (config) for customization:
ban_substrings
{
"substrings": ["forbidden phrase", "blocked word"],
"case_sensitive": false
}ban_topics
{
"topics": ["politics", "religion", "gambling"]
}pii
{
"entities": ["email", "phone", "ssn", "credit_card", "address"]
}language
{
"languages": ["en", "es", "fr"],
"action_on_mismatch": "block"
}toxicity, harmful, bias, prompt_injection, secrets, invisible_text, malicious_urls, anonymize, no_refusal, sensitive
These scanners typically work with an empty configuration object {} and use their built-in detection models.
Managing Guardrails
# List guardrails for an agent
guardrails = client.guardrails.list(agent_id="your-agent-id")
# Update a guardrail
client.guardrails.update(
guardrail_id="guardrail-id",
action="log", # Change from block to log
is_active=True
)
# Disable a guardrail (without deleting)
client.guardrails.update(
guardrail_id="guardrail-id",
is_active=False
)
# Delete a guardrail
client.guardrails.delete(guardrail_id="guardrail-id")Guardrail Traces
Every guardrail evaluation produces a guardrail span in the execution trace, recording:
- Which scanner was used
- Whether it triggered
- What action was taken
- The duration of the scan
- Any details about detected content
This provides full visibility into why content was blocked or redacted.
Best Practices
- Layer your guardrails -- Use multiple scanners in combination for defense in depth
- Start with
logmode -- Monitor what would be caught before switching toblock - Prioritize injection prevention -- Always run
prompt_injectionon inputs - Protect PII -- Use
piioranonymizeon outputs to prevent data leakage - Test with adversarial inputs -- Verify your guardrails catch edge cases
- Monitor guardrail traces -- Review blocked content regularly to tune configurations