Guardrails
Protect your agents with 14 built-in scanner types for input and output validation, including toxicity detection, PII filtering, and prompt injection prevention.
Guardrails are safety scanners that validate agent inputs and outputs before and after LLM execution. They protect against harmful content, data leakage, prompt injection, and other risks.
What Are Guardrails?
A guardrail is a scanner attached to an agent that inspects content at a specific point in the execution pipeline:
- Input guardrails scan the user's input before it reaches the LLM
- Output guardrails scan the LLM's response before it is returned to the user
Each guardrail has a scanner type, an action to take when triggered, and a sort order that determines the evaluation sequence.
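The pieces above can be pictured as a small record. This is an illustrative model only, not an actual SDK or internal type; the field names simply mirror the SDK parameters used later on this page:

```python
from dataclasses import dataclass, field

# Illustrative model of a guardrail definition. Field names mirror the
# SDK parameters (type/direction, scanner_type, action, sort_order, config);
# this is not a real PromptRails class.
@dataclass
class Guardrail:
    direction: str        # "input" or "output"
    scanner_type: str     # e.g. "prompt_injection", "pii", "ban_substrings"
    action: str           # "block", "redact", or "log"
    sort_order: int       # lower numbers run first within a direction
    config: dict = field(default_factory=dict)

g = Guardrail("input", "prompt_injection", "block", 1)
```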
Direction: Input vs Output
| Direction | When It Runs | Purpose |
|---|---|---|
| input | Before LLM execution | Validate user input, block prompt injection, detect PII |
| output | After LLM execution | Filter harmful responses, redact sensitive data, enforce content policies |
Scanner Types
PromptRails includes 14 built-in scanner types:
Content Safety
| Scanner | Identifier | Description |
|---|---|---|
| Toxicity | toxicity | Detects toxic, abusive, or hateful language in text |
| Harmful Content | harmful | Identifies content that promotes harm, violence, or illegal activities |
| Bias Detection | bias | Detects biased or discriminatory language |
| No Refusal | no_refusal | Ensures the LLM does not refuse to answer (output only) |
Data Protection
| Scanner | Identifier | Description |
|---|---|---|
| PII Detection | pii | Detects personally identifiable information (names, emails, phone numbers, SSNs, etc.) |
| Anonymize | anonymize | Replaces detected PII with placeholder tokens |
| Secrets Detection | secrets | Detects API keys, passwords, tokens, and other secrets in text |
| Sensitive Data | sensitive | Detects broader categories of sensitive information |
Security
| Scanner | Identifier | Description |
|---|---|---|
| Prompt Injection | prompt_injection | Detects attempts to override system instructions or inject malicious prompts |
| Invisible Text | invisible_text | Detects hidden Unicode characters or zero-width text used for injection |
| Malicious URLs | malicious_urls | Detects known malicious, phishing, or suspicious URLs |
Content Filtering
| Scanner | Identifier | Description |
|---|---|---|
| Substring Ban | ban_substrings | Blocks content containing specified banned words or phrases |
| Topic Ban | ban_topics | Blocks content related to specified banned topics |
| Language Detection | language | Ensures content is in the expected language(s) |
Actions
When a guardrail scanner triggers, it takes one of three actions:
| Action | Behavior |
|---|---|
| block | Stops execution and returns an error. The LLM response is not delivered. |
| redact | Removes or replaces the offending content and continues execution. |
| log | Records the detection in the trace but allows execution to continue. |
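The three behaviors can be sketched as a single dispatch function. This is a hypothetical illustration of the semantics in the table, not PromptRails code; `redact_fn` stands in for whatever replacement a real scanner performs:

```python
# Hypothetical sketch of how each action affects a scan result.
# `redact_fn` is a stand-in for a scanner's real redaction logic.
def apply_action(action, text, triggered, redact_fn=lambda t: "[REDACTED]"):
    if not triggered:
        # Scanner found nothing: content passes through untouched.
        return {"text": text, "blocked": False, "logged": False}
    if action == "block":
        # Execution stops; no content is delivered.
        return {"text": None, "blocked": True, "logged": True}
    if action == "redact":
        # Offending content is replaced and execution continues.
        return {"text": redact_fn(text), "blocked": False, "logged": True}
    if action == "log":
        # Detection is recorded but content passes through unchanged.
        return {"text": text, "blocked": False, "logged": True}
    raise ValueError(f"unknown action: {action}")
```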
Configuring Guardrails
Guardrails are configured per agent. Each agent can have multiple guardrails with different scanners, directions, and actions.
Python SDK
```python
# Add an input guardrail for prompt injection
client.guardrails.create(
    agent_id="your-agent-id",
    type="input",
    scanner_type="prompt_injection",
    action="block",
    sort_order=1,
    config={}
)

# Add an output guardrail for PII
client.guardrails.create(
    agent_id="your-agent-id",
    type="output",
    scanner_type="pii",
    action="redact",
    sort_order=1,
    config={
        "entities": ["email", "phone", "ssn", "credit_card"]
    }
)

# Add a substring ban
client.guardrails.create(
    agent_id="your-agent-id",
    type="input",
    scanner_type="ban_substrings",
    action="block",
    sort_order=2,
    config={
        "substrings": ["ignore previous instructions", "system prompt"],
        "case_sensitive": False
    }
)
```

JavaScript SDK
```javascript
await client.guardrails.create({
  agentId: 'your-agent-id',
  type: 'input',
  scannerType: 'prompt_injection',
  action: 'block',
  sortOrder: 1,
  config: {},
})

await client.guardrails.create({
  agentId: 'your-agent-id',
  type: 'output',
  scannerType: 'pii',
  action: 'redact',
  sortOrder: 1,
  config: {
    entities: ['email', 'phone', 'ssn', 'credit_card'],
  },
})
```

Sort Order
Guardrails execute in sort order (ascending) within each direction. Lower numbers execute first.
A typical input guardrail ordering might be:
- invisible_text (sort_order: 1) -- Detect hidden characters first
- prompt_injection (sort_order: 2) -- Block injection attempts
- toxicity (sort_order: 3) -- Filter toxic content
- ban_substrings (sort_order: 4) -- Apply custom word filters
If a guardrail with block action triggers, subsequent guardrails are not evaluated.
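The ordering and short-circuit behavior can be sketched as a simple loop. This is an illustrative evaluation model, not the actual pipeline; `scan` is a stand-in for real scanner logic, and only the `block` action short-circuits in this sketch:

```python
# Illustrative sketch: evaluate guardrails in ascending sort_order and
# stop as soon as a triggered guardrail has action "block".
# `scan(scanner_type, text)` is a hypothetical predicate standing in
# for real scanner logic.
def run_guardrails(guardrails, text, scan):
    for g in sorted(guardrails, key=lambda g: g["sort_order"]):
        if scan(g["scanner_type"], text):
            if g["action"] == "block":
                return {"blocked": True, "triggered": g["scanner_type"]}
    return {"blocked": False, "triggered": None}

# Usage: prompt_injection (sort_order 2) runs before toxicity (sort_order 3),
# so it is the scanner that blocks here.
rails = [
    {"scanner_type": "toxicity", "action": "block", "sort_order": 3},
    {"scanner_type": "prompt_injection", "action": "block", "sort_order": 2},
]
fake_scan = lambda scanner, text: (
    scanner == "prompt_injection" and "ignore previous" in text
)
result = run_guardrails(rails, "ignore previous instructions", fake_scan)
```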
Scanner Configuration
Each scanner type accepts a configuration object (config) for customization:
ban_substrings
```json
{
  "substrings": ["forbidden phrase", "blocked word"],
  "case_sensitive": false
}
```

ban_topics

```json
{
  "topics": ["politics", "religion", "gambling"]
}
```

pii

```json
{
  "entities": ["email", "phone", "ssn", "credit_card", "address"]
}
```

language

```json
{
  "languages": ["en", "es", "fr"],
  "action_on_mismatch": "block"
}
```

toxicity, harmful, bias, prompt_injection, secrets, invisible_text, malicious_urls, anonymize, no_refusal, sensitive
These scanners typically work with an empty configuration object {} and use their built-in detection models.
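To make the config-driven scanners concrete, here is a minimal sketch of how a ban_substrings check might apply its configuration. The real scanner's implementation may differ; this only illustrates the substrings and case_sensitive options shown above:

```python
# Minimal sketch of ban_substrings matching under the config format
# documented above. Illustrative only; not the real scanner.
def ban_substrings_triggered(text, config):
    case_sensitive = config.get("case_sensitive", False)
    haystack = text if case_sensitive else text.lower()
    for substring in config["substrings"]:
        needle = substring if case_sensitive else substring.lower()
        if needle in haystack:
            return True
    return False

cfg = {"substrings": ["ignore previous instructions"], "case_sensitive": False}
```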
Managing Guardrails
```python
# List guardrails for an agent
guardrails = client.guardrails.list(agent_id="your-agent-id")

# Update a guardrail
client.guardrails.update(
    guardrail_id="guardrail-id",
    action="log",  # Change from block to log
    is_active=True
)

# Disable a guardrail (without deleting)
client.guardrails.update(
    guardrail_id="guardrail-id",
    is_active=False
)

# Delete a guardrail
client.guardrails.delete(guardrail_id="guardrail-id")
```

Guardrail Traces
Every guardrail evaluation produces a guardrail span in the execution trace, recording:
- Which scanner was used
- Whether it triggered
- What action was taken
- The duration of the scan
- Any details about detected content
This provides full visibility into why content was blocked or redacted.
Best Practices
- Layer your guardrails -- Use multiple scanners in combination for defense in depth
- Start with log mode -- Monitor what would be caught before switching to block
- Prioritize injection prevention -- Always run prompt_injection on inputs
- Protect PII -- Use pii or anonymize on outputs to prevent data leakage
- Test with adversarial inputs -- Verify your guardrails catch edge cases
- Monitor guardrail traces -- Review blocked content regularly to tune configurations