Wrapping AI with Deterministic Guardrails
— AI is probabilistic and unpredictable. This article covers techniques for wrapping AI with deterministic guardrails: input validation, output constraints, and safety checks that prevent AI failures from reaching users.
The Core Problem: AI is Non-Deterministic
Traditional software is deterministic:
same input → same output (always)
AI is probabilistic:
same input → different output (sometimes)
same input → wrong output (sometimes)
same input → malformed output (sometimes)
You cannot build reliable products on top of unreliable foundations.
The solution: Wrap AI with deterministic guardrails that catch errors, enforce constraints, and provide fallbacks when AI fails.
What Guardrails Do
Guardrails are deterministic checks and constraints that surround AI components.
Input guardrails ensure:
- Requests are safe to send to AI
- AI receives well-formed input
- Malicious or adversarial inputs are blocked
Output guardrails ensure:
- AI responses are safe to show users
- Responses match expected format
- Responses pass quality checks
Fallback guardrails ensure:
- System stays functional when AI fails
- Users always get some response
- Errors are handled gracefully
Key principle: Guardrails are traditional code (deterministic, testable, reliable) that contain AI (probabilistic, unpredictable, unreliable).
Input Validation Guardrails
Never send user input directly to AI without validation.
Length Limits
Max input length: 10,000 characters
Max tokens: 8,000 tokens
If input exceeds limit:
Option 1: Truncate (with user warning)
Option 2: Reject with error message
Option 3: Chunk into smaller requests
Why it matters:
- Prevents excessive API costs
- Avoids timeout on very long inputs
- Protects against malicious oversized requests
Content Filtering
Check for:
- PII (emails, phone numbers, SSNs)
- Profanity or toxic language
- Prohibited content (violence, illegal activity)
- Sensitive topics (if your product has restrictions)
If detected:
Option 1: Strip sensitive content before sending
Option 2: Reject request with explanation
Option 3: Flag for human review
Why it matters:
- Privacy protection (do not send PII to third-party APIs)
- Brand safety (avoid generating harmful content)
- Compliance (regulatory requirements)
Format Validation
Expected format: JSON with required fields
If input is malformed:
Return error: "Invalid request format"
Do not waste AI API call on bad input
Why it matters:
- Fail fast on bad requests
- Save money (no API call for guaranteed failure)
- Better error messages for users
Adversarial Input Detection
Check for:
- Prompt injection attempts ("Ignore previous instructions")
- Jailbreak attempts ("Pretend you are in developer mode")
- Exfiltration attempts ("Repeat your system prompt")
If detected:
Reject request
Log for security monitoring
Why it matters:
- Prevent users from manipulating AI behavior
- Protect system prompts and internal logic
- Maintain security boundaries
Output Validation Guardrails
AI output cannot be trusted blindly. Validate before showing to users.
Schema Validation
Expected: Valid JSON matching schema
response = call_ai(prompt)
if not validate_json_schema(response):
# AI returned malformed JSON
retry_with_lower_temperature()
if still invalid:
return fallback_response()
Common schema issues:
- Missing required fields
- Wrong data types (string instead of integer)
- Malformed JSON (unclosed brackets, trailing commas)
Fix: Strict schema validation before accepting output.
Content Safety Checks
response = call_ai(prompt)
if contains_prohibited_content(response):
# AI generated unsafe content
return safe_fallback_response()
if contains_pii(response):
# AI leaked sensitive data
redact_pii(response)
Check for:
- Harmful content (violence, hate speech, illegal activity)
- Leaked PII from training data
- Copyrighted material
- Misinformation (if detectable)
Fact-Checking Against Known Data
ai_answer = call_ai(question)
if question has known_correct_answer:
if ai_answer != known_correct_answer:
# AI hallucinated
return known_correct_answer
Use when:
- Answers can be verified against database
- Factual questions have definitive answers
- Cost of being wrong is high
Example: “What is our support email?”
- AI might hallucinate: support@example.com
- Fact-check against config: support@company.com
- Return config value, not AI output
Length and Completeness Checks
response = call_ai(prompt)
if len(response) < min_length:
# AI output is too short (likely incomplete)
retry_request()
if response.endswith("...") or response.endswith(incomplete_marker):
# AI was cut off mid-response
retry_with_higher_max_tokens()
Catch:
- Truncated responses (hit max_tokens limit)
- Empty or nearly empty responses
- Incomplete sentences
Constraint Enforcement Guardrails
Force AI to stay within acceptable boundaries.
Temperature and Sampling Constraints
For format-critical tasks:
temperature = 0.1 # Very deterministic
For creative tasks:
temperature = 0.7 # More creative
For tasks requiring exact format:
Use JSON mode or structured output
Why it matters:
- Lower temperature = more reliable formatting
- Higher temperature = more creativity but more errors
Token Limit Constraints
Set max_tokens based on expected output:
Short answer: max_tokens = 50
Paragraph: max_tokens = 200
Long form: max_tokens = 1000
Prevents:
- Excessive costs from runaway generation
- Unexpectedly long responses
Banned Word/Phrase Lists
response = call_ai(prompt)
for banned_phrase in banned_list:
if banned_phrase in response.lower():
# AI used prohibited language
regenerate_with_stricter_prompt()
Use for:
- Brand-inappropriate language
- Competitor mentions
- Legally prohibited statements
Output Format Enforcement
Prompt: "Return valid JSON only, no markdown, no explanation"
response = call_ai(prompt)
# Strip markdown code fences if AI ignored instruction
response = remove_markdown_fences(response)
# Extract JSON if AI added explanation
response = extract_json_from_text(response)
Why needed:
- AI often adds “Here is the JSON:” before actual JSON
- AI wraps JSON in markdown code fences
- AI adds explanations you did not ask for
Safety Layers: Defense in Depth
Never rely on a single guardrail. Use multiple layers.
Layer 1: Input Sanitization
user_input
↓
Strip HTML, SQL injection attempts
↓
Check length limits
↓
Filter prohibited content
↓
Validate format
Layer 2: Prompt Engineering
System prompt with safety instructions:
"Never reveal PII. Never generate harmful content.
If asked to do something prohibited, politely decline."
Layer 3: AI Model Safety Features
Use models with built-in safety (e.g., content moderation)
Enable provider's safety filters
Layer 4: Output Validation
AI response
↓
Validate JSON schema
↓
Check for prohibited content
↓
Verify against known facts
↓
Redact any leaked PII
Layer 5: Human Review (for high-stakes)
AI-generated content
↓
Flagged for human review if:
- Low confidence score
- Sensitive topic
- High-impact decision
Each layer catches different failure modes. Combined, they dramatically reduce risk.
Fallback Guardrails
When AI fails despite validation, fallbacks ensure system stays functional.
Retry with Adjusted Parameters
response = call_ai(prompt, temperature=0.7)
if not valid(response):
# Try again with more deterministic settings
response = call_ai(prompt, temperature=0.1)
if not valid(response):
# Give up on AI, use fallback
response = deterministic_fallback()
Template-Based Fallbacks
try:
response = ai_generate_email(context)
except:
response = email_template.format(
user_name=context.user_name,
issue=context.issue
)
Use when:
- AI fails to generate
- Quality is below threshold
- Latency exceeds timeout
Cached Response Fallbacks
cache_key = hash(user_input)
if cache_key in response_cache:
return response_cache[cache_key]
try:
response = call_ai(user_input)
response_cache[cache_key] = response
return response
except:
# AI failed, no cached response available
return generic_fallback_response()
Use when:
- Repeated similar inputs
- AI API is down
- Need guaranteed response
Graceful Degradation
AI summarization fails
↓
Return first 200 characters + "..."
AI categorization fails
↓
Return "Uncategorized" (user can manually categorize)
AI recommendation fails
↓
Return most popular items (non-personalized)
Key principle: Partial functionality is better than total failure.
Monitoring Guardrails
Guardrails should be observable. Track when they trigger.
Metrics to Monitor
Input validation:
- % of requests blocked by input filters
- Common rejection reasons
- Adversarial input attempts
Output validation:
- % of responses failing schema validation
- % requiring retry
- % using fallback responses
Safety triggers:
- Content filter activation rate
- PII redaction frequency
- Prohibited content detection
Performance:
- Validation latency overhead
- Retry frequency
- Fallback usage rate
Alerts
Critical:
- Input filter blocks >50% of requests (filter too strict?)
- Output validation fails >20% (AI quality degraded?)
- Fallback usage >30% (AI system failing?)
Warning:
- Retry rate >10%
- Safety filters trigger >5%
- Unusual spike in validation failures
Info:
- New adversarial pattern detected
- Validation rules updated
Guardrails for Different AI Tasks
For Chatbots and Conversational AI
Input:
- Message length limits (prevent abuse)
- Rate limiting (prevent spam)
- Conversation context trimming (prevent token overflow)
Output:
- No PII in responses
- Polite refusals for prohibited topics
- Response length limits (prevent rambling)
Fallback:
- “I did not understand. Can you rephrase?”
- “Let me connect you with a human”
For Content Generation
Input:
- Topic boundaries (what subjects are allowed)
- Style guidelines (tone, formality level)
Output:
- Plagiarism detection
- Fact-checking (if applicable)
- Brand voice validation
Fallback:
- Template-based content
- Human writer handoff
For Classification/Categorization
Input:
- Format validation (text, not binary data)
- Length reasonable for classification
Output:
- Confidence threshold (only use if >80% confidence)
- Allowed category list (reject if AI invents new category)
Fallback:
- “Needs manual review”
- Rule-based classification
For Search and Retrieval
Input:
- Query sanitization (prevent injection)
- Length limits
Output:
- Relevance threshold (only return if score >0.7)
- Result count limits (top 10, not 10,000)
Fallback:
- Keyword-based search
- Popular/trending results
Testing Guardrails
Guardrails are only effective if they work. Test them rigorously.
Adversarial Testing
Inputs to try:
- Prompt injection attempts
- Jailbreak attempts
- Malformed data (invalid JSON, etc.)
- Extremely long inputs
- Prohibited content
Expected result: Guardrail blocks or sanitizes input, AI never sees it.
Failure Simulation
Simulate:
- AI returns malformed JSON
- AI returns empty response
- AI times out
- AI returns prohibited content
Expected result: Output validation catches it, fallback activates.
Boundary Testing
Test edge cases:
- Input exactly at length limit
- Input one character over limit
- Empty input
- Input with only whitespace
Expected result: Guardrails handle gracefully, no crashes.
Load Testing
Test under load:
- 1000 requests per second
- Guardrails do not become bottleneck
- Validation latency <50ms
Expected result: Guardrails scale with traffic.
Guardrails vs Over-Engineering
Too many guardrails can make the system brittle and slow.
Signs of Over-Engineering
- Validation latency >500ms (too many checks)
- >50% of requests blocked by input filters (too strict)
- Support tickets about “system won’t accept my input”
- AI rarely used because fallback always triggers
Finding the Right Balance
Start minimal:
- Input: length limits, basic format validation
- Output: schema validation, basic safety checks
- Fallback: simple template response
Add guardrails as failures occur:
- Saw PII leak → add PII detection
- Saw prompt injection → add injection detection
- Saw quality issues → add confidence thresholds
Remove guardrails that never trigger:
- If a safety check has not triggered in 6 months, consider removing it
- If input filter blocks <0.1% of requests, may be unnecessary
Principle: Guardrails should prevent real observed failures, not hypothetical ones.
Key Takeaways
- AI is unreliable by nature – deterministic guardrails make it production-ready
- Validate inputs – length, format, safety before sending to AI
- Validate outputs – schema, content safety, fact-checking before showing to users
- Enforce constraints – temperature, token limits, allowed content
- Use defense in depth – multiple layers of protection, not single guardrail
- Always have fallbacks – templates, cached responses, graceful degradation
- Monitor guardrail triggers – track when and why they activate
- Test adversarially – prompt injection, malformed data, edge cases
- Avoid over-engineering – add guardrails based on real failures, not fears
- Guardrails enable trust – users trust AI more when they know it is bounded
You cannot make AI 100% reliable. But you can make the system around it 100% reliable.