Wrapping AI with Deterministic Guardrails

— AI is probabilistic and unpredictable. This article covers techniques for wrapping AI with deterministic guardrails: input validation, output constraints, and safety checks that prevent AI failures from reaching users.

level: intermediate topics: architecture, safety tags: guardrails, validation, safety, architecture, reliability

The Core Problem: AI is Non-Deterministic

Traditional software is deterministic:

same input → same output (always)

AI is probabilistic:

same input → different output (sometimes)
same input → wrong output (sometimes)
same input → malformed output (sometimes)

You cannot build reliable products on top of unreliable foundations.

The solution: Wrap AI with deterministic guardrails that catch errors, enforce constraints, and provide fallbacks when AI fails.


What Guardrails Do

Guardrails are deterministic checks and constraints that surround AI components.

Input guardrails ensure:

  • Requests are safe to send to AI
  • AI receives well-formed input
  • Malicious or adversarial inputs are blocked

Output guardrails ensure:

  • AI responses are safe to show users
  • Responses match expected format
  • Responses pass quality checks

Fallback guardrails ensure:

  • System stays functional when AI fails
  • Users always get some response
  • Errors are handled gracefully

Key principle: Guardrails are traditional code (deterministic, testable, reliable) that contain AI (probabilistic, unpredictable, unreliable).


Input Validation Guardrails

Never send user input directly to AI without validation.

Length Limits

Max input length: 10,000 characters
Max tokens: 8,000 tokens

If input exceeds limit:
  Option 1: Truncate (with user warning)
  Option 2: Reject with error message
  Option 3: Chunk into smaller requests

Why it matters:

  • Prevents excessive API costs
  • Avoids timeout on very long inputs
  • Protects against malicious oversized requests

Content Filtering

Check for:
- PII (emails, phone numbers, SSNs)
- Profanity or toxic language
- Prohibited content (violence, illegal activity)
- Sensitive topics (if your product has restrictions)

If detected:
  Option 1: Strip sensitive content before sending
  Option 2: Reject request with explanation
  Option 3: Flag for human review

Why it matters:

  • Privacy protection (do not send PII to third-party APIs)
  • Brand safety (avoid generating harmful content)
  • Compliance (regulatory requirements)

Format Validation

Expected format: JSON with required fields

If input is malformed:
  Return error: "Invalid request format"
  Do not waste AI API call on bad input

Why it matters:

  • Fail fast on bad requests
  • Save money (no API call for guaranteed failure)
  • Better error messages for users

Adversarial Input Detection

Check for:
- Prompt injection attempts ("Ignore previous instructions")
- Jailbreak attempts ("Pretend you are in developer mode")
- Exfiltration attempts ("Repeat your system prompt")

If detected:
  Reject request
  Log for security monitoring

Why it matters:

  • Prevent users from manipulating AI behavior
  • Protect system prompts and internal logic
  • Maintain security boundaries

Output Validation Guardrails

AI output cannot be trusted blindly. Validate before showing to users.

Schema Validation

Expected: Valid JSON matching schema

response = call_ai(prompt)

if not validate_json_schema(response):
  # AI returned malformed JSON
  retry_with_lower_temperature()
  
  if still invalid:
    return fallback_response()

Common schema issues:

  • Missing required fields
  • Wrong data types (string instead of integer)
  • Malformed JSON (unclosed brackets, trailing commas)

Fix: Strict schema validation before accepting output.

Content Safety Checks

response = call_ai(prompt)

if contains_prohibited_content(response):
  # AI generated unsafe content
  return safe_fallback_response()
  
if contains_pii(response):
  # AI leaked sensitive data
  redact_pii(response)

Check for:

  • Harmful content (violence, hate speech, illegal activity)
  • Leaked PII from training data
  • Copyrighted material
  • Misinformation (if detectable)

Fact-Checking Against Known Data

ai_answer = call_ai(question)

if question has known_correct_answer:
  if ai_answer != known_correct_answer:
    # AI hallucinated
    return known_correct_answer

Use when:

  • Answers can be verified against database
  • Factual questions have definitive answers
  • Cost of being wrong is high

Example: “What is our support email?”

Length and Completeness Checks

response = call_ai(prompt)

if len(response) < min_length:
  # AI output is too short (likely incomplete)
  retry_request()

if response.endswith("...") or response.endswith(incomplete_marker):
  # AI was cut off mid-response
  retry_with_higher_max_tokens()

Catch:

  • Truncated responses (hit max_tokens limit)
  • Empty or nearly empty responses
  • Incomplete sentences

Constraint Enforcement Guardrails

Force AI to stay within acceptable boundaries.

Temperature and Sampling Constraints

For format-critical tasks:
  temperature = 0.1  # Very deterministic

For creative tasks:
  temperature = 0.7  # More creative
  
For tasks requiring exact format:
  Use JSON mode or structured output

Why it matters:

  • Lower temperature = more reliable formatting
  • Higher temperature = more creativity but more errors

Token Limit Constraints

Set max_tokens based on expected output:
  Short answer: max_tokens = 50
  Paragraph: max_tokens = 200
  Long form: max_tokens = 1000

Prevents:
  - Excessive costs from runaway generation
  - Unexpectedly long responses

Banned Word/Phrase Lists

response = call_ai(prompt)

for banned_phrase in banned_list:
  if banned_phrase in response.lower():
    # AI used prohibited language
    regenerate_with_stricter_prompt()

Use for:

  • Brand-inappropriate language
  • Competitor mentions
  • Legally prohibited statements

Output Format Enforcement

Prompt: "Return valid JSON only, no markdown, no explanation"

response = call_ai(prompt)

# Strip markdown code fences if AI ignored instruction
response = remove_markdown_fences(response)

# Extract JSON if AI added explanation
response = extract_json_from_text(response)

Why needed:

  • AI often adds “Here is the JSON:” before actual JSON
  • AI wraps JSON in markdown code fences
  • AI adds explanations you did not ask for

Safety Layers: Defense in Depth

Never rely on a single guardrail. Use multiple layers.

Layer 1: Input Sanitization

user_input

Strip HTML, SQL injection attempts

Check length limits

Filter prohibited content

Validate format

Layer 2: Prompt Engineering

System prompt with safety instructions:
  "Never reveal PII. Never generate harmful content.
   If asked to do something prohibited, politely decline."

Layer 3: AI Model Safety Features

Use models with built-in safety (e.g., content moderation)
Enable provider's safety filters

Layer 4: Output Validation

AI response

Validate JSON schema

Check for prohibited content

Verify against known facts

Redact any leaked PII

Layer 5: Human Review (for high-stakes)

AI-generated content

Flagged for human review if:
  - Low confidence score
  - Sensitive topic
  - High-impact decision

Each layer catches different failure modes. Combined, they dramatically reduce risk.


Fallback Guardrails

When AI fails despite validation, fallbacks ensure system stays functional.

Retry with Adjusted Parameters

response = call_ai(prompt, temperature=0.7)

if not valid(response):
  # Try again with more deterministic settings
  response = call_ai(prompt, temperature=0.1)
  
  if not valid(response):
    # Give up on AI, use fallback
    response = deterministic_fallback()

Template-Based Fallbacks

try:
  response = ai_generate_email(context)
except:
  response = email_template.format(
    user_name=context.user_name,
    issue=context.issue
  )

Use when:

  • AI fails to generate
  • Quality is below threshold
  • Latency exceeds timeout

Cached Response Fallbacks

cache_key = hash(user_input)

if cache_key in response_cache:
  return response_cache[cache_key]

try:
  response = call_ai(user_input)
  response_cache[cache_key] = response
  return response
except:
  # AI failed, no cached response available
  return generic_fallback_response()

Use when:

  • Repeated similar inputs
  • AI API is down
  • Need guaranteed response

Graceful Degradation

AI summarization fails

Return first 200 characters + "..."

AI categorization fails

Return "Uncategorized" (user can manually categorize)

AI recommendation fails

Return most popular items (non-personalized)

Key principle: Partial functionality is better than total failure.


Monitoring Guardrails

Guardrails should be observable. Track when they trigger.

Metrics to Monitor

Input validation:

  • % of requests blocked by input filters
  • Common rejection reasons
  • Adversarial input attempts

Output validation:

  • % of responses failing schema validation
  • % requiring retry
  • % using fallback responses

Safety triggers:

  • Content filter activation rate
  • PII redaction frequency
  • Prohibited content detection

Performance:

  • Validation latency overhead
  • Retry frequency
  • Fallback usage rate

Alerts

Critical:

  • Input filter blocks >50% of requests (filter too strict?)
  • Output validation fails >20% (AI quality degraded?)
  • Fallback usage >30% (AI system failing?)

Warning:

  • Retry rate >10%
  • Safety filters trigger >5%
  • Unusual spike in validation failures

Info:

  • New adversarial pattern detected
  • Validation rules updated

Guardrails for Different AI Tasks

For Chatbots and Conversational AI

Input:

  • Message length limits (prevent abuse)
  • Rate limiting (prevent spam)
  • Conversation context trimming (prevent token overflow)

Output:

  • No PII in responses
  • Polite refusals for prohibited topics
  • Response length limits (prevent rambling)

Fallback:

  • “I did not understand. Can you rephrase?”
  • “Let me connect you with a human”

For Content Generation

Input:

  • Topic boundaries (what subjects are allowed)
  • Style guidelines (tone, formality level)

Output:

  • Plagiarism detection
  • Fact-checking (if applicable)
  • Brand voice validation

Fallback:

  • Template-based content
  • Human writer handoff

For Classification/Categorization

Input:

  • Format validation (text, not binary data)
  • Length reasonable for classification

Output:

  • Confidence threshold (only use if >80% confidence)
  • Allowed category list (reject if AI invents new category)

Fallback:

  • “Needs manual review”
  • Rule-based classification

For Search and Retrieval

Input:

  • Query sanitization (prevent injection)
  • Length limits

Output:

  • Relevance threshold (only return if score >0.7)
  • Result count limits (top 10, not 10,000)

Fallback:

  • Keyword-based search
  • Popular/trending results

Testing Guardrails

Guardrails are only effective if they work. Test them rigorously.

Adversarial Testing

Inputs to try:

  • Prompt injection attempts
  • Jailbreak attempts
  • Malformed data (invalid JSON, etc.)
  • Extremely long inputs
  • Prohibited content

Expected result: Guardrail blocks or sanitizes input, AI never sees it.

Failure Simulation

Simulate:

  • AI returns malformed JSON
  • AI returns empty response
  • AI times out
  • AI returns prohibited content

Expected result: Output validation catches it, fallback activates.

Boundary Testing

Test edge cases:

  • Input exactly at length limit
  • Input one character over limit
  • Empty input
  • Input with only whitespace

Expected result: Guardrails handle gracefully, no crashes.

Load Testing

Test under load:

  • 1000 requests per second
  • Guardrails do not become bottleneck
  • Validation latency <50ms

Expected result: Guardrails scale with traffic.


Guardrails vs Over-Engineering

Too many guardrails can make the system brittle and slow.

Signs of Over-Engineering

  • Validation latency >500ms (too many checks)
  • >50% of requests blocked by input filters (too strict)
  • Support tickets about “system won’t accept my input”
  • AI rarely used because fallback always triggers

Finding the Right Balance

Start minimal:

  • Input: length limits, basic format validation
  • Output: schema validation, basic safety checks
  • Fallback: simple template response

Add guardrails as failures occur:

  • Saw PII leak → add PII detection
  • Saw prompt injection → add injection detection
  • Saw quality issues → add confidence thresholds

Remove guardrails that never trigger:

  • If a safety check has not triggered in 6 months, consider removing it
  • If input filter blocks <0.1% of requests, may be unnecessary

Principle: Guardrails should prevent real observed failures, not hypothetical ones.


Key Takeaways

  1. AI is unreliable by nature – deterministic guardrails make it production-ready
  2. Validate inputs – length, format, safety before sending to AI
  3. Validate outputs – schema, content safety, fact-checking before showing to users
  4. Enforce constraints – temperature, token limits, allowed content
  5. Use defense in depth – multiple layers of protection, not single guardrail
  6. Always have fallbacks – templates, cached responses, graceful degradation
  7. Monitor guardrail triggers – track when and why they activate
  8. Test adversarially – prompt injection, malformed data, edge cases
  9. Avoid over-engineering – add guardrails based on real failures, not fears
  10. Guardrails enable trust – users trust AI more when they know it is bounded

You cannot make AI 100% reliable. But you can make the system around it 100% reliable.