Handling PII and Sensitive Data in LLM Systems

LLMs process user data, and that data often includes personally identifiable information. Mishandle it, and you violate regulations and lose user trust. Here's how to do it right.

Level: intermediate
Topics: security, privacy, compliance
Tags: pii, privacy, gdpr, compliance, security

Your customer support chatbot sees users’ names, email addresses, phone numbers, and account details. Your AI coding assistant processes proprietary source code. Your medical AI handles health records.

All of this is sensitive data. If it leaks, you face legal liability (GDPR, HIPAA, CCPA violations), reputational damage, and loss of user trust.

LLM systems create new privacy challenges because they send data to third-party APIs, log extensively for debugging, and produce outputs that might inadvertently contain sensitive information.

Here’s how to handle it responsibly.

Understanding What Counts as PII

Directly identifying information: Names, email addresses, phone numbers, social security numbers, passport numbers. Obvious PII.

Indirectly identifying information: IP addresses, device IDs, location data. Alone they might not identify someone, but combined with other data, they can.

Sensitive categories: Health information, financial data, biometric data, children’s data. These have stricter protections under regulations like HIPAA and COPPA.

Contextual PII: “The user from accounting who filed an HR complaint” might be identifying even without a name.

Assume that anything users provide could be PII. Design your system to protect it.

The LLM Provider Data Problem

When you send data to OpenAI, Anthropic, or Google, you’re sharing it with a third party. What happens to it?

Training data concerns: Most providers now offer options to opt out of training on your data, but you must enable this explicitly. Check your provider’s data usage policies.

Data retention: How long does the provider keep your data? Some keep it for around 30 days for abuse monitoring; others offer zero-retention options, typically only on enterprise plans.

Data location: Where are the servers? If you’re in the EU and data is processed in the US, that might violate GDPR unless the provider has appropriate safeguards (like Standard Contractual Clauses).

Subprocessors: Does your provider use subprocessors (other companies that handle data on their behalf)? You’re responsible for their practices too.

Read provider agreements carefully. Use enterprise plans with stronger privacy guarantees for sensitive use cases.

Data Minimization

The best way to protect PII is to not collect it in the first place.

Collect only what’s needed: Don’t ask for user email if you don’t need it. Don’t process full documents if summaries work.

Strip PII before processing: If you’re summarizing support tickets, remove names and contact info first. The LLM doesn’t need them.

Use pseudonymization: Replace real names with “User A” and “User B” before sending to the LLM, and re-map them on the way out if needed (see the sketch after this list).

Synthetic data for testing: Use fake data in development and testing. Never test with real user data unless absolutely necessary.
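A minimal sketch of the pseudonymization idea, assuming an in-memory mapping and a caller that already knows which names to mask. A production system would persist the mapping in an encrypted store and detect names automatically:

```python
class Pseudonymizer:
    """Swaps real names for stable pseudonyms and reverses the mapping on output."""

    def __init__(self):
        self.forward: dict[str, str] = {}  # real name -> pseudonym
        self.reverse: dict[str, str] = {}  # pseudonym -> real name

    def mask(self, text: str, names: list[str]) -> str:
        for name in names:
            if name not in self.forward:
                # "User A", "User B", ... (a sketch; breaks past 26 names)
                pseudonym = f"User {chr(ord('A') + len(self.forward))}"
                self.forward[name] = pseudonym
                self.reverse[pseudonym] = name
            text = text.replace(name, self.forward[name])
        return text

    def unmask(self, text: str) -> str:
        for pseudonym, name in self.reverse.items():
            text = text.replace(pseudonym, name)
        return text

p = Pseudonymizer()
prompt = p.mask("Alice asked Bob to approve the refund.", names=["Alice", "Bob"])
# prompt == "User A asked User B to approve the refund."
# Send `prompt` to the LLM; run p.unmask() on the response before showing it.
```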

PII Detection and Filtering

Before sending data to an LLM, scan it for PII.

Pattern matching: Use regex to detect email addresses, phone numbers, credit card numbers, SSNs.

Named Entity Recognition (NER): Use NER models to detect names, locations, organizations. These are more sophisticated than regex but still imperfect.

Redaction: Replace detected PII with placeholders. “My email is john@example.com” becomes “My email is [EMAIL].” A minimal implementation is sketched below.

Challenges: False positives (flagging “The company Apple” as a person name) and false negatives (missing PII in unusual formats) are common. Test thoroughly.
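As a starting point, here is a minimal regex-based redactor combining the pattern-matching and redaction steps above. The patterns are illustrative, not exhaustive; they will miss international phone formats, names, and anything else regex can't express:

```python
import re

# Illustrative patterns only; extend with NER for names, locations, etc.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def redact(text: str) -> str:
    """Replace detected PII with typed placeholders like [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("My email is john@example.com, call me at 555-123-4567."))
# -> "My email is [EMAIL], call me at [PHONE]."
```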

On-Premise and Self-Hosted Models

If you can’t send sensitive data to third-party APIs, host models yourself.

Pros: Full control over data. No third-party exposure. Compliance-friendly.

Cons: Infrastructure cost, operational complexity, and you’re responsible for model security and updates.

When it’s necessary: Healthcare (HIPAA), finance (PCI-DSS), government (FedRAMP), or any domain with strict data residency requirements.

Hybrid approach: Use external APIs for non-sensitive tasks, self-hosted models for sensitive ones.
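One way to express that routing, as a sketch. The three helper functions are placeholders, not a real API; swap in your own inference clients and PII detector:

```python
def contains_pii(text: str) -> bool:
    # Placeholder: plug in your regex/NER detector here.
    return "@" in text

def call_self_hosted(prompt: str) -> str:
    # Placeholder for a local inference endpoint; data stays on your infrastructure.
    return f"(self-hosted) {prompt}"

def call_external(prompt: str) -> str:
    # Placeholder for a third-party API client.
    return f"(external API) {prompt}"

def route(prompt: str, sensitive_task: bool = False) -> str:
    """Send sensitive work to the self-hosted model, everything else outside."""
    if sensitive_task or contains_pii(prompt):
        return call_self_hosted(prompt)
    return call_external(prompt)
```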

Logging and Debugging Trade-offs

You need logs to debug LLM systems. But logs often contain PII.

Selective logging: Log requests without PII (metadata, token counts, model versions) and only log full prompts when debugging specific issues; a sketch follows at the end of this section.

Log redaction: Automatically redact PII from logs before storing. This is imperfect but better than nothing.

Short retention: Delete detailed logs after a few days. Keep aggregated, anonymized metrics longer.

Access controls: Limit who can access logs containing PII. Audit log access.

Secure storage: Encrypt logs at rest and in transit. Use dedicated, compliant storage for sensitive data.
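A sketch of selective logging: record metadata and a prompt hash for correlation, never the prompt itself:

```python
import hashlib
import json
import logging

logger = logging.getLogger("llm_requests")

def log_request(prompt: str, model: str, input_tokens: int,
                output_tokens: int, latency_ms: float) -> None:
    """Log debugging metadata without storing the prompt.

    The truncated hash lets you correlate repeated or problematic
    requests without retaining their content.
    """
    logger.info(json.dumps({
        "model": model,
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest()[:16],
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_ms": latency_ms,
    }))
```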

Transparency and User Rights

Users have a right to know what you’re doing with their data.

Clear disclosures: Tell users that their input will be processed by an LLM and potentially sent to third-party providers.

Opt-in for sensitive features: Don’t default to sending data to LLMs without user awareness. Let them choose.

Data access requests: Under GDPR and CCPA, users can request copies of their data. Can you retrieve and provide it? Design systems with this in mind.

Data deletion requests: Users can request deletion. Can you delete their data from logs, databases, and caches? Ensure you have deletion workflows.
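Deletion is easiest when it's a single workflow fanning out to every store that might hold user data. A sketch, with `Store` standing in for real database, log, and cache clients:

```python
class Store:
    """Stand-in for a real database, log store, or cache client."""

    def __init__(self, name: str):
        self.name = name

    def delete_by_user(self, user_id: str) -> int:
        # Placeholder: issue the real DELETE / purge call here
        # and return the number of records removed.
        return 0

STORES = [Store("database"), Store("logs"), Store("cache")]

def delete_user_data(user_id: str) -> dict[str, int]:
    """Run one deletion workflow across every place user data can live."""
    return {store.name: store.delete_by_user(user_id) for store in STORES}
```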

Output Privacy Risks

LLMs can leak PII in outputs.

Regurgitation: The LLM might repeat PII from the input in its response. “You asked about John Doe at john@example.com” leaks the email.

Cross-contamination: In multi-tenant systems, ensure user A’s data never appears in user B’s responses. Strict session isolation is critical.

Unintentional disclosure: The LLM might infer and state sensitive information. “Based on your medical history…” reveals that the user has a medical history, even if not explicitly mentioned.

Output scanning: Check LLM outputs for PII before showing to users. Redact if found.
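Output scanning can reuse the same detector stack as input filtering. A minimal sketch using a single email pattern; a real system would run the full regex-plus-NER pipeline:

```python
import logging
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def safe_output(response: str) -> str:
    """Redact PII from a model response before display, and flag hits."""
    redacted = EMAIL_RE.sub("[EMAIL]", response)
    if redacted != response:
        # Regurgitated PII often points to a prompt-design bug worth fixing.
        logging.warning("PII redacted from model output")
    return redacted
```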

Compliance Frameworks

Different regulations have different requirements.

GDPR (EU): Requires lawful basis for processing (consent, legitimate interest), data minimization, user rights (access, deletion), and data processing agreements with third parties.

HIPAA (US healthcare): Requires Business Associate Agreements (BAAs) with LLM providers, encryption, audit logs, and access controls.

CCPA (California): Requires transparency about data use, opt-out rights, and deletion on request.

COPPA (children’s data): Requires parental consent for processing data of children under 13.

Work with legal and compliance teams to understand which regulations apply to your use case and ensure your LLM system meets requirements.

Anonymization vs. Pseudonymization

Anonymization: Removing identifying information such that the data can never be re-identified. Extremely hard to do correctly. If you can re-identify the data later (for example, via a mapping table), it’s not truly anonymized.

Pseudonymization: Replacing identifiers with pseudonyms (like “User A”). This reduces risk but still counts as personal data under GDPR.

Most LLM use cases need pseudonymization (so you can map responses back to users), not full anonymization. Understand the distinction and legal implications.

Data Processing Agreements

If you use third-party LLM providers, you need Data Processing Agreements (DPAs).

What they cover: The provider’s obligations for handling your data, security measures, breach notification, data deletion, and compliance with regulations.

Enterprise plans: Most providers offer DPAs on enterprise plans but not on free or basic plans. Budget accordingly.

Review carefully: Don’t just sign boilerplate agreements. Ensure they meet your compliance requirements.

Incident Response for Data Breaches

Despite best efforts, breaches happen.

Detection: Monitor for unusual data access patterns, unauthorized API calls, or leaked data in outputs.

Containment: Immediately revoke access, disable compromised credentials, and halt affected systems.

Notification: GDPR requires notifying the supervisory authority within 72 hours of becoming aware of a breach. HIPAA and state privacy laws impose their own notification requirements and deadlines.

Remediation: Fix the vulnerability, assess the scope of exposure, and notify affected users.

Have an incident response plan before you need it.

Special Considerations for Multi-Tenant Systems

If you serve multiple customers (SaaS model), tenant isolation is critical.

Strict data separation: Customer A’s data must never mix with Customer B’s data. Use separate databases, separate LLM sessions, or strong logical partitioning.

Session isolation: Each user session should be isolated. Clear conversation history between sessions.

Access control testing: Regularly test that users can’t access other users’ data through prompt manipulation or other exploits.
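A minimal sketch of the separation idea: key every conversation by (tenant, session), with the tenant ID taken from the authenticated request context rather than from user input:

```python
from collections import defaultdict

class ConversationStore:
    """Keeps each (tenant, session) history in its own bucket."""

    def __init__(self):
        self._histories: dict[tuple[str, str], list[dict]] = defaultdict(list)

    def append(self, tenant_id: str, session_id: str, message: dict) -> None:
        self._histories[(tenant_id, session_id)].append(message)

    def get(self, tenant_id: str, session_id: str) -> list[dict]:
        # tenant_id must come from the auth layer, never from the prompt,
        # so a crafted prompt can't reach another tenant's history.
        return list(self._histories[(tenant_id, session_id)])

    def clear_session(self, tenant_id: str, session_id: str) -> None:
        self._histories.pop((tenant_id, session_id), None)
```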

Principle of Least Exposure

Minimize scope: Only send to the LLM the minimum data needed to answer the query. Don’t send entire databases when a summary would suffice.

Temporary processing: If possible, process data transiently (in-memory) and discard immediately after generating a response.

No long-term storage in prompts: Don’t persist user data in system prompts or long-term agent memory unless absolutely necessary.
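Minimizing scope can be as simple as an allow-list applied before the prompt is built. A sketch, with the field names purely illustrative:

```python
# Only these fields ever reach the prompt; everything else is dropped.
ALLOWED_FIELDS = {"order_id", "status", "last_update"}

def build_context(record: dict) -> str:
    """Project a record down to the minimum fields the model needs."""
    return "\n".join(
        f"{key}: {value}" for key, value in record.items()
        if key in ALLOWED_FIELDS
    )

print(build_context({
    "order_id": "A-1001", "status": "shipped",
    "last_update": "2024-05-01", "email": "jane@example.com",
}))
# email is excluded; only order_id, status, and last_update reach the model.
```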

What Good Looks Like

A privacy-conscious LLM system:

  • Uses data minimization and PII redaction before LLM processing
  • Chooses LLM providers with strong privacy commitments and DPAs
  • Logs selectively and redacts PII from logs
  • Implements output scanning to prevent PII leakage
  • Provides transparency to users about data use
  • Supports user rights (access, deletion)
  • Complies with relevant regulations (GDPR, HIPAA, CCPA)
  • Has incident response plans for breaches
  • Regularly audits data handling practices

Privacy isn’t a feature you add at the end. It’s a fundamental design constraint. Build it in from the start, and you’ll avoid painful compliance issues and user trust breaches later.

Treat user data as toxic: minimize contact, handle it with care, and dispose of it properly. Your users, and your legal team, will thank you.