Cernis

New Model

Privacy-First PII Detection

November 2, 2025 · Announcement

Announcing Sentinel-PII: A Fast, Accurate Open Source PII Detection Model Built on IBM Granite 4.0

Privacy protection is critical in today's data-driven world. Whether you're processing customer support tickets, medical records, or financial documents, identifying and redacting personally identifiable information (PII) is essential for compliance with regulations like GDPR, HIPAA, and CCPA.

We set out to create a lightweight, accurate PII detection model that could handle diverse document types while being fast enough for production use. The result is Sentinel-PII, a specialized model fine-tuned on IBM's Granite 4.0 architecture that identifies and tags 20 different categories of sensitive information.

Technical architecture

Sentinel-PII was trained on a carefully curated dataset combining real-world examples from the ai4privacy/pii-masking-300k dataset with synthetically generated examples covering edge cases and diverse formats.

Sentinel-PII is built on IBM's Granite 4.0 Hybrid Micro model, which combines:

  • Mamba state-space layers: For efficient long-range context modeling
  • Transformer attention: For precise token-level predictions
  • Shared MLP layers: For parameter efficiency

Key innovations: Traditional PII detection → Sentinel-PII

We made several notable improvements over traditional rule-based and regex-based PII detection approaches:

Comprehensive PII coverage: Sentinel-PII detects 20 categories of PII including names, addresses, phone numbers, email addresses, credit card information, medical conditions, passwords, and more — all in a single unified model.

Context-aware detection: Unlike regex-based systems, Sentinel-PII understands context. It can tell whether "John Smith" refers to a person or to a company, and it recognizes PII even when it is formatted in unusual ways.

Parameter-efficient fine-tuning: Using LoRA (Low-Rank Adaptation) on Granite 4.0's hybrid architecture, we fine-tuned only a small fraction of parameters while achieving high accuracy across all PII categories.
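To give a sense of why LoRA trains only a small fraction of parameters, the sketch below counts the trainable parameters for a single adapted weight matrix. The layer size and rank are illustrative assumptions, not the values used to train Sentinel-PII, and the helper name is hypothetical.

```python
def lora_trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Fraction of parameters trained when a frozen d_out x d_in weight
    matrix W is adapted as W + B @ A, with B: d_out x rank and A: rank x d_in."""
    full_params = d_out * d_in           # frozen base weights
    lora_params = rank * (d_out + d_in)  # only A and B receive gradients
    return lora_params / full_params


# Hypothetical layer size and rank, chosen only for illustration.
fraction = lora_trainable_fraction(d_out=2048, d_in=2048, rank=16)
print(f"{fraction:.2%} of this layer's parameters are trainable")
```

With these assumed numbers, the adapter adds about 1.56% of the layer's parameter count, which is the kind of ratio that keeps fine-tuning cheap.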

Standardized output format: All PII is tagged using a consistent bracketed [CATEGORY] format (for example, [PERSON_NAME]), making it easy to integrate into existing data pipelines and redaction workflows.
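Because the tags follow a fixed bracketed pattern, downstream code can pick them out with a single regular expression. The following is a minimal sketch and not part of the Sentinel-PII release: the helper name count_tags is hypothetical, and accepting an optional "PII:" prefix is an assumption about how the tag may be spelled.

```python
import re
from collections import Counter

# Matches [PERSON_NAME] as well as [PII:PERSON_NAME]; the optional prefix is
# an assumption so the sketch works with either spelling of the tag.
TAG_PATTERN = re.compile(r"\[(?:PII:)?([A-Z_]+)\]")


def count_tags(tagged_text: str) -> Counter:
    """Count how many times each PII category appears in tagged output."""
    return Counter(TAG_PATTERN.findall(tagged_text))


tagged = "My name is [PERSON_NAME] and I live at [STREET_ADDRESS]. Email: [EMAIL_ADDRESS]"
print(count_tags(tagged))
```

A per-category count like this is a simple way to log what was redacted without storing the sensitive values themselves.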

Fast convergence: The model achieved strong performance in a single training epoch, demonstrating efficient learning across diverse PII types and document formats.

Model comparison: Speed vs. performance tradeoffs

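Across a variety of test documents, Sentinel-PII combined strong detection accuracy (per-category recall between 93.2% and 99.3% for the categories reported below) with inference speeds of roughly 50-100 tokens per second, which is fast enough for production use.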

Example 1: Customer support interaction

Sentinel-PII successfully identifies multiple PII types in a typical customer support message, including names, contact information, and account details.

Input: "My name is John Smith and I live at 123 Main St. Email: john@email.com, Phone: (555) 123-4567"

Output: "My name is [PERSON_NAME] and I live at [STREET_ADDRESS]. Email: [EMAIL_ADDRESS], Phone: [PHONE_NUMBER]" ✓

Example 2: Medical record

In this case, Sentinel-PII correctly identifies sensitive medical information including patient identifiers, dates of birth, and medical conditions — critical for HIPAA compliance.

Input: "Patient: Sarah Johnson, DOB: 1985-03-15, SSN: 123-45-6789, Diagnosis: Type 2 Diabetes"

Output: "Patient: [PERSON_NAME], DOB: [DATE_OF_BIRTH], SSN: [PERSONAL_ID], Diagnosis: [MEDICAL_CONDITION]" ✓

Example 3: Account credentials

This example shows where Sentinel-PII excels: detecting authentication credentials and financial information that traditional regex systems might miss due to formatting variations.

Input: "Username: mike.williams, Password: MyP@ssw0rd123, Credit Card: 4532-1234-5678-9010"

Output: "Username: [PERSON_NAME], Password: [PASSWORD], Credit Card: [CREDIT_CARD_INFO]" ✓
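In a redaction workflow it is often useful to know which original values were replaced. The sketch below aligns a tagged output with its input to recover (category, value) pairs. It assumes the model copies all non-PII text verbatim and uses the bracketed tags shown in the examples above; recover_spans is a hypothetical helper, not part of the model's API.

```python
import re

TAG = re.compile(r"\[([A-Z_]+)\]")


def recover_spans(original: str, tagged: str) -> list[tuple[str, str]]:
    """Align tagged output with its input and return (category, value) pairs.

    Returns an empty list when the two strings cannot be aligned.
    """
    # Splitting on a capturing group yields: literal, category, literal, ...
    pieces = TAG.split(tagged)
    pattern = "".join(
        "(.+?)" if i % 2 else re.escape(piece) for i, piece in enumerate(pieces)
    )
    match = re.fullmatch(pattern, original, flags=re.DOTALL)
    if match is None:
        return []
    categories = pieces[1::2]
    return list(zip(categories, match.groups()))


original = ("My name is John Smith and I live at 123 Main St. "
            "Email: john@email.com, Phone: (555) 123-4567")
tagged = ("My name is [PERSON_NAME] and I live at [STREET_ADDRESS]. "
          "Email: [EMAIL_ADDRESS], Phone: [PHONE_NUMBER]")
for category, value in recover_spans(original, tagged):
    print(f"{category}: {value}")
```

This alignment is ambiguous when two tags are adjacent or when the separator text also occurs inside a PII value, so a production pipeline would need a more robust strategy.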

Performance Metrics

Evaluated on the AI4Privacy PII-masking-300k dataset:

Category-Specific Recall Rates

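| Group | Category | Recall | Description |
|---|---|---|---|
| Critical PII | PERSONAL_ID | 98.5% | SSN, national IDs |
| Critical PII | DATE_OF_BIRTH | 98.2% | Birth dates |
| Critical PII | CREDIT_CARD_INFO | 97.8% | Credit card numbers |
| Critical PII | PASSWORD | 96.9% | Passwords |
| Identity | PERSON_NAME | 95.4% | Personal names |
| Identity | EMAIL_ADDRESS | 97.2% | Email addresses |
| Identity | PHONE_NUMBER | 96.5% | Phone numbers |
| Identity | USERNAME | 94.8% | User identifiers |
| Location | STREET_ADDRESS | 96.5% | Physical addresses |
| Location | POSTCODE | 99.3% | ZIP/postal codes |
| Location | CITY | 97.6% | City names |
| Location | COUNTRY | 96.1% | Country names |
| Medical | MEDICAL_CONDITION | 93.2% | Health information |
| Organization | ORGANIZATION_NAME | 94.7% | Company names |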
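For reference, recall for a category is the share of true PII entities in that category that the model actually tagged: TP / (TP + FN). The sketch below computes it from lists of (category, value) pairs. The exact-match criterion is an assumption, since the matching rule used in this evaluation is not stated here, and the data is a toy example invented for illustration.

```python
from collections import Counter


def per_category_recall(gold, predicted):
    """Recall per category: detected gold entities / all gold entities.

    `gold` and `predicted` are lists of (category, value) pairs.
    """
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    found = Counter()
    total = Counter()
    for (category, value), n in gold_counts.items():
        total[category] += n
        # A gold entity counts as found only if it was predicted with the
        # same category and the same value (exact match, by assumption).
        found[category] += min(n, pred_counts[(category, value)])
    return {category: found[category] / total[category] for category in total}


# Toy data invented for illustration; not drawn from the evaluation set.
gold = [("PERSON_NAME", "Sarah Johnson"), ("PERSON_NAME", "John Smith"),
        ("EMAIL_ADDRESS", "john@email.com")]
predicted = [("PERSON_NAME", "Sarah Johnson"), ("EMAIL_ADDRESS", "john@email.com")]
print(per_category_recall(gold, predicted))
```

Recall alone does not penalize false positives, so a full evaluation would normally report precision alongside it.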

Dataset composition

We constructed a balanced 1,500-sample dataset from two sources:

| Dataset | Samples | Source | Coverage |
|---|---|---|---|
| ai4privacy PII Masking | 1,000 | Real-world examples | 23 PII categories from diverse contexts |
| Synthetic Generation (Faker) | 500 | Programmatically generated | Edge cases, multiple PII types per example |
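As a rough illustration of how synthetic pairs with multiple PII types can be produced, the sketch below fills text templates with sample values and emits the matching tagged target. It is not the generation script used for Sentinel-PII: small hand-written value pools stand in for Faker so the example has no third-party dependency, and the templates and names are hypothetical.

```python
import random

# Hand-written value pools stand in for Faker providers here; all values
# are made up for illustration.
POOLS = {
    "PERSON_NAME": ["Sarah Johnson", "John Smith"],
    "EMAIL_ADDRESS": ["sarah@example.com", "john@example.org"],
    "PHONE_NUMBER": ["(555) 123-4567", "555-987-6543"],
}
TEMPLATES = [
    "Contact {PERSON_NAME} at {EMAIL_ADDRESS} or {PHONE_NUMBER}.",
    "{PERSON_NAME} can be reached on {PHONE_NUMBER}.",
]


def make_example(rng: random.Random) -> tuple[str, str]:
    """Return one (input_text, tagged_target) training pair."""
    template = rng.choice(TEMPLATES)
    values = {category: rng.choice(pool) for category, pool in POOLS.items()}
    tags = {category: f"[{category}]" for category in POOLS}
    # str.format ignores keys a template does not use.
    return template.format(**values), template.format(**tags)


rng = random.Random(0)  # fixed seed so the generated pairs are reproducible
text, target = make_example(rng)
print(text)
print(target)
```

Filling the same template twice, once with values and once with tags, guarantees that the input and its target stay aligned.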

PII categories covered

Sentinel-PII detects 20 standardized PII categories:

  • Identity: person_name, username, personal_id, other_id
  • Contact: email_address, phone_number, street_address, domain_name
  • Demographics: age, date_of_birth, gender, nationality, demographic_group, religious_affiliation
  • Financial: credit_card_info, banking_number
  • Medical: medical_condition
  • Organizations: organization_name
  • Credentials: password, secure_credential
  • Temporal: date

Try Sentinel-PII

We're releasing Sentinel-PII as an open-source model for anyone to try out, explore, or build on. The model is available on the Hugging Face Hub in multiple formats for different deployment scenarios.

Use cases

Sentinel-PII is ideal for:

  • Data anonymization: Redact PII before sharing datasets for analysis or ML training
  • Compliance automation: Automatically detect PII in documents for GDPR/HIPAA compliance
  • Customer support: Mask sensitive information in support tickets and chat logs
  • Document processing: Identify PII in scanned documents, PDFs, and images (when combined with OCR)
  • Data loss prevention: Monitor outgoing communications for accidental PII exposure

Comparison with existing solutions

| Feature | Sentinel-PII | Presidio | AWS Comprehend | Regex-based |
|---|---|---|---|---|
| Accuracy | High (context-aware) | Medium-High | High | Low-Medium |
| Speed | Fast (~50-100 tok/s) | Fast | Fast (API) | Very Fast |
| Cost | Free (self-hosted) | Free (self-hosted) | Pay-per-use | Free |
| Offline | | | | |
| Customizable | ✓ (fine-tune) | ✓ (rules) | Limited | |
| PII Categories | 20 | 15+ | 20+ | Depends |
| Context-aware | | Limited | | |
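To translate the quoted throughput into latency, the sketch below estimates how long one document takes at roughly 50-100 tokens per second. It assumes the figure applies to the tagged text the model produces, which is about as long as the input. The function name and the 2,000-token document are hypothetical.

```python
def processing_time_range(num_tokens: int,
                          slow_tps: float = 50.0,
                          fast_tps: float = 100.0) -> tuple[float, float]:
    """Return (best_case_seconds, worst_case_seconds) for one document."""
    return num_tokens / fast_tps, num_tokens / slow_tps


best, worst = processing_time_range(2_000)
print(f"2,000 tokens: {best:.0f}-{worst:.0f} s")
```

Under these assumptions a 2,000-token document takes 20 to 40 seconds, so high-volume pipelines would likely need batching or several model replicas.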