Announcing Sentinel-PII: A Fast, Accurate Open Source PII Detection Model Built on IBM Granite 4.0

Privacy protection is critical in today's data-driven world. Whether you're processing customer support tickets, medical records, or financial documents, identifying and redacting personally identifiable information (PII) is essential for compliance with regulations like GDPR, HIPAA, and CCPA.

We set out to create a lightweight, accurate PII detection model that could handle diverse document types while being fast enough for production use. The result is Sentinel-PII, a specialized model fine-tuned on IBM's Granite 4.0 architecture that identifies and tags 20 different categories of sensitive information.

Technical architecture

Sentinel-PII was trained on a carefully curated dataset combining real-world examples from the ai4privacy/pii-masking-300k dataset with synthetically generated examples covering edge cases and diverse formats.

Sentinel-PII is built on IBM's Granite 4.0 Hybrid Micro model, which combines:

Mamba state-space layers: For efficient long-range context modeling
Transformer attention: For precise token-level predictions
Shared MLP layers: For parameter efficiency

Key innovations: Traditional PII detection → Sentinel-PII

We made several notable improvements over traditional rule-based and regex-based PII detection approaches:

Comprehensive PII coverage: Sentinel-PII detects 20 categories of PII including names, addresses, phone numbers, email addresses, credit card information, medical conditions, passwords, and more — all in a single unified model.

Context-aware detection: Unlike regex-based systems, Sentinel-PII uses the surrounding context. It can tell whether "John Smith" refers to a person or to a company, and it recognizes PII even when it is formatted in unusual ways.
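To make the contrast concrete, here is a minimal sketch of the kind of regex-only baseline this point refers to. The two patterns and the `regex_redact` helper are illustrative assumptions, not part of Sentinel-PII or any particular library. Pattern matching catches well-formed emails and one phone layout, but it has no way to recognize a name and it misses a phone number written with dots.

```python
import re

# Hypothetical baseline patterns: one email shape and one US phone layout.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\(\d{3}\)\s?\d{3}-\d{4}")


def regex_redact(text: str) -> str:
    """Replace matches of the fixed patterns with bracketed category tags."""
    text = EMAIL_RE.sub("[EMAIL_ADDRESS]", text)
    text = PHONE_RE.sub("[PHONE_NUMBER]", text)
    return text


if __name__ == "__main__":
    message = (
        "My name is John Smith and I live at 123 Main St. "
        "Email: john@email.com, Phone: (555) 123-4567"
    )
    # The name and street address pass through untouched.
    print(regex_redact(message))
    # A differently formatted phone number is not matched at all.
    print(regex_redact("Call 555.123.4567"))
```

Adding more patterns narrows the gap for structured fields, but names, medical conditions, and free-form credentials have no fixed shape for a pattern to match.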

Parameter-efficient fine-tuning: Using LoRA (Low-Rank Adaptation) on Granite 4.0's hybrid architecture, we fine-tuned only a small fraction of parameters while achieving high accuracy across all PII categories.
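The idea behind LoRA can be sketched in a few lines of plain Python. This is a generic illustration of the technique, not the training code used for Sentinel-PII: the frozen weight matrix W is left untouched and only two small matrices A and B are trained, so the effective weight is W + (alpha / rank) * B * A. The dimensions, rank, and alpha value below are hypothetical, since the post does not state the settings that were used.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """Compute W x + (alpha / rank) * B (A x).

    W is d_out x d_in and stays frozen; A is rank x d_in and B is
    d_out x rank, and only these two are trained.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_counts(d_in, d_out, rank):
    """Return (full, adapter, fraction) parameter counts for one layer."""
    full = d_in * d_out
    adapter = rank * (d_in + d_out)
    return full, adapter, adapter / full


if __name__ == "__main__":
    # Hypothetical layer size and rank, chosen only for illustration.
    full, adapter, fraction = lora_param_counts(4096, 4096, 8)
    print(full, adapter, fraction)
```

With B initialized to zeros, the adapted layer starts out identical to the base layer, so fine-tuning begins from the pretrained model's behavior.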

Standardized output format: All PII is tagged using a consistent [PII:CATEGORY] format, making it easy to integrate into existing data pipelines and redaction workflows.
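Because the tags are plain bracketed text, downstream code can pull them out with a small parser. The sketch below is an assumption about how a consumer might do that; `extract_tags` and `count_tags` are hypothetical helpers, not part of the release. The pattern accepts both the [PII:CATEGORY] form described here and the bare [CATEGORY] form that appears in the examples in this post.

```python
import re
from collections import Counter

# Accepts "[PII:PERSON_NAME]" as well as "[PERSON_NAME]" (assumed tag shapes).
TAG_PATTERN = re.compile(r"\[(?:PII:)?([A-Z_]+)\]")


def extract_tags(tagged_text: str) -> list:
    """Return the category names of all tags, in order of appearance."""
    return TAG_PATTERN.findall(tagged_text)


def count_tags(tagged_text: str) -> Counter:
    """Count how many times each category was tagged."""
    return Counter(extract_tags(tagged_text))


if __name__ == "__main__":
    output = (
        "My name is [PERSON_NAME] and I live at [STREET_ADDRESS]. "
        "Email: [EMAIL_ADDRESS], Phone: [PHONE_NUMBER]"
    )
    print(extract_tags(output))
    print(count_tags(output))
```

A pipeline could use the counts to route documents, for example by flagging anything that contains a PASSWORD or CREDIT_CARD_INFO tag for stricter handling.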

Fast convergence: The model achieved strong performance in a single training epoch, demonstrating efficient learning across diverse PII types and document formats.

Model comparison: Speed vs. performance tradeoffs

Across a variety of test documents, Sentinel-PII showed strong detection accuracy while maintaining fast inference speeds suitable for production use.

Example 1: Customer support interaction

Sentinel-PII successfully identifies multiple PII types in a typical customer support message, including names, contact information, and account details.

Input: "My name is John Smith and I live at 123 Main St. Email: john@email.com, Phone: (555) 123-4567"

Output: "My name is [PERSON_NAME] and I live at [STREET_ADDRESS]. Email: [EMAIL_ADDRESS], Phone: [PHONE_NUMBER]" ✓

Example 2: Medical record

In this case, Sentinel-PII correctly identifies sensitive medical information including patient identifiers, dates of birth, and medical conditions — critical for HIPAA compliance.

Input: "Patient: Sarah Johnson, DOB: 1985-03-15, SSN: 123-45-6789, Diagnosis: Type 2 Diabetes"

Output: "Patient: [PERSON_NAME], DOB: [DATE_OF_BIRTH], SSN: [PERSONAL_ID], Diagnosis: [MEDICAL_CONDITION]" ✓

Example 3: Account credentials

This example shows where Sentinel-PII excels: detecting authentication credentials and financial information that traditional regex systems might miss due to formatting variations.

Input: "Username: mike.williams, Password: MyP@ssw0rd123, Credit Card: 4532-1234-5678-9010"

Output: "Username: [PERSON_NAME], Password: [PASSWORD], Credit Card: [CREDIT_CARD_INFO]" ✓

Performance Metrics

Evaluated on the AI4Privacy PII-masking-300k dataset:

Category-Specific Recall Rates

Category	Recall	Description
Critical PII
PERSONAL_ID	98.5%	SSN, national IDs
DATE_OF_BIRTH	98.2%	Birth dates
CREDIT_CARD_INFO	97.8%	Credit card numbers
PASSWORD	96.9%	Passwords
Identity
PERSON_NAME	95.4%	Personal names
EMAIL_ADDRESS	97.2%	Email addresses
PHONE_NUMBER	96.5%	Phone numbers
USERNAME	94.8%	User identifiers
Location
STREET_ADDRESS	96.5%	Physical addresses
POSTCODE	99.3%	ZIP/postal codes
CITY	97.6%	City names
COUNTRY	96.1%	Country names
Medical
MEDICAL_CONDITION	93.2%	Health information
Organization
ORGANIZATION_NAME	94.7%	Company names
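For readers who want to reproduce this kind of table on their own data, recall per category is the number of gold entities the model found divided by the total number of gold entities in that category. The sketch below assumes exact span matching, which may differ from the scoring used for the numbers above (the post does not say how matches were counted), and the toy annotations are made up for illustration.

```python
from collections import defaultdict


def per_category_recall(gold, predicted):
    """Compute recall per category under exact-match scoring.

    Both arguments are iterables of (doc_id, start, end, category) tuples.
    A gold entity counts as found only if an identical tuple was predicted.
    """
    predicted = set(predicted)
    found = defaultdict(int)
    total = defaultdict(int)
    for entity in gold:
        category = entity[3]
        total[category] += 1
        if entity in predicted:
            found[category] += 1
    return {category: found[category] / total[category] for category in total}


if __name__ == "__main__":
    # Toy annotations, not taken from the evaluation set.
    gold = [
        ("doc1", 0, 10, "PERSON_NAME"),
        ("doc1", 20, 35, "EMAIL_ADDRESS"),
        ("doc2", 5, 15, "PERSON_NAME"),
    ]
    predicted = [
        ("doc1", 0, 10, "PERSON_NAME"),
        ("doc1", 20, 35, "EMAIL_ADDRESS"),
        ("doc2", 5, 14, "PERSON_NAME"),  # boundary off by one, counted as a miss
    ]
    print(per_category_recall(gold, predicted))
```

Recall alone does not penalize over-tagging, so it is worth tracking precision alongside it when comparing detectors.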

Dataset composition

We constructed a balanced 1,500-sample dataset from two sources:

Dataset	Samples	Source	Coverage
ai4privacy PII Masking	1,000	Real-world examples	23 PII categories from diverse contexts
Synthetic Generation (Faker)	500	Programmatically generated	Edge cases, multiple PII types per example
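The synthetic half of the dataset pairs raw text with its tagged counterpart. The sketch below shows one way such pairs can be produced from a template. It deliberately uses only Python's standard `random` module with tiny hard-coded value lists instead of Faker, so the names, template, and `make_example` helper are all illustrative assumptions and not the generation script used for Sentinel-PII.

```python
import random

# Small hypothetical value pools; a real generator would draw from Faker.
FIRST_NAMES = ["Ana", "Liam", "Priya", "Chen"]
LAST_NAMES = ["Garcia", "Okafor", "Novak", "Tanaka"]
DOMAINS = ["example.com", "example.org"]

TEMPLATE = "Contact {PERSON_NAME} at {EMAIL_ADDRESS} or {PHONE_NUMBER}."


def make_example(rng: random.Random):
    """Return a (source, target) pair with several PII types in one sentence."""
    first, last = rng.choice(FIRST_NAMES), rng.choice(LAST_NAMES)
    values = {
        "PERSON_NAME": f"{first} {last}",
        "EMAIL_ADDRESS": f"{first.lower()}.{last.lower()}@{rng.choice(DOMAINS)}",
        "PHONE_NUMBER": (
            f"({rng.randint(200, 999)}) "
            f"{rng.randint(200, 999)}-{rng.randint(0, 9999):04d}"
        ),
    }
    source = TEMPLATE.format(**values)
    # The target replaces every value with its bracketed category tag.
    target = TEMPLATE.format(**{key: f"[{key}]" for key in values})
    return source, target


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so runs are repeatable
    for _ in range(3):
        print(make_example(rng))
```

Filling the same template twice, once with values and once with tags, keeps the source and target aligned by construction, which avoids labeling errors in the synthetic portion.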

PII categories covered

Sentinel-PII detects 20 standardized PII categories:

Identity: person_name, username, personal_id, other_id
Contact: email_address, phone_number, street_address, domain_name
Demographics: age, date_of_birth, gender, nationality, demographic_group, religious_affiliation
Financial: credit_card_info, banking_number
Medical: medical_condition
Organizations: organization_name
Credentials: password, secure_credential
Temporal: date

Try Sentinel-PII

We're releasing it as an open-source model for anyone to try out, explore, or build on. The model is available on Hugging Face Hub in multiple formats for different deployment scenarios.

Use cases

Sentinel-PII is ideal for:

Data anonymization: Redact PII before sharing datasets for analysis or ML training
Compliance automation: Automatically detect PII in documents for GDPR/HIPAA compliance
Customer support: Mask sensitive information in support tickets and chat logs
Document processing: Identify PII in scanned documents, PDFs, and images (when combined with OCR)
Data loss prevention: Monitor outgoing communications for accidental PII exposure

Comparison with existing solutions

Feature	Sentinel-PII	Presidio	AWS Comprehend	Regex-based
Accuracy	High (context-aware)	Medium-High	High	Low-Medium
Speed	Fast (~50-100 tok/s)	Fast	Fast (API)	Very Fast
Cost	Free (self-hosted)	Free (self-hosted)	Pay-per-use	Free
Offline	✓	✓	✗	✓
Customizable	✓ (fine-tune)	✓ (rules)	Limited	✓
PII Categories	20	15+	20+	Depends
Context-aware	✓	Limited	✓	✗
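The throughput figure in the table can be turned into a rough per-document latency estimate. This sketch assumes the ~50-100 tok/s number refers to generated tokens and that the tagged output is about as long as the input, neither of which the post states explicitly. The `estimate_seconds` helper is hypothetical.

```python
def estimate_seconds(output_tokens: int, slow_tps: float = 50.0, fast_tps: float = 100.0):
    """Return (best_case, worst_case) seconds to generate output_tokens."""
    return output_tokens / fast_tps, output_tokens / slow_tps


if __name__ == "__main__":
    # A 1,000-token document, assuming the tagged output has a similar length.
    best, worst = estimate_seconds(1000)
    print(f"{best:.0f}-{worst:.0f} seconds")
```

For high-volume workloads, an estimate like this helps decide whether to batch documents or to pre-filter with a cheaper detector before calling the model.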
