New Model
Privacy-First PII Detection
Announcing Sentinel-PII: A Fast, Accurate Open Source PII Detection Model Built on IBM Granite 4.0
Privacy protection is critical in today's data-driven world. Whether you're processing customer support tickets, medical records, or financial documents, identifying and redacting personally identifiable information (PII) is essential for compliance with regulations like GDPR, HIPAA, and CCPA.
We set out to create a lightweight, accurate PII detection model that could handle diverse document types while being fast enough for production use. The result is Sentinel-PII, a specialized model fine-tuned on IBM's Granite 4.0 architecture that identifies and tags 20 different categories of sensitive information.
Technical architecture
Sentinel-PII was trained on a carefully curated dataset combining real-world examples from the ai4privacy/pii-masking-300k dataset with synthetically generated examples covering edge cases and diverse formats.
Sentinel-PII is built on IBM's Granite 4.0 Hybrid Micro model, which combines:
- Mamba state-space layers: For efficient long-range context modeling
- Transformer attention: For precise token-level predictions
- Shared MLP layers: For parameter efficiency
Key innovations: Traditional PII detection → Sentinel-PII
We made several notable improvements over traditional rule-based and regex-based PII detection approaches:
Comprehensive PII coverage: Sentinel-PII detects 20 categories of PII including names, addresses, phone numbers, email addresses, credit card information, medical conditions, passwords, and more — all in a single unified model.
Context-aware detection: Unlike regex-based systems, Sentinel-PII uses the surrounding context. It can tell whether "John Smith" refers to a person or to a company, and it recognizes PII even when it is formatted in unusual ways.
Parameter-efficient fine-tuning: Using LoRA (Low-Rank Adaptation) on Granite 4.0's hybrid architecture, we fine-tuned only a small fraction of parameters while achieving high accuracy across all PII categories.
Standardized output format: All PII is tagged using a consistent [PII:CATEGORY] format, making it easy to integrate into existing data pipelines and redaction workflows.
Fast convergence: The model achieved strong performance in a single training epoch, demonstrating efficient learning across diverse PII types and document formats.
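To give a rough sense of what "a small fraction of parameters" means under LoRA, here is a back-of-the-envelope sketch. LoRA freezes a weight matrix of shape d_out x d_in and trains two low-rank factors of shapes d_out x r and r x d_in. The layer sizes and rank below are illustrative assumptions, not Sentinel-PII's actual configuration, which this post does not state.

```python
def lora_trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Trainable LoRA parameters relative to one frozen weight matrix."""
    frozen = d_in * d_out              # parameters in the frozen matrix W
    trainable = rank * (d_in + d_out)  # factors A (rank x d_in) and B (d_out x rank)
    return trainable / frozen

# Hypothetical sizes, chosen for illustration only.
fraction = lora_trainable_fraction(d_in=2048, d_out=2048, rank=16)
print(f"{fraction:.4%}")  # prints 1.5625%
```

For a square matrix the ratio simplifies to 2r/d, so the trainable share shrinks as layers get wider or the rank gets smaller.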
Model comparison: Speed vs. performance tradeoffs
Across a variety of test documents, Sentinel-PII showed strong detection accuracy while maintaining fast inference speeds suitable for production use.
Example 1: Customer support interaction
Sentinel-PII successfully identifies multiple PII types in a typical customer support message, including names, contact information, and account details.
Input: "My name is John Smith and I live at 123 Main St. Email: john@email.com, Phone: (555) 123-4567"
Output: "My name is [PERSON_NAME] and I live at [STREET_ADDRESS]. Email: [EMAIL_ADDRESS], Phone: [PHONE_NUMBER]" ✓
Example 2: Medical record
In this case, Sentinel-PII correctly identifies sensitive medical information including patient identifiers, dates of birth, and medical conditions — critical for HIPAA compliance.
Input: "Patient: Sarah Johnson, DOB: 1985-03-15, SSN: 123-45-6789, Diagnosis: Type 2 Diabetes"
Output: "Patient: [PERSON_NAME], DOB: [DATE_OF_BIRTH], SSN: [PERSONAL_ID], Diagnosis: [MEDICAL_CONDITION]" ✓
Example 3: Account credentials
This example shows where Sentinel-PII excels: detecting authentication credentials and financial information that traditional regex systems might miss due to formatting variations.
Input: "Username: mike.williams, Password: MyP@ssw0rd123, Credit Card: 4532-1234-5678-9010"
Output: "Username: [PERSON_NAME], Password: [PASSWORD], Credit Card: [CREDIT_CARD_INFO]" ✓
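Because the output uses bracketed category tags, downstream code can audit a redacted document with a simple pattern match. The helper below is a minimal sketch, not part of the released model: `count_tags` and `TAG_PATTERN` are hypothetical names, and the pattern accepts both the `[CATEGORY]` form shown in the examples and the `[PII:CATEGORY]` form mentioned earlier.

```python
import re
from collections import Counter

# Accepts [PERSON_NAME] as well as [PII:PERSON_NAME].
TAG_PATTERN = re.compile(r"\[(?:PII:)?([A-Z][A-Z_]*)\]")

def count_tags(redacted_text: str) -> Counter:
    """Count how often each PII category tag appears in redacted output."""
    return Counter(TAG_PATTERN.findall(redacted_text))

output = (
    "Patient: [PERSON_NAME], DOB: [DATE_OF_BIRTH], "
    "SSN: [PERSONAL_ID], Diagnosis: [MEDICAL_CONDITION]"
)
counts = count_tags(output)
print(sum(counts.values()))  # prints 4
```

A per-category count like this is one way to feed redaction logs or dashboards without storing the original sensitive values.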
Performance Metrics
Evaluated on the ai4privacy/pii-masking-300k dataset:
Category-Specific Recall Rates
| Category | Recall | Description |
|---|---|---|
| Critical PII | | |
| PERSONAL_ID | 98.5% | SSN, national IDs |
| DATE_OF_BIRTH | 98.2% | Birth dates |
| CREDIT_CARD_INFO | 97.8% | Credit card numbers |
| PASSWORD | 96.9% | Passwords |
| Identity | | |
| PERSON_NAME | 95.4% | Personal names |
| EMAIL_ADDRESS | 97.2% | Email addresses |
| PHONE_NUMBER | 96.5% | Phone numbers |
| USERNAME | 94.8% | User identifiers |
| Location | | |
| STREET_ADDRESS | 96.5% | Physical addresses |
| POSTCODE | 99.3% | ZIP/postal codes |
| CITY | 97.6% | City names |
| COUNTRY | 96.1% | Country names |
| Medical | | |
| MEDICAL_CONDITION | 93.2% | Health information |
| Organization | | |
| ORGANIZATION_NAME | 94.7% | Company names |
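This post does not spell out how recall was computed. A common definition is the share of gold-standard entities in a category that the model actually found, that is, TP / (TP + FN). The sketch below assumes annotations are available as (category, text) pairs and uses exact matching; both the function and the toy data are invented for illustration.

```python
from collections import defaultdict

def recall_by_category(gold, predicted):
    """Per-category recall: gold entities found / all gold entities."""
    predicted_set = set(predicted)
    found = defaultdict(int)
    total = defaultdict(int)
    for category, text in gold:
        total[category] += 1
        if (category, text) in predicted_set:
            found[category] += 1
    return {category: found[category] / total[category] for category in total}

# Toy annotations, made up for this example.
gold = [
    ("PERSON_NAME", "Sarah Johnson"),
    ("PERSON_NAME", "John Smith"),
    ("EMAIL_ADDRESS", "john@email.com"),
]
predicted = [
    ("PERSON_NAME", "Sarah Johnson"),
    ("EMAIL_ADDRESS", "john@email.com"),
]
print(recall_by_category(gold, predicted))
# prints {'PERSON_NAME': 0.5, 'EMAIL_ADDRESS': 1.0}
```

Recall only measures missed entities. A redaction system that tags too much would still score well here, so precision is worth tracking alongside it.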
Dataset composition
We constructed a balanced 1,500-sample dataset from two sources:
| Dataset | Samples | Source | Coverage |
|---|---|---|---|
| ai4privacy PII Masking | 1,000 | Real-world examples | 23 PII categories from diverse contexts |
| Synthetic Generation (Faker) | 500 | Programmatically generated | Edge cases, multiple PII types per example |
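As a rough illustration of programmatic generation, the sketch below builds one training pair from a template: the input contains concrete values and the target contains the matching tags. To stay dependency-free it substitutes small hard-coded value pools for Faker, so the pools, the template, and `make_example` are all hypothetical stand-ins rather than the actual pipeline.

```python
import random

# Hypothetical value pools; the table above lists Faker as the real source.
VALUES = {
    "PERSON_NAME": ["Ana Torres", "Wei Chen", "Omar Haddad"],
    "EMAIL_ADDRESS": ["ana@example.com", "wei@example.org", "omar@example.net"],
    "PHONE_NUMBER": ["(555) 010-0199", "555-010-0142", "+1 555 010 0175"],
}
TEMPLATE = "Contact {PERSON_NAME} at {EMAIL_ADDRESS} or {PHONE_NUMBER}."

def make_example(rng: random.Random) -> tuple[str, str]:
    """Return (input_text, target_text); the target swaps each value for its tag."""
    chosen = {category: rng.choice(pool) for category, pool in VALUES.items()}
    input_text = TEMPLATE.format(**chosen)
    target_text = TEMPLATE.format(**{category: f"[{category}]" for category in VALUES})
    return input_text, target_text

source, target = make_example(random.Random(0))
print(source)
print(target)  # prints Contact [PERSON_NAME] at [EMAIL_ADDRESS] or [PHONE_NUMBER].
```

Packing several categories into one template mirrors the "multiple PII types per example" goal listed for the synthetic split.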
PII categories covered
Sentinel-PII detects 20 standardized PII categories:
- Identity: person_name, username, personal_id, other_id
- Contact: email_address, phone_number, street_address, domain_name
- Demographics: age, date_of_birth, gender, nationality, demographic_group, religious_affiliation
- Financial: credit_card_info, banking_number
- Medical: medical_condition
- Organizations: organization_name
- Credentials: password, secure_credential
- Temporal: date
Try Sentinel-PII
We're releasing Sentinel-PII as an open-source model for anyone to try out, explore, or build on. The model is available on the Hugging Face Hub in multiple formats for different deployment scenarios.
Use cases
Sentinel-PII is ideal for:
- Data anonymization: Redact PII before sharing datasets for analysis or ML training
- Compliance automation: Automatically detect PII in documents for GDPR/HIPAA compliance
- Customer support: Mask sensitive information in support tickets and chat logs
- Document processing: Identify PII in scanned documents, PDFs, and images (when combined with OCR)
- Data loss prevention: Monitor outgoing communications for accidental PII exposure
Comparison with existing solutions
| Feature | Sentinel-PII | Presidio | AWS Comprehend | Regex-based |
|---|---|---|---|---|
| Accuracy | High (context-aware) | Medium-High | High | Low-Medium |
| Speed | Fast (~50-100 tok/s) | Fast | Fast (API) | Very Fast |
| Cost | Free (self-hosted) | Free (self-hosted) | Pay-per-use | Free |
| Offline | ✓ | ✓ | ✗ | ✓ |
| Customizable | ✓ (fine-tune) | ✓ (rules) | Limited | ✓ |
| PII Categories | 20 | 15+ | 20+ | Depends |
| Context-aware | ✓ | Limited | ✓ | ✗ |
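To make the "Regex-based" column concrete, here is a minimal regex redactor run on the customer support message from Example 1. The two patterns are deliberately simplified assumptions, not production-grade rules. They catch the well-structured email address and phone number, but nothing in them can recognize the name or the street address.

```python
import re

# Simplified, illustrative patterns; real-world formats vary far more than this.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE_NUMBER": re.compile(r"\(\d{3}\)\s*\d{3}-\d{4}"),
}

def regex_redact(text: str) -> str:
    """Replace every pattern match with its bracketed category tag."""
    for category, pattern in PATTERNS.items():
        text = pattern.sub(f"[{category}]", text)
    return text

message = (
    "My name is John Smith and I live at 123 Main St. "
    "Email: john@email.com, Phone: (555) 123-4567"
)
print(regex_redact(message))
# prints My name is John Smith and I live at 123 Main St. Email: [EMAIL_ADDRESS], Phone: [PHONE_NUMBER]
```

Each additional format, such as a phone number written with dots or a country code, needs another hand-written pattern, which is the maintenance cost behind the "Low-Medium" accuracy rating in the table.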