Cernis

Introducing Precis: A Fast, Tiny Document Intelligence Model Built on IBM Granite

September 8, 2025

Recent releases such as OpenAI's GPT-4o and Anthropic's Claude Sonnet 4 are excellent at document summarization, but they come with significant costs and privacy concerns for sensitive documents. The arrival of high-quality open-source alternatives like IBM's Granite 4.0 gave us an opportunity to build a specialized, privacy-preserving summarization model optimized for our document intelligence platform.

The result is Precis, a fine-tuned document intelligence model that's fast, runs locally, and maintains document privacy. We're releasing it open source for anyone working with confidential documents who needs local, on-premise processing.

The model was trained with efficient supervised fine-tuning (SFT) and runs entirely on local hardware, with no calls to cloud-based APIs, so your documents never leave your infrastructure.

Project Overview

Objective: Train a production-ready LLM to generate comprehensive 300-word summaries of documents, optimized for question-answering capability.

Model Selection: IBM Granite 4.0-H-Micro

  • Parameters: ~3.2B
  • Architecture: Transformer with shared MLP layers
  • Memory footprint: 6GB VRAM (4-bit quantization)
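The memory figure can be sanity-checked with simple arithmetic. A rough sketch (the split of the 6 GB budget between weights and runtime overhead is our estimate, not a published breakdown):

```python
def quantized_weight_gib(n_params: float, bits_per_weight: int) -> float:
    """Approximate weight-storage size in GiB for a quantized model."""
    bytes_total = n_params * bits_per_weight / 8
    return bytes_total / (1024 ** 3)

# ~3.2B parameters at 4 bits -> roughly 1.5 GiB of raw weights.
# The rest of the ~6 GB footprint goes to activations, the KV cache,
# and dequantization buffers at inference time.
weights_gib = quantized_weight_gib(3.2e9, 4)
```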

Dataset Engineering

Data Sources

We created a hybrid dataset combining two complementary sources:

Source 1: CNN/DailyMail

  • Pre-existing human-written summaries
  • News articles with professional highlights
  • Average length: ~750-word articles → ~60-word summaries

Source 2: ServiceNow RepLiQA

  • Synthetic non-factual documents
  • GPT-5-generated summaries (~300 words each)
  • Designed to test contextual understanding
  • Contains 5 Q&A pairs per document

Final Dataset Structure: 5,500 training examples total
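Each training example pairs a source document with its target summary in a chat-style record. A minimal sketch of the conversion, assuming a standard `messages` schema; the prompt wording here is illustrative, not the exact prompt used for Precis:

```python
def to_sft_record(document: str, summary: str, target_words: int = 300) -> dict:
    """Wrap a (document, summary) pair as a chat-format SFT example.

    The system instruction below is a stand-in; the real training
    prompt may differ.
    """
    return {
        "messages": [
            {"role": "system",
             "content": f"Summarize the document in about {target_words} words, "
                        "preserving the facts needed to answer questions about it."},
            {"role": "user", "content": document},
            {"role": "assistant", "content": summary},
        ]
    }

record = to_sft_record("Acme Corp reported record revenue...",
                       "Acme Corp had a strong quarter...")
```

The assistant turn carries the target summary, so during fine-tuning the model learns to produce it conditioned on the instruction and document.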

Training Methodology

Supervised Fine-Tuning (SFT)

We chose SFT over Reinforcement Learning for several pragmatic reasons:

Why SFT?

  • Simpler implementation (no judge LLM required)
  • Deterministic training (no API failures)
  • Lower cost
  • Faster iteration cycles
  • Proven effectiveness for style transfer tasks
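Mechanically, SFT here means next-token cross-entropy computed only on the summary tokens: prompt positions are masked out of the loss. A minimal sketch of the label construction (the -100 ignore index follows the common PyTorch convention; tokenization details are omitted):

```python
IGNORE_INDEX = -100  # positions with this label contribute no loss

def build_sft_labels(prompt_ids: list[int],
                     response_ids: list[int]) -> tuple[list[int], list[int]]:
    """Concatenate prompt and response token ids, masking the prompt.

    The model sees the full sequence as input, but gradient flows
    only through predictions of the response (summary) tokens.
    """
    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels
```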

Evaluation Strategy

Benchmark Design

Primary Metric: Question-Answering Accuracy

  1. Generate summaries with trained model
  2. Use GPT-4 as judge to answer questions from summary only
  3. Compare answers to ground truth
  4. Calculate % questions answered correctly
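The scoring step of that loop reduces to an accuracy computation. A sketch, with exact normalized string match standing in for the judge's free-form answer comparison (the real pipeline uses GPT-4 to grade answers):

```python
def qa_accuracy(judge_answers: list[str], gold_answers: list[str]) -> float:
    """Fraction of questions answered correctly from the summary alone.

    Exact normalized match is a simplification; an LLM judge would
    also accept paraphrased but equivalent answers.
    """
    assert len(judge_answers) == len(gold_answers)
    correct = sum(
        j.strip().lower() == g.strip().lower()
        for j, g in zip(judge_answers, gold_answers)
    )
    return correct / len(gold_answers)
```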

Secondary Metrics:

  • ROUGE-L score vs reference summaries
  • Average summary length
  • % summaries exceeding 350-word limit
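These secondary metrics are straightforward to compute. A pure-Python sketch of ROUGE-L (F1 over the longest common subsequence of tokens) and the over-length check; a production pipeline would more likely use a library such as `rouge-score`:

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    """Longest common subsequence length via dynamic programming."""
    dp = [0] * (len(b) + 1)
    for x in a:
        prev = 0
        for j, y in enumerate(b, 1):
            cur = dp[j]
            dp[j] = prev + 1 if x == y else max(dp[j], dp[j - 1])
            prev = cur
    return dp[-1]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 on whitespace tokens (no stemming, a simplification)."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

def pct_over_limit(summaries: list[str], limit: int = 350) -> float:
    """Percentage of summaries exceeding the word limit."""
    over = sum(len(s.split()) > limit for s in summaries)
    return 100.0 * over / len(summaries)
```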

Qualitative Assessment

Strengths:

  • Consistent 300-word output
  • Maintains key facts and details
  • Natural language flow
  • Fast inference (0.5s/summary)

Deployment Preparation

Model Export Options

1. LoRA Adapters (150MB)

  • Smallest size
  • Requires base model
  • Fast upload to HuggingFace

2. Merged Float16 (6GB)

  • Production ready
  • vLLM optimized
  • Self-contained

3. Merged 4-bit (2GB)

  • Resource constrained deployment
  • Slightly slower inference
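Merging LoRA adapters into the base weights (option 2 above) folds the low-rank update into each layer: W ← W + (α/r)·B·A. A toy nested-list sketch of the per-layer arithmetic; a real export would do this in float16 across all adapted layers, e.g. via PEFT's `merge_and_unload`:

```python
def merge_lora(base: list[list[float]],
               lora_a: list[list[float]],
               lora_b: list[list[float]],
               alpha: float, rank: int) -> list[list[float]]:
    """Fold a LoRA update into a base weight matrix.

    base is d_out x d_in, lora_b is d_out x r, lora_a is r x d_in.
    Returns base + (alpha / rank) * (B @ A), leaving base untouched.
    """
    scale = alpha / rank
    merged = [row[:] for row in base]
    for i in range(len(base)):
        for j in range(len(base[0])):
            delta = sum(lora_b[i][k] * lora_a[k][j] for k in range(rank))
            merged[i][j] += scale * delta
    return merged
```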

Conclusion

We trained a specialized document summarization model using efficient supervised fine-tuning on a compact hybrid dataset.

The project demonstrates that strong summarization quality is achievable with modern efficient training frameworks, a small base model with smart architecture choices, and a pragmatic SFT-first methodology.

The model is ready for production deployment and serves as a strong baseline for future RL optimization.