Day 1: Creating a Document AI SDK Users Actually Want
Six primitives that transform unstructured documents into production-ready data
This is Day 1 of a 7-part series on building Cernis Intelligence - Document Intelligence for the AI era
Most AI projects hit the same wall. Not model selection. Not prompt engineering. Not infrastructure. The bottleneck is always upstream: getting clean data out of the messy documents where it actually lives.
We're talking about millions of unstructured files: PDFs, scanned images, handwritten forms, invoices buried in email threads. Documents representing decades of institutional knowledge, trapped in formats that modern AI can't consume.
Over the past few years, I've helped traditional companies integrate AI workflows into their systems. Healthcare. Legal. Finance. Logistics. The pattern is always the same: sophisticated AI pipelines producing garbage results because the foundational document processing is broken. It doesn't matter how elegant your RAG architecture is or how carefully you've tuned your prompts.
Getting document processing right is the prerequisite for everything else. And that starts with parsing documents into clean, structured, production-ready data.
What Users Actually Want
Here's what we believe users actually need from a document AI SDK:
Multi-provider flexibility. No vendor lock-in. With new SOTA models released weekly, developers need access to the latest models for document parsing while still keeping proprietary documents private. Switching providers should be a one-line change, as sketched after this list.
Reliability over features. Six functions that work 99% of the time beat twenty that work 80%.
Transparent costs. Token counting isn't a nice-to-have; it's essential for budget management.
We also believe type safety and cross-runtime support matter to users; both are covered later in this post.
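To make the first point concrete, here's a minimal sketch of provider switching using the ocr() primitive introduced below. The provider strings match the examples later in this post; the variable names are illustrative.
# Illustrative sketch: the same call routed to two different providers.
# Only the provider string and API key change; the result shape stays the same.
result_mistral = await ocr(
    file_path="./scanned-contract.pdf",
    provider="mistral",
    api_key=mistral_api_key
)
result_openai = await ocr(
    file_path="./scanned-contract.pdf",
    provider="openai",
    api_key=openai_api_key
)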
The Six Core Primitives
After solving this problem across several domains, we've converged on the operations that address these needs. Every document processing workflow I've built ultimately reduces to six fundamental operations.
1. ocr() — Converting Pixels to Text
The foundational problem: You have a scanned document, a screenshot, a photograph of a whiteboard. Without reliable OCR, nothing else works.
Why it matters: Legacy systems run on paper. Medical records from the 1990s. Legal contracts in filing cabinets. Customer forms filled out by hand. Until these are converted to text, they remain invisible to AI, and companies can't leverage the decades of data they already own.
With our package, that takes a few lines of code.
result = await ocr(
    file_path="./scanned-contract.pdf",
    provider="mistral",
    api_key=api_key
)
print(result.text)      # Full document text
print(result.markdown)  # Preserved formatting
2. extract() — From Text to Structured Data
The real-world problem: If you're building real-time UI screens for AI apps, sometimes you don't want "a bunch of text." You want invoice_number: "INV-2025-001", total: 1247.50, line_items: [...].
Why it matters: Unstructured text breaks automation. You can't build reliable workflows when every document is a string blob. Structured extraction with schema validation is what makes document AI production-ready.
Raw text is just the beginning. Your database needs typed fields. Your API needs JSON. Your analytics dashboard needs validated data. We solve that with schema-driven extraction:
from pydantic import BaseModel, Field
from typing import List

class InvoiceLineItem(BaseModel):
    description: str
    quantity: int = Field(gt=0)
    unit_price: float = Field(gt=0)
    total: float

class Invoice(BaseModel):
    invoice_number: str
    date: str
    vendor: str
    line_items: List[InvoiceLineItem]
    subtotal: float
    tax: float | None = None
    total: float

# Extract with full type safety
invoice = await extract(
    file_path="./invoice.pdf",
    response_format=Invoice,
    provider="openai",
    api_key=api_key
)
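Because the result is a validated Invoice instance, downstream code works with typed fields directly. A short usage sketch based on the schema above (the printed values are illustrative):
# Typed access: the IDE knows these fields and their types.
print(invoice.invoice_number)   # e.g. "INV-2025-001"
print(invoice.total)            # float, ready for arithmetic
for item in invoice.line_items:
    print(item.description, item.quantity * item.unit_price)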
3. classify() — Understanding What You're Looking At
The multi-document problem: Real documents don't arrive neatly labeled. A 200-page patient record contains intake forms, lab results, treatment notes, insurance forms, and prescriptions, all merged into a single file.
You need to know what you're processing before you can decide how to process it.
Why it matters: Classification enables routing. Intake forms go to registration. Lab results go to physicians. Treatment notes go to billing. Without classification, you're treating every document identically, and accuracy suffers.
result = await classify(
    file_path="./patient-record.pdf",
    categories=[
        CategoryDescription(
            name="Patient Intake Form",
            description="Initial patient information and medical history"
        ),
        CategoryDescription(
            name="Lab Results",
            description="Laboratory test results with numeric values"
        ),
        CategoryDescription(
            name="Treatment Notes",
            description="Doctor's notes from patient visits"
        ),
    ]
)
# Result shows exactly which pages contain what
# {
# "Patient Intake Form": {"pages": [1, 2], "confidence": "high"},
# "Lab Results": {"pages": [3, 4, 5, 8, 9], "confidence": "high"},
# "Treatment Notes": {"pages": [6, 7, 10, 11], "confidence": "high"}
# }
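Since routing is the whole point, here's a hedged sketch of the downstream step. The route_to_* handlers are hypothetical placeholders, and the result is assumed to be dict-shaped like the commented output above.
# Hypothetical routing: dispatch page ranges based on the classify() result.
routes = {
    "Patient Intake Form": route_to_registration,  # placeholder handler
    "Lab Results": route_to_physicians,            # placeholder handler
    "Treatment Notes": route_to_billing,           # placeholder handler
}
for category, details in result.items():
    handler = routes.get(category)
    if handler:
        handler(file_path="./patient-record.pdf", pages=details["pages"])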
4. summarize() — Compression Without Loss of Meaning
The information overload problem: Legal contracts run 50+ pages. Research papers bury findings in dense methodology. Quarterly reports contain 100 pages of boilerplate hiding 3 pages of insight. Your users don't have time to read everything. They need the essentials.
summary = await summarize(
    file_path="./legal-contract.pdf",
    provider="anthropic",
    api_key=api_key,
    max_length=500
)
print(summary.text)
# "This service agreement between Acme Corp and Client Inc establishes...
# Key terms: 24-month duration, $50K monthly fee, 30-day cancellation notice..."
How this differs from extract(): Extraction pulls specific fields. Summarization compresses the entire document while preserving essential meaning. You're not looking for the vendor name, you're asking "what does this contract actually say?"
Why it matters: Human attention is the bottleneck. Summaries let experts review 10x more documents. They triage what needs deep reading versus what can be skipped.
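As a rough sketch of that triage loop (the file paths are illustrative), summaries can be generated in bulk so a reviewer only opens the contracts that need deep reading:
# Illustrative triage: summarize a batch, let a human skim the results.
contracts = ["./contracts/acme.pdf", "./contracts/globex.pdf"]
for path in contracts:
    summary = await summarize(
        file_path=path,
        provider="anthropic",
        api_key=api_key,
        max_length=500
    )
    print(f"{path}:\n{summary.text}\n")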
5. chunk() — Semantic Segmentation for RAG
The context window problem: You have a 200-page document. Your LLM has a 128K token limit. Even if it fits, you don't want to pay for processing 200 pages when the answer is on page 47.
You need semantic-aware chunking to ensure segments maintain their contextual integrity and support more accurate, efficient retrieval.
chunks = await chunk(
    file_path="./research-paper.pdf",
    chunk_size=1000,
    overlap=100,
    strategy="semantic"
)

for chunk in chunks:
    print(f"Page {chunk.page_number}: {chunk.text[:100]}...")
    vector_db.add(
        text=chunk.text,
        metadata={"page": chunk.page_number, "section": chunk.section_title}
    )
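On the retrieval side, the payoff is that only the relevant chunks reach the LLM. A minimal sketch, assuming the same hypothetical vector_db as above and a generic query interface:
# Hypothetical retrieval: fetch only the chunks relevant to the question,
# instead of sending all 200 pages to the model.
question = "What was the main finding of the study?"
relevant_chunks = vector_db.query(text=question, top_k=5)  # assumed interface
context = "\n\n".join(c.text for c in relevant_chunks)
# `context` now fits comfortably in the model's window and costs a fraction
# of processing the full document.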
6. count_tokens() — Cost and Context Management
The practical problem: LLM APIs charge by token. Context windows have hard limits. Before processing a 100-page document, you need to know: Will this fit? What will it cost?
Why it matters: Production systems need cost predictability. Token counting prevents surprise bills and enables intelligent routing.
token_count = await count_tokens(
    file_path="./large-document.pdf",
)

estimated_cost = token_count * 0.00001

if token_count > 100000:
    print(f"Warning: {token_count} tokens, ~${estimated_cost:.2f}")
    # Route to chunking or summarization
else:
    result = await extract(file_path="./large-document.pdf", ...)
These six primitives are compositional. Complex problems dissolve into combinations (a sketch of the invoice pipeline follows these examples):
Invoice processing:
classify() → Identify invoice pages in mixed documents
extract() → Pull structured data
summarize() → Generate summary for approval workflows
Medical record analysis:
classify() → Split 200-page record into sections
ocr() → Extract text from handwritten notes
chunk() → Segment for RAG-based diagnosis assistance
Legal contract review:
ocr() → Digitize scanned contracts
summarize() → Extract key terms and obligations
extract() → Pull specific clauses
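Here's a minimal sketch of the invoice pipeline composed from the calls shown earlier. The file path and abbreviated category list are illustrative, and Invoice is the schema defined in the extract() section above; treat the wiring as a sketch rather than a prescription.
# Illustrative composition: classify, then extract, then summarize.
classification = await classify(
    file_path="./mixed-documents.pdf",
    categories=[
        CategoryDescription(name="Invoice", description="Vendor invoices with line items"),
        CategoryDescription(name="Other", description="Everything else"),
    ]
)

# In a full pipeline, the classification result would decide which pages
# are worth structured extraction.
invoice = await extract(
    file_path="./mixed-documents.pdf",
    response_format=Invoice,
    provider="openai",
    api_key=api_key
)

summary = await summarize(
    file_path="./mixed-documents.pdf",
    provider="openai",
    api_key=api_key,
    max_length=200
)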
Type Safety: The Non-Negotiable
We also prioritize type safety because it directly strengthens reliability across the entire extraction pipeline. In practice, this gives you:
Compile-time guarantees — Your IDE catches schema mismatches before deployment:
invoice = await extract(file_path="invoice.pdf", response_format=Invoice)
print(invoice.totall) # IDE error: 'Invoice' has no attribute 'totall'
Runtime validation — Malformed LLM outputs rejected automatically:
# LLM returns {"total": "invalid"}
# Pydantic raises ValidationError before your code sees it
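A short sketch of that failure path, assuming the SDK lets Pydantic's ValidationError propagate when the model's output doesn't match the schema:
from pydantic import ValidationError

# Hedged sketch: catch schema violations instead of letting bad data through.
try:
    invoice = await extract(
        file_path="./invoice.pdf",
        response_format=Invoice,
        provider="openai",
        api_key=api_key
    )
except ValidationError as err:
    # The model returned something like {"total": "invalid"}; log it and retry,
    # or flag the document for manual review.
    print(err)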
Self-documenting APIs — Your schema is documentation:
class Invoice(BaseModel):
    invoice_number: str
    total: float = Field(gt=0, description="Total amount in USD")
    line_items: List[InvoiceLineItem] = Field(min_items=1)
We use industry-standard validation: Pydantic for Python, Zod for TypeScript. Tools your team already knows.
Cross-Runtime: Python and TypeScript
Lastly, we offer cross-runtime support: Python for ML workflows and TypeScript for web and edge processing, because developers build in both worlds.
Both SDKs expose identical primitives, so developers stay productive wherever their code runs.
Python:
invoice = await extract(
    file_path="./invoice.pdf",
    response_format=Invoice,
    provider="openai",
    api_key=api_key
)
TypeScript:
const invoice = await extract({
  filePath: './invoice.pdf',
  responseFormat: Invoice,
  provider: 'openai',
  apiKey: apiKey
});
Same semantics. Runtime-specific strengths.
The Bottom Line
The future of Document AI goes beyond better models. Developer experience is just as crucial. We've distilled all the insights above into Docuglean OCR, a toolkit that lets developers run intelligent document processing with state-of-the-art vision LLMs at scale. With our SDKs, you can efficiently extract structured outputs such as JSON, Markdown, and HTML directly from documents.
I'd love for you to try it out and let me know what you think!