Day 1: Creating a Document AI SDK Users Actually Want
Six primitives that transform unstructured documents into production-ready data
This is Day 1 of a 7-part series on building Cernis Intelligence - Document Intelligence for the AI era
Most AI projects hit the same wall. Not model selection. Not prompt engineering. Not infrastructure. The bottleneck is always upstream: getting clean data out of the messy documents where it actually lives.
We're talking about millions of unstructured files: PDFs, scanned images, handwritten forms, invoices buried in email threads. Documents representing decades of institutional knowledge, trapped in formats that modern AI can't consume.
Over the past few years, I've helped traditional companies integrate AI workflows into their systems. Healthcare. Legal. Finance. Logistics. The pattern is always the same: sophisticated AI pipelines producing garbage results because the foundational document processing is broken. It doesn't matter how elegant your RAG architecture is or how carefully you've tuned your prompts.
Getting document processing right is the prerequisite for everything else. And that starts with parsing documents into clean, structured, production-ready data.
What Users Actually Want
Here's what we believe users actually need from a document AI SDK:
Multi-provider flexibility. No vendor lock-in. With new SOTA models released weekly, developers need access to the latest models for document parsing while still keeping proprietary documents private. Switching providers should be a one-line change, as sketched after this list.
Reliability over features. Six functions that work 99% of the time beat twenty that work 80%.
Transparent costs. Token counting isn't a nice-to-have; it's essential for budget management.
We also believe type safety and cross-runtime support matter to users; both are covered later in this post.
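To make the first point concrete, here's a minimal sketch of provider switching using the ocr() primitive introduced below. The provider strings match the examples later in this post; the variable names are illustrative.
# Illustrative sketch: the same call routed to two different providers.
# Only the provider string and API key change; the result shape stays the same.
result_mistral = await ocr(
    file_path="./scanned-contract.pdf",
    provider="mistral",
    api_key=mistral_api_key
)
result_openai = await ocr(
    file_path="./scanned-contract.pdf",
    provider="openai",
    api_key=openai_api_key
)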
The Six Core Primitives
After solving this problem across several domains, we've converged on the operations that address these needs. Every document processing workflow I've built ultimately reduces to six fundamental operations.
1. ocr() — Converting Pixels to Text
The foundational problem: You have a scanned document, a screenshot, a photograph of a whiteboard. Without reliable OCR, nothing else works.
Why it matters: Legacy systems run on paper. Medical records from the 1990s. Legal contracts in filing cabinets. Customer forms filled out by hand. Until these are converted to text, they remain invisible to AI, and companies can't leverage the decades of data they already own.
With our package, that takes a few lines of code.
result = await ocr(
    file_path="./scanned-contract.pdf",
    provider="mistral",
    api_key=api_key
)
print(result.text)      # Full document text
print(result.markdown)  # Preserved formatting
2. extract() — From Text to Structured Data
The real-world problem: If you're building real-time UI screens for AI apps, sometimes you don't want "a bunch of text." You want invoice_number: "INV-2025-001", total: 1247.50, line_items: [...].
Why it matters: Unstructured text breaks automation. You can't build reliable workflows when every document is a string blob. Structured extraction with schema validation is what makes document AI production-ready.
Raw text is just the beginning. Your database needs typed fields. Your API needs JSON. Your analytics dashboard needs validated data. We solve that with schema-driven extraction:
from pydantic import BaseModel, Field
from typing import List

class InvoiceLineItem(BaseModel):
    description: str
    quantity: int = Field(gt=0)
    unit_price: float = Field(gt=0)
    total: float

class Invoice(BaseModel):
    invoice_number: str
    date: str
    vendor: str
    line_items: List[InvoiceLineItem]
    subtotal: float
    tax: float | None = None
    total: float

# Extract with full type safety
invoice = await extract(
    file_path="./invoice.pdf",
    response_format=Invoice,
    provider="openai",
    api_key=api_key
)
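Because the result is a validated Invoice instance, downstream code works with typed fields directly. A short usage sketch based on the schema above (the printed values are illustrative):
# Typed access: the IDE knows these fields and their types.
print(invoice.invoice_number)   # e.g. "INV-2025-001"
print(invoice.total)            # float, ready for arithmetic
for item in invoice.line_items:
    print(item.description, item.quantity * item.unit_price)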
3. classify() — Understanding What You're Looking At
The multi-document problem: Real documents don't arrive neatly labeled. A 200-page patient record contains intake forms, lab results, treatment notes, insurance forms, and prescriptions, all merged into a single file.
You need to know what you're processing before you can decide how to process it.
Why it matters: Classification enables routing. Intake forms go to registration. Lab results go to physicians. Treatment notes go to billing. Without classification, you're treating every document identically, and accuracy suffers.
result = await classify(
    file_path="./patient-record.pdf",
    categories=[
        CategoryDescription(
            name="Patient Intake Form",
            description="Initial patient information and medical history"
        ),
        CategoryDescription(
            name="Lab Results",
            description="Laboratory test results with numeric values"
        ),
        CategoryDescription(
            name="Treatment Notes",
            description="Doctor's notes from patient visits"
        ),
    ]
)
# Result shows exactly which pages contain what
# {
# "Patient Intake Form": {"pages": [1, 2], "confidence": "high"},
# "Lab Results": {"pages": [3, 4, 5, 8, 9], "confidence": "high"},
# "Treatment Notes": {"pages": [6, 7, 10, 11], "confidence": "high"}
# }
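Since routing is the whole point, here's a hedged sketch of the downstream step. The route_to_* handlers are hypothetical placeholders, and the result is assumed to be dict-shaped like the commented output above.
# Hypothetical routing: dispatch page ranges based on the classify() result.
routes = {
    "Patient Intake Form": route_to_registration,  # placeholder handler
    "Lab Results": route_to_physicians,            # placeholder handler
    "Treatment Notes": route_to_billing,           # placeholder handler
}
for category, details in result.items():
    handler = routes.get(category)
    if handler:
        handler(file_path="./patient-record.pdf", pages=details["pages"])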
4. summarize() — Compression Without Loss of Meaning
The information overload problem: Legal contracts run 50+ pages. Research papers bury findings in dense methodology. Quarterly reports contain 100 pages of boilerplate hiding 3 pages of insight. Your users don't have time to read everything. They need the essentials.
summary = await summarize(
    file_path="./legal-contract.pdf",
    provider="anthropic",
    api_key=api_key,
    max_length=500
)
print(summary.text)
# "This service agreement between Acme Corp and Client Inc establishes...
# Key terms: 24-month duration, $50K monthly fee, 30-day cancellation notice..."
How this differs from extract(): Extraction pulls specific fields. Summarization compresses the entire document while preserving essential meaning. You're not looking for the vendor name, you're asking "what does this contract actually say?"
Why it matters: Human attention is the bottleneck. Summaries let experts review 10x more documents. They triage what needs deep reading versus what can be skipped.
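As a rough sketch of that triage loop (the file paths are illustrative), summaries can be generated in bulk so a reviewer only opens the contracts that need deep reading:
# Illustrative triage: summarize a batch, let a human skim the results.
contracts = ["./contracts/acme.pdf", "./contracts/globex.pdf"]
for path in contracts:
    summary = await summarize(
        file_path=path,
        provider="anthropic",
        api_key=api_key,
        max_length=500
    )
    print(f"{path}:\n{summary.text}\n")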
5. chunk() — Semantic Segmentation for RAG
The context window problem: You have a 200-page document. Your LLM has a 128K token limit. Even if it fits, you don't want to pay for processing 200 pages when the answer is on page 47.
You need semantic-aware chunking to ensure segments maintain their contextual integrity and support more accurate, efficient retrieval.
chunks = await chunk(
    file_path="./research-paper.pdf",
    chunk_size=1000,
    overlap=100,
    strategy="semantic"
)

for chunk in chunks:
    print(f"Page {chunk.page_number}: {chunk.text[:100]}...")
    vector_db.add(
        text=chunk.text,
        metadata={"page": chunk.page_number, "section": chunk.section_title}
    )
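On the retrieval side, the payoff is that only the relevant chunks reach the LLM. A minimal sketch, assuming the same hypothetical vector_db as above and a generic query interface:
# Hypothetical retrieval: fetch only the chunks relevant to the question,
# instead of sending all 200 pages to the model.
question = "What was the main finding of the study?"
relevant_chunks = vector_db.query(text=question, top_k=5)  # assumed interface
context = "\n\n".join(c.text for c in relevant_chunks)
# `context` now fits comfortably in the model's window and costs a fraction
# of processing the full document.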
6. count_tokens() — Cost and Context Management
The practical problem: LLM APIs charge by token. Context windows have hard limits. Before processing a 100-page document, you need to know: Will this fit? What will it cost?
Why it matters: Production systems need cost predictability. Token counting prevents surprise bills and enables intelligent routing.
token_count = await count_tokens(
    file_path="./large-document.pdf",
)

estimated_cost = token_count * 0.00001

if token_count > 100000:
    print(f"Warning: {token_count} tokens, ~${estimated_cost:.2f}")
    # Route to chunking or summarization
else:
    result = await extract(file_path="./large-document.pdf", ...)
These six primitives are compositional. Complex problems dissolve into combinations (a sketch of the invoice pipeline follows these examples):
Invoice processing:
classify() → Identify invoice pages in mixed documents
extract() → Pull structured data
summarize() → Generate summary for approval workflows
Medical record analysis:
classify() → Split 200-page record into sections
ocr() → Extract text from handwritten notes
chunk() → Segment for RAG-based diagnosis assistance
Legal contract review:
ocr() → Digitize scanned contracts
summarize() → Extract key terms and obligations
extract() → Pull specific clauses
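Here's a minimal sketch of the invoice pipeline composed from the calls shown earlier. The file path and abbreviated category list are illustrative, and Invoice is the schema defined in the extract() section above; treat the wiring as a sketch rather than a prescription.
# Illustrative composition: classify, then extract, then summarize.
classification = await classify(
    file_path="./mixed-documents.pdf",
    categories=[
        CategoryDescription(name="Invoice", description="Vendor invoices with line items"),
        CategoryDescription(name="Other", description="Everything else"),
    ]
)

# In a full pipeline, the classification result would decide which pages
# are worth structured extraction.
invoice = await extract(
    file_path="./mixed-documents.pdf",
    response_format=Invoice,
    provider="openai",
    api_key=api_key
)

summary = await summarize(
    file_path="./mixed-documents.pdf",
    provider="openai",
    api_key=api_key,
    max_length=200
)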
Type Safety: The Non-Negotiable
We also prioritize type safety because it directly strengthens reliability across the entire extraction pipeline. In practice, this gives you:
Compile-time guarantees — Your IDE catches schema mismatches before deployment:
invoice = await extract(file_path="invoice.pdf", response_format=Invoice)
print(invoice.totall) # IDE error: 'Invoice' has no attribute 'totall'
Runtime validation — Malformed LLM outputs rejected automatically:
# LLM returns {"total": "invalid"}
# Pydantic raises ValidationError before your code sees it
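A short sketch of that failure path, assuming the SDK lets Pydantic's ValidationError propagate when the model's output doesn't match the schema:
from pydantic import ValidationError

# Hedged sketch: catch schema violations instead of letting bad data through.
try:
    invoice = await extract(
        file_path="./invoice.pdf",
        response_format=Invoice,
        provider="openai",
        api_key=api_key
    )
except ValidationError as err:
    # The model returned something like {"total": "invalid"}; log it and retry,
    # or flag the document for manual review.
    print(err)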
Self-documenting APIs — Your schema is documentation:
class Invoice(BaseModel):
    invoice_number: str
    total: float = Field(gt=0, description="Total amount in USD")
    line_items: List[InvoiceLineItem] = Field(min_items=1)
We use industry-standard validation: Pydantic for Python, Zod for TypeScript. Tools your team already knows.
Cross-Runtime: Python and TypeScript
Lastly, we offer cross-runtime support: Python for ML workflows and TypeScript for web and edge processing, because developers build in both worlds.
Both SDKs expose identical primitives, so developers stay productive wherever their code runs.
Python:
invoice = await extract(
    file_path="./invoice.pdf",
    response_format=Invoice,
    provider="openai",
    api_key=api_key
)
TypeScript:
const invoice = await extract({
  filePath: './invoice.pdf',
  responseFormat: Invoice,
  provider: 'openai',
  apiKey: apiKey
});
Same semantics. Runtime-specific strengths.
The Bottom Line
The future of Document AI goes beyond better models. Developer experience is just as crucial. We've distilled all the insights above into Docuglean OCR, a toolkit that lets developers run intelligent document processing with state-of-the-art vision LLMs at scale. With our SDKs, you can efficiently extract structured outputs such as JSON, Markdown, and HTML directly from documents.
I'd love for you to try it out and let me know what you think!