Cernis

Multi-pass Context-Aware Processing

Agentic OCR: Beyond Traditional Document Processing

2025-01-15 · Research

Why?

Data is the fuel for AI innovation. Over the past couple of years, I've helped several traditional companies integrate AI workflows into their systems, and the biggest bottleneck has been extracting insights from their existing data: millions of unstructured documents in the form of PDFs and images, many of them scanned or handwritten. Garbage in, garbage out.

It doesn't matter how sophisticated a company's AI workflows, prompts, RAG pipelines or agents are: getting the foundational document processing pipeline right is the only way to get the most out of your data. That starts with parsing documents into ready-to-use, high-quality data.

Why Agentic OCR?

Over the past couple of months, there has been a huge influx of amazing SOTA VLMs from Qwen, Allen AI and Hugging Face. But documents, especially PDFs, are a tricky, complex bunch. Between complex layouts, tables and varied formats, VLMs alone are not enough.

Agentic OCR builds on these models by adding a multi-pass, context- and layout-aware agent that reviews outputs, catches and fixes errors, enforces layout and structure, and adds bounding boxes. The aim is a rich, seamless experience with near-perfect accuracy.
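To make the multi-pass idea concrete, here is a minimal sketch of a review-and-retry loop. It is not docuglean's actual implementation: `multi_pass_ocr`, `PassResult`, and the `extract`/`review` callables are hypothetical names invented for illustration, and the two `fake_*` functions are toy stand-ins for a real VLM call and a real reviewer.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PassResult:
    text: str
    issues: List[str] = field(default_factory=list)


# Hypothetical signatures (assumptions, not a real API):
#   extract(page_bytes, feedback) -> extracted text
#   review(text) -> list of problems found (empty means "looks fine")
Extractor = Callable[[bytes, List[str]], str]
Reviewer = Callable[[str], List[str]]


def multi_pass_ocr(page: bytes, extract: Extractor, review: Reviewer,
                   max_passes: int = 3) -> PassResult:
    """Re-run extraction with reviewer feedback until the review is clean
    or the pass budget is spent."""
    feedback: List[str] = []
    result = PassResult(text="")
    for _ in range(max_passes):
        text = extract(page, feedback)
        issues = review(text)
        result = PassResult(text=text, issues=issues)
        if not issues:
            break
        # Feed the reviewer's findings into the next extraction pass.
        feedback = issues
    return result


def fake_extract(page: bytes, feedback: List[str]) -> str:
    # Toy stand-in for a VLM call: only closes the table row once told to.
    return "| a | b |" if feedback else "| a | b"


def fake_review(text: str) -> List[str]:
    # Toy stand-in for a reviewer: flags a table row with no closing pipe.
    return [] if text.endswith("|") else ["unterminated table row"]


result = multi_pass_ocr(b"", fake_extract, fake_review)
```

In this toy run the first pass produces a broken table row, the reviewer flags it, and the second pass returns a clean result. Capping the loop with `max_passes` keeps cost bounded when a page never passes review; any remaining issues are returned to the caller rather than hidden.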

At first, docuglean will work with these existing VLMs, but I would like to release our own custom domain-specific models that perform fast document intelligence without jeopardizing privacy.

The Foundation

Agentic OCR is powered by SOTA open-source VLMs for parsing documents, including enterprise file types, into high-quality, ready-to-use data. It starts with existing VLMs that are better equipped to handle complex tables, formats and layouts.

Key Capabilities

  • Multi-pass context & layout-aware processing
  • Error detection and correction
  • Output review and validation
  • Layout and structure enforcement
  • Bounding box support
  • Near-perfect accuracy for complex documents
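As one way to picture bounding box support and structure enforcement together, the sketch below models each extracted region as a block with a box and runs basic sanity checks. Everything here is an assumption made for illustration: the `Block` type, the `(x0, y0, x1, y1)` top-left-origin coordinate convention, and the `validate_blocks` and `reading_order` helpers are hypothetical and not part of any published docuglean API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Block:
    kind: str  # e.g. "heading", "paragraph", "table" (illustrative labels)
    text: str
    # Assumed convention: (x0, y0, x1, y1) in page units, origin at top-left.
    bbox: Tuple[float, float, float, float]


def validate_blocks(blocks: List[Block], page_width: float,
                    page_height: float) -> List[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    problems: List[str] = []
    for i, block in enumerate(blocks):
        x0, y0, x1, y1 = block.bbox
        inside = 0 <= x0 < x1 <= page_width and 0 <= y0 < y1 <= page_height
        if not inside:
            problems.append(f"block {i}: bbox {block.bbox} is degenerate or off-page")
        if not block.text.strip():
            problems.append(f"block {i}: empty text")
    return problems


def reading_order(blocks: List[Block]) -> List[Block]:
    """Naive top-to-bottom, left-to-right ordering; ignores multi-column layouts."""
    return sorted(blocks, key=lambda b: (b.bbox[1], b.bbox[0]))


blocks = [
    Block("paragraph", "Body text", (10, 50, 100, 80)),
    Block("heading", "Title", (10, 10, 100, 30)),
    Block("table", "", (0, 0, 700, 10)),  # too wide for the page, and empty
]
problems = validate_blocks(blocks, page_width=612, page_height=792)
ordered = reading_order(blocks[:2])
```

Problems found this way could be fed back into a later extraction pass instead of being silently dropped. The reading-order helper is deliberately simplistic; real layouts with columns, sidebars or rotated text would need a more careful strategy.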

The Vision

The goal is to create a document processing pipeline that doesn't compromise on accuracy or privacy. By building on top of state-of-the-art VLMs and adding agentic capabilities, we're creating a system that understands documents the way humans do - with context, structure, and semantic awareness.