Announcing CernisOCR: A Faster, Lighter Multi-Domain OCR Model

September 12, 2025

Earlier this year, we set out to create a unified OCR solution that could handle multiple document types — from mathematical formulas to handwritten notes to structured invoices — all in a single model. We were excited by the potential of vision-language models for document processing and saw an opportunity to leverage the latest Qwen2.5-VL architecture with parameter-efficient fine-tuning to create something both powerful and practical.

Why This Work Matters

Document parsing and OCR remain challenging: layouts differ, text may be printed or handwritten, and scans are often noisy or of variable quality. Existing off-the-shelf solutions often struggle when very different domains are combined (receipts, math, and handwriting). Qwen2.5-VL is powerful, but out of the box it isn't specialized for every OCR task. Fine-tuning gives us a way to:

Improve recognition accuracy on domain data
Handle varied image types/layouts
Extract structured info (e.g. invoices) as well as raw text
Do all of this with limited resources (we used ≈10k examples and a modest GPU budget)

Multi-domain training: We unified three traditionally separate OCR tasks into a single model — mathematical LaTeX conversion, handwritten text transcription, and invoice/receipt data extraction.

Fast convergence: The loss dropped by 97.6% over the course of training, indicating efficient learning across diverse document types.
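For readers unfamiliar with the metric, a relative loss reduction is usually computed as the drop in loss divided by the starting loss. The sketch below assumes that definition. This post does not state the exact formula or report raw loss values, so the function name and the start and end losses here are illustrative placeholders, chosen only so that the arithmetic lands on 97.6%.

```python
def loss_reduction_pct(initial_loss: float, final_loss: float) -> float:
    """Relative loss reduction in percent: (initial - final) / initial * 100."""
    if initial_loss <= 0:
        raise ValueError("initial_loss must be positive")
    return (initial_loss - final_loss) / initial_loss * 100.0


# Hypothetical values, NOT taken from the actual training run.
example_initial = 2.5
example_final = 0.06
print(f"{loss_reduction_pct(example_initial, example_final):.1f}%")
```

A relative figure like this depends heavily on the starting loss, so it is best read alongside task-level accuracy rather than on its own.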

Training was completed on an RTX 4090, and the model supports vLLM.

Dataset composition

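We constructed a dataset of roughly 10k samples (9,999 in total) from three sources, with the LaTeX and handwriting sources contributing equally and the invoice source contributing about half as many: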

Dataset	Samples	Task Type	Format
LaTeX OCR	3,978	Mathematical notation → LaTeX	Images of formulas
Invoices & Receipts	2,043	Structured document extraction	Invoice images → JSON
Handwritten Text	3,978	Handwriting transcription	Handwritten images → text
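As a quick check on the mix, the counts in the table can be turned into shares of the total. This is a minimal sketch: the counts come from the table, while the helper name `dataset_shares` is made up for illustration.

```python
# Sample counts as listed in the table above.
DATASET_COUNTS = {
    "LaTeX OCR": 3978,
    "Invoices & Receipts": 2043,
    "Handwritten Text": 3978,
}


def dataset_shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each source's share of the total, as a percentage."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


total_samples = sum(DATASET_COUNTS.values())
print(f"total: {total_samples}")
for name, share in dataset_shares(DATASET_COUNTS).items():
    print(f"{name}: {share:.1f}%")
```

The two recognition tasks each make up just under 40% of the data, and invoices and receipts make up about 20%.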

Example 1: Mathematical formula conversion

Given a complex handwritten mathematical equation, CernisOCR produces accurate LaTeX output. It correctly captures notation that would typically require specialized mathematical OCR tools, including fractions, integrals, and Greek letters. Because the approach is unified, the same model handles both the visual recognition and the LaTeX formatting in one step.

Input: Handwritten equation with complex notation
Output: \int_{0}^{\infty} \frac{\sin(x)}{x} dx = \frac{\pi}{2} ✓

Example 2: Handwritten text transcription

In this case, CernisOCR successfully transcribes cursive handwriting that would challenge traditional OCR systems. The model maintains reading order and handles various handwriting styles, from neat printing to flowing cursive script.

Input: Handwritten note "Today, however, that the Brussels Treaty"
Output: "Today, however, that the Brussels Treaty" ✓

Example 3: Invoice data extraction

This example shows where CernisOCR excels: extracting structured data from invoices and receipts. The model identifies key fields like vendor names, amounts, dates, and line items, outputting them in a structured JSON format suitable for downstream processing.

Input: Scanned invoice with mixed text and numbers
Output: {"vendor": "ABC Corp", "total": "$1,234.56", "date": "2024-09-12", "items": [...]} ✓
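Because the amounts and dates arrive as strings, a downstream consumer will usually want to convert them to typed values before using them. The sketch below shows one way to do that with the Python standard library. It assumes the field names from the example output, a US-style dollar amount, and an ISO-formatted date. The function name `normalize_invoice` is hypothetical, and the `items` list is left empty because its schema is not shown here.

```python
import json
from datetime import date
from decimal import Decimal


def normalize_invoice(raw_json: str) -> dict:
    """Parse model output and convert the total and date to typed values."""
    record = json.loads(raw_json)
    # Strip the currency symbol and thousands separators: "$1,234.56" -> 1234.56
    total = Decimal(record["total"].replace("$", "").replace(",", ""))
    return {
        "vendor": record["vendor"],
        "total": total,
        "date": date.fromisoformat(record["date"]),
        # Line items are passed through untouched; their schema is an unknown here.
        "items": record.get("items", []),
    }


# Same fields as the example output; "items" is an empty placeholder.
sample_output = (
    '{"vendor": "ABC Corp", "total": "$1,234.56", '
    '"date": "2024-09-12", "items": []}'
)
invoice = normalize_invoice(sample_output)
print(invoice["vendor"], invoice["total"], invoice["date"].isoformat())
```

Using `Decimal` rather than `float` avoids rounding surprises when totals are later summed or compared. Output that is not valid JSON raises `json.JSONDecodeError`, and a malformed amount raises `decimal.InvalidOperation`, so callers can catch either and flag the document for review.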

Try CernisOCR

We're excited to share it as an open-source model for anyone to try out. You can find instructions, example code, and model details in the Hugging Face README.

You can also try our document studio, which helps transform unstructured documents into structured, reliable data.

From training domain-specific state-of-the-art OCR models to enterprise-grade PII protection, our comprehensive document intelligence platform transforms unstructured data into actionable insights. Built for privacy, designed for scale, trusted by industry leaders.
