New Model
Multi-Domain OCR
Announcing CernisOCR: A Faster, Lighter Multi-Domain OCR Model
Announcing CernisOCR: A Faster, Lighter Multi-Domain OCR Model
September 12, 2025
Earlier this year, we set out to create a unified OCR solution that could handle multiple document types — from mathematical formulas to handwritten notes to structured invoices — all in a single model. We were excited by the potential of vision-language models for document processing and saw an opportunity to leverage the latest Qwen2.5-VL architecture with parameter-efficient fine-tuning to create something both powerful and practical.
Why This Work Matters
Document parsing / OCR remains a challenge: different layouts, text styles (printed vs. handwritten), noise, variable quality, etc. Existing off-the-shelf solutions often struggle when combining very different domains (receipts + math + handwriting). Qwen2.5-VL is powerful, but out-of-the-box, it isn't specialized for all OCR tasks. Fine-tuning gives us a way to:
- Improve recognition accuracy on domain data
- Handle varied image types/layouts
- Extract structured info (e.g. invoices) as well as raw text
- Do it with somewhat limited resources (we used ≈10k examples, modest GPU budget)
Multi-domain training: We unified three traditionally separate OCR tasks into a single model — mathematical LaTeX conversion, handwritten text transcription, and invoice/receipt data extraction.
Fast convergence: The model achieved 97.6% loss reduction demonstrating efficient learning across diverse document types.
Training completed on RTX 4090 with vLLM support.
Dataset composition
We constructed a balanced 10k sample dataset from three sources:
| Dataset | Samples | Task Type | Format |
|---|---|---|---|
| LaTeX OCR | 3,978 | Mathematical notation → LaTeX | Images of formulas |
| Invoices & Receipts | 2,043 | Structured document extraction | Invoice images → JSON |
| Handwritten Text | 3,978 | Handwriting transcription | Handwritten images → text |
Example 1: Mathematical formula conversion
Using a complex handwritten mathematical equation, CernisOCR produces accurate LaTeX output. It correctly captures mathematical notation that would typically require specialized mathematical OCR tools, including fractions, integrals, and Greek letters. The unified approach means the same model can handle both the visual recognition and the LaTeX formatting in one step.
Input: Handwritten equation with complex notation
Output: \int_{0}^{\infty} \frac{\sin(x)}{x} dx = \frac{\pi}{2} ✓
Example 2: Handwritten text transcription
In this case, CernisOCR successfully transcribes cursive handwriting that would challenge traditional OCR systems. The model maintains reading order and handles various handwriting styles, from neat printing to flowing cursive script.
Input: Handwritten note "Today, however, that the Brussels Treaty"
Output: "Today, however, that the Brussels Treaty" ✓
Example 3: Invoice data extraction
This example shows where CernisOCR excels: extracting structured data from invoices and receipts. The model identifies key fields like vendor names, amounts, dates, and line items, outputting them in a structured JSON format suitable for downstream processing.
Input: Scanned invoice with mixed text and numbers
Output: {"vendor": "ABC Corp", "total": "$1,234.56", "date": "2024-09-12", "items": [...]} ✓
Try CernisOCR
We're excited to share it as an opensource model for anyone to try out. You can find instructions, example code, and model details in the HuggingFace README.
You can check out our document studio to help accurately transform your unstructured documents into structured, reliable data.
From training domain-specific state-of-the-art OCR models to enterprise-grade PII protection, our comprehensive document intelligence platform transforms unstructured data into actionable insights. Built for privacy, designed for scale, trusted by industry leaders.