September 15, 2025 · Announcement

Introducing Cernis-Thinking: A Multi-Task Vision Language Model for Document Understanding

We're excited to release Cernis-Thinking, an open-source vision language model trained with reinforcement learning to understand and reason about documents. Built on Qwen2.5-VL-7B and fine-tuned using GRPO (Group Relative Policy Optimization), Cernis-Thinking can handle mathematical reasoning, LaTeX OCR, invoice extraction, and handwriting transcription — all in a single 7B parameter model.

Why Another Document Model?

Earlier this year, we saw the Allen Institute's olmOCR — an impressive open-source document OCR model. Inspired by their work and projects like RolmOCR, we wanted to explore whether reinforcement learning could create a more versatile document understanding model that doesn't just extract text, but actually reasons about what it sees.

Traditional document models often struggle with:

  • Understanding context beyond pure OCR
  • Solving problems that require multi-step reasoning
  • Handling diverse document types without task-specific fine-tuning
  • Providing structured, parseable outputs

Cernis-Thinking addresses these challenges by learning from rewards rather than just mimicking training examples.

Key Approach: Reinforcement Learning with Smart Rewards

Unlike supervised fine-tuning, where models learn to copy examples, we used GRPO, a reinforcement learning technique that trains the model to maximize rewards. This lets us optimize directly for what matters: correct, well-structured outputs.
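
To make the training signal concrete, the sketch below shows the group-relative advantage at the core of GRPO: several completions are sampled for the same prompt, each is scored by the reward functions, and rewards are normalized within the group so that above-average completions are reinforced. This is a simplified illustration, not our trainer code.

import statistics

# Simplified GRPO-style advantage: for one prompt, sample a group of
# completions, score each with the reward functions, and normalize the
# rewards within the group.
def group_relative_advantages(rewards, eps=1e-6):
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four completions sampled for the same document prompt.
rewards = [2.1, 0.4, 1.7, 0.9]
advantages = group_relative_advantages(rewards)
# Completions that beat their group average get positive advantages and are
# reinforced; below-average completions are pushed down.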

Our Reward Functions

We designed three complementary reward functions:

1. Formatting Rewards

  • Rewards proper structure with <REASONING> and <SOLUTION> tags
  • Penalizes excessive artifacts
  • Makes outputs easily parseable downstream
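
A minimal sketch of what such a formatting reward can look like is shown below; the regular expressions and weights are illustrative choices, not the exact values used in training.

import re

# Illustrative formatting reward: reward well-formed <REASONING>/<SOLUTION>
# structure and penalize stray artifacts. The weights are placeholders.
def format_reward(completion: str) -> float:
    score = 0.0
    if re.search(r"<REASONING>.*?</REASONING>", completion, re.DOTALL):
        score += 0.5
    if re.search(r"<SOLUTION>.*?</SOLUTION>", completion, re.DOTALL):
        score += 0.5
    # Penalize leftovers such as markdown fences or duplicated solution tags.
    artifacts = completion.count("```") + max(0, completion.count("<SOLUTION>") - 1)
    return score - 0.1 * artifacts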

2. Task-Specific Correctness Rewards

  • Math: Exact numeric matching
  • LaTeX/Handwriting: Fuzzy string matching
  • Invoices: Partial credit for key fields
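
As a rough illustration of the per-task scoring, the sketch below uses exact matching for math, difflib-based fuzzy matching for LaTeX and handwriting, and per-field partial credit for invoices; the thresholds and field handling are assumptions rather than the exact reward we trained with.

from difflib import SequenceMatcher

# Illustrative per-task correctness rewards; tolerances and field handling
# are placeholders, not the values used in training.
def correctness_reward(task: str, prediction: str, target) -> float:
    if task == "math":
        # Exact numeric match.
        try:
            return 1.0 if abs(float(prediction) - float(target)) < 1e-6 else 0.0
        except ValueError:
            return 0.0
    if task in ("latex", "handwriting"):
        # Fuzzy string similarity in [0, 1].
        return SequenceMatcher(None, prediction.strip(), target.strip()).ratio()
    if task == "invoice":
        # target is assumed to be a dict of key fields,
        # e.g. {"Invoice No": "53553822", "Total": "$247.50"}.
        hits = sum(1 for value in target.values() if str(value) in prediction)
        return hits / max(1, len(target))
    return 0.0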

3. ROUGE-Style Word Overlap (for OCR tasks)

  • Prevents wasted training on gibberish
  • Provides feedback even when outputs aren't perfect
  • Accelerates learning on text-heavy tasks
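
The word-overlap term can be as simple as ROUGE-1 recall over whitespace tokens, as in the sketch below; a full ROUGE implementation also handles n-grams and stemming, which this toy version omits.

# Illustrative ROUGE-1-style overlap reward for OCR tasks: the fraction of
# reference words recovered by the prediction, giving partial credit even
# when the transcription is not exact.
def word_overlap_reward(prediction: str, reference: str) -> float:
    ref_words = reference.lower().split()
    pred_words = set(prediction.lower().split())
    if not ref_words:
        return 0.0
    return sum(1 for w in ref_words if w in pred_words) / len(ref_words)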

Training Details

Datasets (2,000 samples total)

We mixed four document understanding datasets:

Dataset                                  Samples   Task Type
AI4Math/MathVista                        ~500      Math word problems with images
unsloth/LaTeX_OCR                        ~500      Mathematical formula recognition
mychen76/invoices-and-receipts_ocr_v1    ~500      Invoice extraction
corto-ai/handwritten-text                ~500      Handwriting transcription
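
For readers who want to assemble a similar mixture, a sketch using the Hugging Face datasets library is below; the split names and the ~500-per-dataset cap are assumptions for illustration, not our exact preprocessing recipe.

from datasets import load_dataset

# Split names below are assumptions; check each dataset card for the
# available splits and column names.
SOURCES = {
    "AI4Math/MathVista": "testmini",
    "unsloth/LaTeX_OCR": "train",
    "mychen76/invoices-and-receipts_ocr_v1": "train",
    "corto-ai/handwritten-text": "train",
}

parts = []
for name, split in SOURCES.items():
    ds = load_dataset(name, split=split).shuffle(seed=42)
    parts.append(ds.select(range(min(500, len(ds)))))

# The four datasets use different column schemas, so a real pipeline would map
# each to a shared (image, prompt, answer) format before interleaving them.
print(sum(len(ds) for ds in parts))  # ~2,000 samples across the four tasks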

Results: Learning to Think

The training metrics tell a compelling story:

Step Range   Avg Reward   Best Reward   Observations
1-50         0.8-1.2      1.97          Early learning, inconsistent
51-150       1.2-1.6      2.25          Stable improvement
151-300      1.4-1.8      2.62          Strong multi-task performance
301-375      1.5-2.0      2.87          Consistent high performance

Example Outputs

Math Reasoning:

<REASONING>
In this parallelogram problem, I need to use:
1. Opposite sides are equal
2. The angle bisector creates a specific relationship
3. Using properties of triangles formed...
</REASONING>

<SOLUTION>
42
</SOLUTION>

LaTeX OCR:

<SOLUTION>
\frac{2}{3} < a^{2} \alpha^{2} \leq 1
</SOLUTION>

Invoice Extraction:

<SOLUTION>
Invoice No: 53553822
Date: 07/24/2012  
Vendor: Leo Brown
Total: $247.50
Tax ID: 926-74-9803
</SOLUTION>

Try Cernis-Thinking

We're releasing everything under Apache 2.0.
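
Because Cernis-Thinking is built on Qwen2.5-VL-7B, inference should follow the standard Qwen2.5-VL flow in transformers. The sketch below assumes that; the repository id cernis/cernis-thinking-7b and the image path are placeholders, not actual release links.

from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

# "cernis/cernis-thinking-7b" is a placeholder id; substitute the released
# checkpoint. The flow mirrors standard Qwen2.5-VL usage in transformers.
MODEL_ID = "cernis/cernis-thinking-7b"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(MODEL_ID, device_map="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "invoice.png"},  # placeholder image path
        {"type": "text", "text": "Extract the invoice number, date, vendor, and total."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
answer = processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:],
                                skip_special_tokens=True)[0]
print(answer)  # expected to contain <REASONING>...</REASONING> and <SOLUTION>...</SOLUTION>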

Why Open Source?

We believe the future of document understanding requires models that can:

  1. Reason about what they see, not just transcribe it
  2. Adapt to diverse document types without retraining
  3. Explain their outputs for trust and auditability

By open-sourcing Cernis-Thinking, we hope to inspire more research into RL-trained vision language models and practical document AI systems.