Building Cernis-Legal-OCR: A Specialized Vision Model for Legal Document Processing
Legal documents are notorious for breaking OCR systems. Court opinions, filings, and case law often come with dense text, varied formatting, and artifacts from scanning or photocopying. In this post, we'll walk through our end-to-end experiment where we fine-tuned a multimodal model on 5,000 synthetic legal documents, starting from raw case law text and ending with a working OCR system.
We built Cernis-Legal-OCR to tackle these challenges head-on — a specialized vision-language model fine-tuned specifically for legal document processing, built on Qwen2.5-VL-7B and trained on synthetic legal documents generated from the Caselaw Access Project dataset.
The Legal Document Challenge: More Than Just Text Recognition
Legal documents aren't just text on a page — they're structured semantic artifacts with specific formatting conventions and legal significance. Consider what makes them so challenging:
📄 Scan Quality Chaos: Legal documents span centuries, from pristine PDFs to barely-legible fax copies of photocopied originals. A single contract might contain clean typed text alongside handwritten amendments and stamped seals.
🏛️ Structural Complexity: Understanding that "WHEREAS" introduces a recital clause while "NOW, THEREFORE" signals operative provisions isn't just helpful — it's essential for downstream legal AI applications.
✍️ Mixed Content Types: Margin notes from attorneys, court stamps, signatures, and formal legal text all coexist on the same page, each requiring different processing approaches.
🔍 Precision Requirements: In legal contexts, OCR errors aren't just inconvenient — they can be legally significant. Misreading "shall" as "should" or getting a date wrong can have serious consequences.
Standard OCR solutions like Tesseract or even cloud APIs struggle with this complexity, often producing garbled output or missing critical structural information that legal professionals need.
Our Approach: Synthetic Data Generation from Real Legal Text
Rather than trying to collect and annotate thousands of scanned legal documents (which would be expensive for our small team and potentially problematic from a privacy standpoint), we took a different approach: synthetic document generation.
We started with the Caselaw Access Project dataset — 6.7 million cases from US federal and state courts spanning 365 years. This gave us access to authentic legal language, terminology, and document structures, all in the public domain.
Creating Realistic Legal Document Images
Our pipeline transforms raw legal text into realistic scanned document images:
🖨️ Document Rendering: We render legal text using authentic legal document formatting — proper margins, standard legal fonts (Times New Roman, Courier), and structural elements like case headers and clause numbering.
📸 Realistic Degradation: The magic happens in the augmentation layer. We simulate real-world scanning conditions:
- Scan artifacts: Noise, contrast variations, and compression artifacts
- Physical deterioration: Age spots, yellowing, and stains
- Photocopying effects: Speckles, uneven contrast, and quality loss
- Fax transmission: Horizontal lines and resolution degradation
- Document skew: Slight rotations from imperfect scanning
⚖️ Legal-Specific Elements: We add realistic legal document features:
- Court stamps ("FILED", "COPY", "CONFIDENTIAL")
- Handwritten margin notes and annotations
- Attorney signatures and seals
- Multi-column layouts and footnotes
This approach let us generate thousands of training samples that look and feel like real scanned legal documents while maintaining perfect ground truth for the underlying text.
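To make the degradation step more concrete, here is a minimal sketch of three of the artifacts listed above (reduced contrast, fax-style horizontal lines, and speckle). It is not the pipeline we actually ran: the function names, parameter values, and the plain list-of-rows image representation are all assumptions chosen to keep the example dependency-free, and skew, stains, and stamps are left out. In practice an imaging library would do this work. The point it illustrates is that the image changes while the ground-truth text passes through untouched.

```python
import random

Image = list[list[int]]  # grayscale rows: 0 = black ink, 255 = white paper


def reduce_contrast(img: Image, factor: float) -> Image:
    """Pull every pixel toward mid-gray (128); factor=1.0 leaves the image unchanged."""
    return [[round(128 + (p - 128) * factor) for p in row] for row in img]


def add_fax_lines(img: Image, every: int, darkness: int = 60) -> Image:
    """Fax-style artifact: darken every `every`-th row by a fixed amount."""
    return [
        [max(0, p - darkness) for p in row] if y % every == 0 else row[:]
        for y, row in enumerate(img)
    ]


def add_speckle(img: Image, rate: float, rng: random.Random) -> Image:
    """Photocopy-style speckle: set a random fraction of pixels to pure black or white."""
    out = [row[:] for row in img]
    for row in out:
        for x in range(len(row)):
            if rng.random() < rate:
                row[x] = rng.choice((0, 255))
    return out


def degrade(img: Image, text: str, seed: int = 0) -> tuple[Image, str]:
    """Apply the artifacts in sequence and return the image with its unchanged label."""
    rng = random.Random(seed)  # seeded so a sample can be regenerated exactly
    img = reduce_contrast(img, factor=0.8)
    img = add_fax_lines(img, every=8)
    img = add_speckle(img, rate=0.02, rng=rng)
    return img, text


# Illustrative input: a blank 64x32 "page" and a made-up label.
page = [[255] * 64 for _ in range(32)]
degraded, label = degrade(page, "NOW, THEREFORE, the parties agree as follows")
```

Seeding the random generator per sample is a design choice worth keeping in a real pipeline: it makes every degraded image reproducible from its source text and seed.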
Model Architecture: Building on Qwen2.5-VL
We chose Qwen2.5-VL-7B-Instruct as our foundation model for several reasons:
- Strong vision-language capabilities: Proven performance on document understanding tasks
- Efficient architecture: 7B parameters strike a good balance between capability and deployment cost
- Fine-tuning friendly: Works well with LoRA (Low-Rank Adaptation) for parameter-efficient training
Our fine-tuning setup used LoRA with these configurations:
- Rank (r): 16
- Alpha: 16
- Target modules: All linear layers in both vision and language components
- Trainable parameters: 51.5M out of 8.3B total (0.62%)
This approach keeps training costs low while achieving strong performance on legal document OCR tasks.
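The numbers above can be sanity-checked with a little arithmetic. The sketch below uses the standard LoRA formulation, in which each adapted linear layer gains two small matrices of rank r and the update is scaled by alpha / r. The rank, alpha, and parameter counts come from the list above; the 4096 x 4096 layer size is a hypothetical value used only to illustrate the formula, not a dimension taken from the model.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer of shape (d_out, d_in).

    LoRA learns A (r x d_in) and B (d_out x r); the original weight stays frozen.
    """
    return r * (d_in + d_out)


def lora_scale(alpha: float, r: int) -> float:
    """Factor applied to the low-rank update B @ A in standard LoRA."""
    return alpha / r


# Values reported in this post.
RANK, ALPHA = 16, 16
TRAINABLE, TOTAL = 51.5e6, 8.3e9

# Hypothetical layer size, chosen only to illustrate the formula.
D_IN = D_OUT = 4096

frozen = D_IN * D_OUT
adapter = lora_params(D_IN, D_OUT, RANK)
print(f"one {D_OUT}x{D_IN} layer: {adapter:,} adapter params vs {frozen:,} frozen")
print(f"update scale alpha/r = {lora_scale(ALPHA, RANK)}")
print(f"trainable share = {TRAINABLE / TOTAL:.2%}")
```

With alpha equal to the rank, the scale factor is 1, so the low-rank update is added to the frozen weights without extra amplification or damping.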
Training Results: Fast Convergence on Legal Documents
The model converged quickly during training. This suggests that the synthetic training data captured the patterns needed for legal document OCR, and that the Qwen2.5-VL foundation provided a strong starting point for vision-language understanding.
Model Comparison: Performance on Legal Documents
We tested Cernis-Legal-OCR against standard OCR approaches on various legal document types:
Example 1: Court Filing with Stamps
On a court document with multiple official stamps and handwritten docket numbers, Cernis-Legal-OCR correctly identified the case number, court information, and filing date while preserving the reading order. Generic OCR engines often struggle with the overlapping stamps and varied text orientations.
Example 2: Contract with Margin Annotations
For a contract with handwritten margin notes and amendments, our model successfully separated the original printed text from the annotations, maintaining the legal document's structure while capturing all textual content.
Example 3: Historical Court Opinion
On a scanned 1970s court opinion with aged paper and inconsistent contrast, Cernis-Legal-OCR demonstrated robust performance despite the poor image quality, correctly extracting case citations and legal reasoning that would be crucial for legal research applications.
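Qualitative comparisons like these are usually backed by a metric. One common choice for OCR is character error rate (CER): the edit distance between the model output and the ground truth, divided by the length of the ground truth. The sketch below is a generic implementation offered for readers who want to run their own comparisons; it is an assumption that CER suits your use case, and the example sentences are made up.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn `a` into `b`."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if the characters match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of `hypothesis` against a non-empty `reference`."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Made-up example: a single misread word in an otherwise perfect line.
reference = "The tenant shall vacate the premises."
hypothesis = "The tenant should vacate the premises."
print(f"CER = {cer(reference, hypothesis):.3f}")
```

Note that a low CER does not guarantee a harmless error: the example differs by only three character edits, yet "shall" and "should" carry different legal weight, so spot checks on legally significant terms remain worthwhile.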
Real-World Applications
Cernis-Legal-OCR enables several practical applications:
⚖️ Legal Research: Converting scanned case law into searchable, structured text for legal databases
📋 Contract Analysis: Extracting key terms, dates, and clauses from scanned contracts for automated review
🏛️ Court Record Digitization: Processing historical court filings and maintaining their legal significance
📝 Due Diligence: Rapidly processing large volumes of legal documents in M&A transactions
Try Cernis-Legal-OCR
We're releasing Cernis-Legal-OCR under an open-source license for the legal tech community. The model is available on Hugging Face.
Key Takeaways
Building domain-specific OCR models doesn't require massive datasets of annotated images. We combined four ingredients:
- Public domain text data (Caselaw Access Project)
- Realistic synthetic image generation
- Modern vision-language models (Qwen2.5-VL)
- Parameter-efficient fine-tuning (LoRA)
The result is a specialized legal OCR model that, in our tests, outperformed generic solutions on legal documents while keeping development costs reasonable.
For legal tech teams building document processing pipelines, this approach offers a path to high-quality, domain-specific OCR without the expense and complexity of traditional annotation workflows.
Cernis-Legal-OCR was trained on synthetic data generated from the Caselaw Access Project. The model and training pipeline are available for research and commercial use.