Dec 3, 2025 · Engineering

Day 3: Building a Document Processing Pipeline That Scales to 1M+ Documents/Day

This is Day 3 of a 7-part series on building Cernis Intelligence: Document Intelligence for the AI era.

Before processing our first production document, we asked: what would this system look like at 1M documents per day? Then we worked backwards. The answer lies in separating concerns, choosing components that scale independently, and understanding where bottlenecks emerge at different scales.


Architecture Overview

The pipeline consists of five layers, each with a well-defined responsibility:

Client
   │
   ▼
┌─────────────────────────────────────┐
│  API Layer (Cloud Run)              │
│  Stateless request handling         │
└──────┬──────────────────────────────┘
       │
       ▼
┌─────────────────────────────────────┐
│  Storage Layer (Cloud Storage)      │
│  Ephemeral document staging         │
└──────┬──────────────────────────────┘
       │
       ▼
┌─────────────────────────────────────┐
│  Queue Layer (Redis + RQ)           │
│  Job distribution & status          │
└──────┬──────────────────────────────┘
       │
       ▼
┌─────────────────────────────────────┐
│  Worker Layer (Compute Engine)      │
│  Job orchestration                  │
└──────┬──────────────────────────────┘
       │
       ▼
┌─────────────────────────────────────┐
│  Compute Layer (Modal GPUs)         │
│  OCR inference                      │
└─────────────────────────────────────┘

Each layer scales independently. We could increase worker capacity without modifying the API, scale GPU compute without provisioning additional workers, and rely on the queue layer to absorb traffic spikes naturally.

  • API requests scale with user concurrency
  • Job queuing scales with throughput
  • Worker orchestration scales with processing capacity
  • GPU inference scales with compute demand

Layer 1: API Layer (Cloud Run)

Design goal: Accept documents fast, return immediately, never block on processing.

The API layer handles document uploads and job status queries. We made a deliberate architectural decision to keep this layer stateless and fast.

When a client uploads a document, the API performs a minimal set of operations: persist the file to Cloud Storage, create database records for the document and processing job, enqueue the job identifier to Redis, and return immediately with a job ID.

The asynchronous design is crucial. The API never waits for processing to complete. This decouples request latency from processing time, allowing the API to maintain consistent response times even as our OCR models evolve and processing times vary. Cloud Run's autoscaling handles request spikes naturally.
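
A minimal sketch of that upload path, assuming FastAPI on Cloud Run, the google-cloud-storage client, and RQ. The bucket name, Redis host, queue name, and the create_document_and_job helper are illustrative placeholders, not our exact code:

import uuid

from fastapi import FastAPI, UploadFile
from google.cloud import storage
from redis import Redis
from rq import Queue

app = FastAPI()
gcs = storage.Client()
queue = Queue("documents", connection=Redis(host="10.0.0.3"))  # Memorystore IP is a placeholder

DOCS_BUCKET = "cernis-staging"  # hypothetical bucket name


@app.post("/v1/documents")
async def upload_document(file: UploadFile):
    job_id = str(uuid.uuid4())

    # 1. Persist the raw file to Cloud Storage (ephemeral staging).
    blob = gcs.bucket(DOCS_BUCKET).blob(f"uploads/{job_id}/{file.filename}")
    blob.upload_from_string(await file.read())

    # 2. Create Document and ProcessingJob rows (helper elided here).
    # create_document_and_job(job_id, file.filename)

    # 3. Enqueue only the job identifier; the payload stays in Cloud Storage.
    queue.enqueue("worker.process_document_job", job_id)

    # 4. Return immediately; the client polls for status using the job ID.
    return {"job_id": job_id, "status": "queued"}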


Layer 2: Storage Layer (Cloud Storage)

Design goal: Stage documents between upload and processing, then delete them, in keeping with our privacy-first approach.

Cloud Storage serves as the intermediate storage layer between upload and processing. The API writes uploaded documents here, workers read them for processing, and workers delete them upon completion.

We chose Cloud Storage for its operational simplicity and proven scalability. At 1M documents per day, the system handles approximately 11 uploads, 11 downloads, and 11 deletes per second, well within Cloud Storage's baseline per-bucket capacity of roughly 1,000 object writes and 5,000 object reads per second.

The storage layer introduces no scaling constraints. Files are ephemeral by design, deleted immediately after processing completes, which keeps storage costs constant and eliminates any concerns about unbounded growth.


Layer 3: Queue Layer (Redis + RQ)

Design goal: Distribute work, track status, absorb traffic spikes.

The queue layer distributes work to workers and tracks job state. We use Redis Queue (RQ) on top of GCP's Memorystore for Redis. Memorystore gives us sub-millisecond data access, scalability, and high availability, while RQ provides the job management primitives we need: enqueue, dequeue, retry logic, and failure handling.

Each job in the queue is simply a reference to a database record: process_document_job(job_id). This keeps the queue lightweight and allows us to handle millions of jobs without significant memory pressure, with the queue depth serving as a real-time signal for scaling decisions.
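
Because each entry is only an identifier, inspecting the backlog stays cheap. A small sketch of reading queue depth as a scaling signal, with the queue name, Redis host, and threshold as made-up examples:

from redis import Redis
from rq import Queue

queue = Queue("documents", connection=Redis(host="10.0.0.3"))  # Memorystore IP is a placeholder

# Queue depth doubles as a real-time autoscaling signal for the worker pool.
backlog = len(queue)
if backlog > 1_000:  # threshold is an illustrative example
    print(f"{backlog} jobs pending; scale the worker pool up")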

For our target workload at 1M documents per day, even a basic 1GB Redis instance provides significant headroom.


Layer 4: Worker Layer (Compute Engine)

Design goal: Orchestrate processing, scale horizontally without coordination.

Workers orchestrate the processing pipeline. They dequeue jobs from Redis, retrieve documents from Cloud Storage, invoke OCR processing on Modal, aggregate results, persist to the database, and clean up temporary files.

We designed workers to be completely stateless. Each worker processes one job at a time without any coordination with other workers. This design choice enables straightforward horizontal scaling; we can add or remove workers without any state migration or coordination overhead.
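
A simplified sketch of what a worker job can look like under that design, assuming the google-cloud-storage client and a deployed Modal function. The bucket name, the app and function names, and the save_results helper are illustrative assumptions:

import modal
from google.cloud import storage

DOCS_BUCKET = "cernis-staging"  # hypothetical bucket name


def process_document_job(job_id: str) -> None:
    gcs = storage.Client()
    bucket = gcs.bucket(DOCS_BUCKET)

    # 1. Fetch the staged document from Cloud Storage.
    blob = next(iter(bucket.list_blobs(prefix=f"uploads/{job_id}/")))
    payload = blob.download_as_bytes()

    # 2. Invoke the GPU-backed OCR function running on Modal.
    ocr = modal.Function.from_name("cernis-ocr", "ocr_fast")
    result = ocr.remote(payload)

    # 3. Persist results to the database (helper elided here).
    # save_results(job_id, result)

    # 4. Delete the staged file immediately; nothing is retained after processing.
    blob.delete()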


Layer 5: Compute Layer (Modal GPUs)

Design goal: Run OCR models cost-effectively across heterogeneous workloads.

The compute layer handles the most computationally intensive and cost-significant component of our pipeline.

This is where the interesting engineering happens. Our customers have fundamentally different requirements:

  • Batch processing: Cost-sensitive, latency-tolerant. Process overnight, minimize spend.
  • Interactive apps: Latency-sensitive. Sub-second response for real-time extraction.
  • High-accuracy: Accuracy-sensitive. Full bounding boxes, layout understanding, and maximum quality.

Solution: Tiered OCR Architecture

We implemented a three-tier OCR architecture on Modal that allows customers to explicitly choose their speed/accuracy/cost tradeoff:

Fast Tier: Optimized for high-volume batch processing. Handles basic text extraction efficiently without bounding box detection.

Moderate Tier: Provides balanced performance with support for bounding boxes and layout understanding.

Premium Tier: Delivers maximum accuracy with full support for line-level detection, word-level detection, and structured extraction.

Each tier runs as an independent Modal Function with dedicated GPU allocation. This isolation ensures that demand spikes in one tier (for example, a large batch job on the fast tier) do not impact the latency or availability of other tiers (such as interactive requests on the premium tier).
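
A hedged sketch of how that isolation can be expressed on Modal, with each tier deployed as its own function and GPU allocation. The GPU types, image contents, and function bodies are illustrative assumptions rather than our production configuration:

import modal

app = modal.App("cernis-ocr")
image = modal.Image.debian_slim().pip_install("vllm", "pillow")


@app.function(image=image, gpu="T4", timeout=600)
def ocr_fast(page_bytes: bytes) -> str:
    """High-volume text extraction, no bounding boxes."""
    ...


@app.function(image=image, gpu="A10G", timeout=600)
def ocr_moderate(page_bytes: bytes) -> dict:
    """Balanced tier: text plus bounding boxes and layout."""
    ...


@app.function(image=image, gpu="A100", timeout=900)
def ocr_premium(page_bytes: bytes) -> dict:
    """Maximum accuracy: line- and word-level detection, structured extraction."""
    ...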

Implementation Details

Persistent Caching with Modal Volumes: We cache model weights and vLLM compilation artifacts in persistent Modal Volumes, significantly reducing cold start times. We're evaluating GPU Memory Snapshot as a future optimization to further reduce initialization latency.

Fast Boot Configuration: For latency-sensitive workloads, we enable eager mode execution, trading some runtime performance for faster cold starts. Combined with volume caching, we achieve cold boot times under 30 seconds even for our largest vision-language models (8B parameters).
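
A sketch of those two optimizations together, assuming Modal Volumes for weight caching and vLLM's eager mode. The volume name, model path, and GPU choice are placeholders:

import modal

app = modal.App("cernis-ocr")
image = modal.Image.debian_slim().pip_install("vllm")
weights = modal.Volume.from_name("model-weights", create_if_missing=True)


@app.cls(image=image, gpu="A100", volumes={"/models": weights})
class PremiumOCR:
    @modal.enter()
    def load_model(self):
        from vllm import LLM

        # Weights load from the Volume, so cold starts skip the download step.
        self.llm = LLM(
            model="/models/vlm-8b",
            enforce_eager=True,  # skip CUDA graph capture: faster boot, some runtime cost
        )

    @modal.method()
    def ocr(self, page_bytes: bytes) -> dict:
        ...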

Concurrent Page Processing: Workers convert multi-page PDFs to images and submit all pages concurrently to Modal. Modal's autoscaling provisions GPU instances as needed, processes pages in parallel, and returns results. Workers then aggregate outputs into the requested format (Markdown, HTML with bounding boxes, or JSON).
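
A simplified sketch of that fan-out, assuming pdf2image for rasterization and Modal's .map() for parallel GPU execution; the app and function names reuse the illustrative tier sketch above:

import io

import modal
from pdf2image import convert_from_path  # requires the poppler system package


def ocr_pdf(path: str) -> list[str]:
    # 1. Rasterize every page of the PDF into an image payload.
    payloads = []
    for page in convert_from_path(path, dpi=200):
        buf = io.BytesIO()
        page.save(buf, format="PNG")
        payloads.append(buf.getvalue())

    # 2. Submit all pages at once; Modal provisions GPU containers as needed.
    ocr = modal.Function.from_name("cernis-ocr", "ocr_fast")
    per_page_text = list(ocr.map(payloads))

    # 3. Aggregate per-page outputs into the requested format (plain text here).
    return per_page_text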

Modal's elastic infrastructure allows our tiered pipeline to comfortably scale to process over 1 million pages per day.


Database Layer (Cloud SQL)

Design goal: Maintain state reliably, scale vertically as long as possible.

The database maintains application state across two tables: Documents and ProcessingJobs. We never store document contents; this layer only tracks metadata such as title, size, and job status. At 1M documents per day, this translates to approximately 46 writes per second.
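
For illustration, the two tables might look roughly like this in SQLAlchemy; the column names and types are assumptions based on the description above, not our exact schema:

from datetime import datetime

from sqlalchemy import DateTime, ForeignKey, Integer, String
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class Document(Base):
    __tablename__ = "documents"

    id: Mapped[str] = mapped_column(String, primary_key=True)
    title: Mapped[str] = mapped_column(String)
    size_bytes: Mapped[int] = mapped_column(Integer)
    created_at: Mapped[datetime] = mapped_column(DateTime, default=datetime.utcnow)


class ProcessingJob(Base):
    __tablename__ = "processing_jobs"

    id: Mapped[str] = mapped_column(String, primary_key=True)
    document_id: Mapped[str] = mapped_column(ForeignKey("documents.id"))
    status: Mapped[str] = mapped_column(String, default="queued")  # queued / processing / done / failed
    tier: Mapped[str] = mapped_column(String, default="fast")
    updated_at: Mapped[datetime] = mapped_column(DateTime, default=datetime.utcnow)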


Architectural Design Principles

Several design principles enabled this architecture to scale smoothly:

Stateless Workers: Workers maintain no persistent state. They process jobs independently without coordination, enabling trivial horizontal scaling.

Asynchronous Processing: The API returns immediately after enqueuing work. Processing happens asynchronously, with clients polling for results. This design decouples request handling from processing latency.

Component Isolation: Each layer has a single, well-defined responsibility with clear interfaces. This separation enables each component to scale independently based on its specific constraints.

Pay-Per-Use Compute: GPU compute, our most expensive component, scales automatically with demand and costs nothing during idle periods. This economic model remains viable across a wide range of workload scales.


Conclusion

Scaling document processing infrastructure to 1M+ documents per day requires careful attention to component boundaries, independent scalability, and bottleneck identification.

The key decisions that enabled our scaling were: choosing a stateless worker architecture for trivial horizontal scaling, implementing asynchronous processing to decouple request handling from processing latency, isolating components with clear responsibilities, and leveraging pay-per-use GPU compute to maintain economic viability across scales.

Tomorrow, we will talk about Agentic OCR: Beyond Traditional Document Processing.