Processing Pipeline

The pipeline orchestrates the full document processing flow: Extract → OCR → Process → Format.

Pipeline Flow

sequenceDiagram
    participant API as FastAPI
    participant SVC as DocumentService
    participant PIPE as ProcessingPipeline
    participant RT as ExtractorRouter
    participant EXT as Extractor
    participant OCR as OCR Engine
    participant PP as PostProcessor
    participant FMT as Formatter

    API->>SVC: process_document(file, options)
    SVC->>SVC: Detect MIME type
    SVC->>SVC: Create Document entity
    SVC->>PIPE: run(document, content, mime_type)

    PIPE->>RT: resolve(mime_type, preferred_engine)
    RT-->>PIPE: extractor

    PIPE->>EXT: extract(content, mime_type)
    EXT-->>PIPE: ExtractionResult

    alt Needs OCR (image or scanned PDF)
        PIPE->>OCR: recognize(content, language)
        OCR-->>PIPE: text
    end

    loop Each PostProcessor
        PIPE->>PP: process(text, context)
        PP-->>PIPE: processed_text
    end

    PIPE->>FMT: format(text, metadata)
    FMT-->>PIPE: ProcessingResult

    PIPE-->>SVC: Document (completed)
    SVC-->>API: DocumentResponse
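
The stage order in the diagram can be sketched in Python roughly as follows. This is a minimal illustration, not the project's implementation: only the method names (`resolve`, `extract`, `recognize`, `process`, `format`) come from the diagram. The `run_stages` function, the fields of `ExtractionResult`, the `needs_ocr` callback, the `"eng"` default language, and all stand-in classes are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional, Sequence


@dataclass
class ExtractionResult:
    """Hypothetical result shape; the diagram only names the type."""
    text: str
    metadata: dict = field(default_factory=dict)


def run_stages(
    content: bytes,
    mime_type: str,
    *,
    router,
    ocr,
    post_processors: Sequence,
    formatter,
    needs_ocr: Callable[[str, str], bool],
    preferred_engine: Optional[str] = None,
    language: str = "eng",
):
    """Run Extract -> OCR -> Process -> Format and return the formatter output."""
    # 1. Resolve an extractor for this MIME type, then extract.
    extractor = router.resolve(mime_type, preferred_engine)
    result = extractor.extract(content, mime_type)
    text = result.text

    # 2. Run OCR only when the decision function asks for it.
    if needs_ocr(mime_type, text):
        text = ocr.recognize(content, language)

    # 3. Each post-processor receives the previous one's output.
    context = {"mime_type": mime_type}
    for post_processor in post_processors:
        text = post_processor.process(text, context)

    # 4. Format the final text together with the extraction metadata.
    return formatter.format(text, result.metadata)


# --- Stand-in collaborators, for demonstration only ---
class EchoExtractor:
    def extract(self, content, mime_type):
        return ExtractionResult(text=content.decode("utf-8"), metadata={"mime": mime_type})


class SingleRouter:
    def resolve(self, mime_type, preferred_engine):
        return EchoExtractor()


class FakeOcr:
    def recognize(self, content, language):
        return "ocr text"


class StripWhitespace:
    def process(self, text, context):
        return text.strip()


class DictFormatter:
    def format(self, text, metadata):
        return {"text": text, "metadata": metadata}


output = run_stages(
    b"  hello  ",
    "text/plain",
    router=SingleRouter(),
    ocr=FakeOcr(),
    post_processors=[StripWhitespace()],
    formatter=DictFormatter(),
    needs_ocr=lambda mime_type, text: mime_type.startswith("image/"),
)
print(output)  # {'text': 'hello', 'metadata': {'mime': 'text/plain'}}
```

Passing the collaborators in as arguments keeps the orchestration testable with simple fakes, which is one plausible reason to separate the router, OCR engine, post-processors, and formatter in the first place.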

ExtractorRouter

The router selects the best extractor based on MIME type and user preference:

MIME Type	Default Extractor	Reason
`application/pdf`	PyMuPDF	10-50x faster than Tika, no network call
`application/msword`	Tika	Office format support
`image/*`	Tika → OCR	Fallback to OCR
Everything else	Tika	Universal support

Users can override the default with `extraction_engine=tika` or `extraction_engine=pymupdf`.
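
The routing table and the override can be sketched as a small lookup. This is an illustration under stated assumptions: the function name `resolve_engine`, returning an engine name rather than an extractor object, rejecting unknown engine names with `ValueError`, and letting an explicit override win unconditionally are all choices made for the example, not behavior confirmed by the source.

```python
from typing import Optional

# Engine names taken from the extraction_engine override values.
KNOWN_ENGINES = {"tika", "pymupdf"}


def resolve_engine(mime_type: str, preferred_engine: Optional[str] = None) -> str:
    """Return the extraction engine name for a MIME type, following the table."""
    if preferred_engine is not None:
        # Assumption: an explicit override always wins. A real router may also
        # check that the chosen engine supports the MIME type.
        if preferred_engine not in KNOWN_ENGINES:
            raise ValueError(f"unknown extraction engine: {preferred_engine!r}")
        return preferred_engine
    if mime_type == "application/pdf":
        return "pymupdf"
    # application/msword, image/*, and everything else default to Tika;
    # for images the OCR step runs afterwards.
    return "tika"


print(resolve_engine("application/pdf"))          # pymupdf
print(resolve_engine("application/pdf", "tika"))  # tika
print(resolve_engine("image/png"))                # tika
```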

OCR Decision Logic

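def _needs_ocr(mime_type: str, text: str, ocr_engine: str) -> bool:
    """Decide whether OCR should run on the document."""
    if ocr_engine == "none":
        return False  # OCR explicitly disabled
    if mime_type in IMAGE_TYPES:  # set of image MIME types (definition not shown)
        return True  # Images always go through OCR
    if mime_type == "application/pdf" and len(text.strip()) < 50:
        return True  # Scanned PDF: little or no extractable text
    return False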

Error Handling

The pipeline catches any exception raised during processing and marks the document as failed, recording the error message:

try:
    # Extract → OCR → Process → Format
    document.mark_completed(content, elapsed_ms)
except Exception as exc:
    document.mark_failed(str(exc))

Processing errors therefore never reach the API layer as unhandled exceptions; the caller sees a document in the failed state instead.
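
A fuller version of this pattern might look like the sketch below. The `Document` stub, its fields, and the `run_safely` helper are hypothetical; only the calls `mark_completed(content, elapsed_ms)` and `mark_failed(str(exc))` appear in the source. Note that `except Exception` leaves `KeyboardInterrupt` and `SystemExit` alone, because those derive from `BaseException` rather than `Exception`.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Document:
    """Hypothetical stand-in for the Document entity."""
    status: str = "pending"
    content: Optional[str] = None
    error: Optional[str] = None
    elapsed_ms: Optional[int] = None

    def mark_completed(self, content: str, elapsed_ms: int) -> None:
        self.status, self.content, self.elapsed_ms = "completed", content, elapsed_ms

    def mark_failed(self, error: str) -> None:
        self.status, self.error = "failed", error


def run_safely(document: Document, stages: Callable[[], str]) -> Document:
    """Run the pipeline stages and record success or failure on the document."""
    started = time.perf_counter()
    try:
        content = stages()  # Extract -> OCR -> Process -> Format
        elapsed_ms = int((time.perf_counter() - started) * 1000)
        document.mark_completed(content, elapsed_ms)
    except Exception as exc:  # KeyboardInterrupt/SystemExit are not caught
        document.mark_failed(str(exc))
    return document


def failing_stages() -> str:
    raise ValueError("unsupported file")


ok = run_safely(Document(), lambda: "formatted text")
failed = run_safely(Document(), failing_stages)
print(ok.status, failed.status, failed.error)  # completed failed unsupported file
```

Returning the document in both branches gives the caller a single code path: it inspects the status instead of wrapping the call in its own `try` block.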