Processing Pipeline
The pipeline orchestrates the full document-processing flow: Extract → OCR (only when needed) → Process → Format.
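The four stages can be sketched as a chain of plain callables. This is a minimal illustration, not the project's implementation: `StagePipeline`, its field names, and the lambdas used to wire it up are all hypothetical stand-ins for the real extractor, OCR engine, post-processors, and formatter.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class StagePipeline:
    """Hypothetical stand-in for the pipeline; every stage is a plain callable."""

    extract: Callable[[bytes, str], str]
    needs_ocr: Callable[[str, str], bool]
    ocr: Callable[[bytes], str]
    formatter: Callable[[str], dict]
    post_processors: List[Callable[[str], str]] = field(default_factory=list)

    def run(self, content: bytes, mime_type: str) -> dict:
        text = self.extract(content, mime_type)    # Extract
        if self.needs_ocr(mime_type, text):        # OCR, only when required
            text = self.ocr(content)
        for post_process in self.post_processors:  # Process, in order
            text = post_process(text)
        return self.formatter(text)                # Format


# Assumed toy stages, chosen only to make the sketch runnable.
pipeline = StagePipeline(
    extract=lambda content, mime_type: content.decode("utf-8"),
    needs_ocr=lambda mime_type, text: mime_type.startswith("image/"),
    ocr=lambda content: "text recovered by ocr",
    formatter=lambda text: {"text": text, "length": len(text)},
    post_processors=[str.strip, str.lower],
)

result = pipeline.run(b"  Hello World  ", "text/plain")
```

Each post-processor receives the output of the previous one, so their order matters: here the text is stripped first and lowercased second.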
Pipeline Flow
sequenceDiagram
participant API as FastAPI
participant SVC as DocumentService
participant PIPE as ProcessingPipeline
participant RT as ExtractorRouter
participant EXT as Extractor
participant OCR as OCR Engine
participant PP as PostProcessor
participant FMT as Formatter
API->>SVC: process_document(file, options)
SVC->>SVC: Detect MIME type
SVC->>SVC: Create Document entity
SVC->>PIPE: run(document, content, mime_type)
PIPE->>RT: resolve(mime_type, preferred_engine)
RT-->>PIPE: extractor
PIPE->>EXT: extract(content, mime_type)
EXT-->>PIPE: ExtractionResult
alt Needs OCR (image or scanned PDF)
PIPE->>OCR: recognize(content, language)
OCR-->>PIPE: text
end
loop Each PostProcessor
PIPE->>PP: process(text, context)
PP-->>PIPE: processed_text
end
PIPE->>FMT: format(text, metadata)
FMT-->>PIPE: ProcessingResult
PIPE-->>SVC: Document (completed)
SVC-->>API: DocumentResponse
ExtractorRouter
The router selects the best extractor based on MIME type and user preference:
| MIME Type | Default Extractor | Reason |
|---|---|---|
| application/pdf | PyMuPDF | 10-50x faster, no network |
| application/msword | Tika | Office format support |
| image/* | Tika → OCR | Fallback to OCR |
| Everything else | Tika | Universal support |
Users can override the default with extraction_engine=tika or extraction_engine=pymupdf.
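The selection rule in the table can be sketched as a lookup with a fallback. This is an illustration under stated assumptions: only the `resolve(mime_type, preferred_engine)` signature comes from the sequence diagram, while the constant names, the use of plain strings for extractors, and the `ValueError` on an unknown engine are hypothetical.

```python
from typing import Optional

# Defaults taken from the table above; images fall through to Tika,
# and the OCR fallback is decided later in the pipeline.
DEFAULT_EXTRACTORS = {
    "application/pdf": "pymupdf",
    "application/msword": "tika",
}
FALLBACK_EXTRACTOR = "tika"
KNOWN_ENGINES = {"tika", "pymupdf"}


def resolve(mime_type: str, preferred_engine: Optional[str] = None) -> str:
    """Return the extractor name for a MIME type, honoring a user override."""
    if preferred_engine is not None:
        # Assumption: an unknown engine name is rejected rather than ignored.
        if preferred_engine not in KNOWN_ENGINES:
            raise ValueError(f"unknown extraction engine: {preferred_engine!r}")
        return preferred_engine
    return DEFAULT_EXTRACTORS.get(mime_type, FALLBACK_EXTRACTOR)


default_pdf = resolve("application/pdf")
overridden_pdf = resolve("application/pdf", preferred_engine="tika")
```

Whether the real router accepts a pymupdf override for non-PDF input is not stated here; this sketch simply returns whatever known engine the caller asks for.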
OCR Decision Logic
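def _needs_ocr(mime_type, text, ocr_engine):
    """Return True when the document must go through OCR."""
    if ocr_engine == "none":
        return False  # OCR explicitly disabled
    if mime_type in IMAGE_TYPES:
        return True  # Images always need OCR
    if mime_type == "application/pdf" and len(text.strip()) < 50:
        return True  # Scanned PDF: almost no extractable text
    return False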
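The boundary cases of this check are easier to see in a self-contained version. The sketch below mirrors the logic above, but the contents of `IMAGE_TYPES`, the constant name `MIN_PDF_TEXT_CHARS`, and the engine name `"tesseract"` are assumptions made for illustration; the source only shows the value `"none"`.

```python
# Assumption: the real IMAGE_TYPES set is not shown in this document.
IMAGE_TYPES = frozenset({"image/png", "image/jpeg", "image/tiff"})
MIN_PDF_TEXT_CHARS = 50  # same threshold as the check above


def needs_ocr(mime_type: str, text: str, ocr_engine: str) -> bool:
    if ocr_engine == "none":
        return False
    if mime_type in IMAGE_TYPES:
        return True
    # A PDF with almost no extractable text is treated as scanned.
    return mime_type == "application/pdf" and len(text.strip()) < MIN_PDF_TEXT_CHARS


# "tesseract" is a placeholder for any engine value other than "none".
image_case = needs_ocr("image/png", "", "tesseract")
disabled_case = needs_ocr("image/png", "", "none")
scanned_pdf_case = needs_ocr("application/pdf", "  \n ", "tesseract")
text_pdf_case = needs_ocr("application/pdf", "x" * 50, "tesseract")
```

The comparison is strict, so a PDF with exactly 50 non-whitespace-trimmed characters is treated as a text PDF, and whitespace-only output counts as empty because of `strip()`.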
Error Handling
The pipeline wraps the whole run in a single try/except; any Exception marks the document as failed:
try:
    # Extract → Process → Format
    document.mark_completed(content, elapsed_ms)
except Exception as exc:
    document.mark_failed(str(exc))
No unhandled exceptions propagate to the API layer.
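A runnable sketch of this pattern follows. The `Document` class here is a hypothetical minimal entity: only `mark_completed(content, elapsed_ms)` and `mark_failed(message)` come from the snippet above, while the field names, the status strings, and the `run` helper are assumptions.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Document:
    """Hypothetical minimal entity; the real Document likely has more fields."""

    status: str = "pending"
    content: Optional[str] = None
    error: Optional[str] = None
    elapsed_ms: Optional[float] = None

    def mark_completed(self, content: str, elapsed_ms: float) -> None:
        self.status, self.content, self.elapsed_ms = "completed", content, elapsed_ms

    def mark_failed(self, error: str) -> None:
        self.status, self.error = "failed", error


def run(document: Document, work: Callable[[], str]) -> Document:
    """Run `work` and record the outcome on the document instead of raising."""
    started = time.perf_counter()
    try:
        content = work()
        document.mark_completed(content, (time.perf_counter() - started) * 1000)
    except Exception as exc:
        # BaseException subclasses such as KeyboardInterrupt still propagate.
        document.mark_failed(str(exc))
    return document


def broken_extractor() -> str:
    raise ValueError("corrupt PDF header")


ok = run(Document(), lambda: "extracted text")
failed = run(Document(), broken_extractor)
```

Because the failure is stored on the document, the caller checks `status` and `error` rather than wrapping the call in its own try/except.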