Architecture Overview
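DocFlow follows a Hexagonal Architecture (Ports & Adapters) so that infrastructure technologies (extractors, OCR engines, formatters, storage) can be swapped by selecting a different adapter, without changes to the domain or application code.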
Design Principles
| Principle | Implementation |
|---|---|
| Dependency Inversion | The domain knows no infrastructure, only Ports (ABCs) |
| Single Responsibility | Each adapter has exactly one responsibility |
| Open/Closed | New extractors and OCR engines are added as plugins, with no core changes |
| Configuration over Code | Adapters are selected via environment variables |
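
As a minimal sketch of how Dependency Inversion and Configuration over Code can fit together, the example below defines one port as an ABC, two adapters, and a factory that picks the adapter from an environment variable. The names `TextExtractorPort`, `Utf8Extractor`, `Latin1Extractor`, `build_extractor`, and the variable `DOCFLOW_EXTRACTOR` are assumptions made for illustration; they are not taken from the DocFlow code base.

```python
import os
from abc import ABC, abstractmethod
from typing import Mapping, Optional


class TextExtractorPort(ABC):
    """Hypothetical port: what the domain expects from any extractor."""

    @abstractmethod
    def extract(self, data: bytes) -> str:
        """Return the raw text contained in `data`."""


class Utf8Extractor(TextExtractorPort):
    """Illustrative adapter: decodes the bytes as UTF-8."""

    def extract(self, data: bytes) -> str:
        return data.decode("utf-8", errors="replace")


class Latin1Extractor(TextExtractorPort):
    """Second illustrative adapter, added without touching the port."""

    def extract(self, data: bytes) -> str:
        return data.decode("latin-1")


# Registry of available adapters. Adding a technology means adding an entry
# here; code that depends on TextExtractorPort stays unchanged.
EXTRACTORS = {"utf8": Utf8Extractor, "latin1": Latin1Extractor}


def build_extractor(env: Optional[Mapping[str, str]] = None) -> TextExtractorPort:
    """Select an adapter by name; DOCFLOW_EXTRACTOR is an assumed variable."""
    env = os.environ if env is None else env
    name = env.get("DOCFLOW_EXTRACTOR", "utf8")
    if name not in EXTRACTORS:
        raise ValueError(
            f"Unknown extractor {name!r}; expected one of {sorted(EXTRACTORS)}"
        )
    return EXTRACTORS[name]()
```

Callers only ever see `TextExtractorPort`, so switching from one adapter to the other is a deployment setting rather than a code change. Passing `env` explicitly keeps the factory easy to test without mutating the process environment.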
Bounded Contexts
graph LR
A[Ingestion] --> B[Extraction]
B --> C[Processing]
C --> D[Delivery]
| Context | Responsibility |
|---|---|
| Ingestion | File upload, validation, MIME detection, metadata |
| Extraction | Raw text extraction from documents |
| Processing | Cleanup, structuring, optional LLM enrichment |
| Delivery | Formatting (Markdown/JSON/Text), response |
Layer Architecture
graph TB
subgraph "Driving Adapters"
API[FastAPI REST API]
CLI[CLI Tool]
end
subgraph "Application Layer"
SVC[DocumentService]
PIPE[ProcessingPipeline]
ROUTER[ExtractorRouter]
end
subgraph "Domain Core"
MOD[Document Model]
PORTS[Port Interfaces]
end
subgraph "Driven Adapters"
TIKA[Tika Adapter]
PDF[PyMuPDF Adapter]
OCR[Tesseract Adapter]
FMT[Markdown Formatter]
STORE[Local Storage]
end
API --> SVC
CLI --> SVC
SVC --> PIPE
PIPE --> ROUTER
ROUTER --> PORTS
PORTS -.-> TIKA
PORTS -.-> PDF
PORTS -.-> OCR
PORTS -.-> FMT
PORTS -.-> STORE
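
The wiring in the diagram can be sketched in a few classes. The class names `DocumentService`, `ProcessingPipeline`, and `ExtractorRouter` come from the diagram, but their methods (`convert`, `run`, `route`), the `ExtractorPort` interface, and the two stub adapters are assumptions for illustration. The stubs deliberately avoid calling Tika or PyMuPDF.

```python
from abc import ABC, abstractmethod
from typing import Sequence


class ExtractorPort(ABC):
    """Hypothetical port that the application layer depends on."""

    @abstractmethod
    def supports(self, mime_type: str) -> bool: ...

    @abstractmethod
    def extract(self, data: bytes) -> str: ...


class ExtractorRouter:
    """Picks the first registered extractor that supports a MIME type."""

    def __init__(self, extractors: Sequence[ExtractorPort]) -> None:
        self._extractors = list(extractors)

    def route(self, mime_type: str) -> ExtractorPort:
        for extractor in self._extractors:
            if extractor.supports(mime_type):
                return extractor
        raise LookupError(f"No extractor registered for {mime_type!r}")


class ProcessingPipeline:
    """Runs extraction via the router, then a trivial cleanup step."""

    def __init__(self, router: ExtractorRouter) -> None:
        self._router = router

    def run(self, data: bytes, mime_type: str) -> str:
        raw = self._router.route(mime_type).extract(data)
        return raw.strip()


class DocumentService:
    """Entry point shared by the driving adapters (API, CLI)."""

    def __init__(self, pipeline: ProcessingPipeline) -> None:
        self._pipeline = pipeline

    def convert(self, data: bytes, mime_type: str) -> str:
        return self._pipeline.run(data, mime_type)


class StubPdfExtractor(ExtractorPort):
    """Stand-in driven adapter; a real one would wrap a PDF library."""

    def supports(self, mime_type: str) -> bool:
        return mime_type == "application/pdf"

    def extract(self, data: bytes) -> str:
        return "[pdf text]"


class PlainTextExtractor(ExtractorPort):
    """Stand-in driven adapter for text/* inputs."""

    def supports(self, mime_type: str) -> bool:
        return mime_type.startswith("text/")

    def extract(self, data: bytes) -> str:
        return data.decode("utf-8", errors="replace")


# Composition root: the only place that names concrete adapters.
service = DocumentService(
    ProcessingPipeline(ExtractorRouter([StubPdfExtractor(), PlainTextExtractor()]))
)
```

Only the last statement knows which adapters exist. `DocumentService`, `ProcessingPipeline`, and `ExtractorRouter` refer to `ExtractorPort` alone, which corresponds to the dotted arrows from the ports to the driven adapters.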
Dependency Rule
Adapters → Application → Domain → NOTHING
- Domain has zero external dependencies
- Application depends only on domain ports
- Adapters implement ports and depend on external libraries
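
One way to keep the dependency rule from eroding is to check imports mechanically. The following is a sketch, not part of DocFlow: the function `layer_violations` and the `ALLOWED` table are assumptions. It parses a module's source with the standard-library `ast` module and reports absolute imports that point at a layer the module's own layer must not depend on. It does not check third-party imports, so the "zero external dependencies" rule for the domain would need a separate check.

```python
import ast
from typing import List

# Which project layers each layer may import from (besides itself).
ALLOWED = {
    "domain": set(),
    "application": {"domain"},
    "adapters": {"domain", "application"},
}


def layer_violations(source: str, layer: str, package: str = "docflow") -> List[str]:
    """Return imported names in `source` that break the rule for `layer`."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # Join module and imported name so `from docflow import adapters`
            # is seen as `docflow.adapters`.
            names = [f"{node.module}.{alias.name}" for alias in node.names]
        else:
            # Relative imports (level > 0) are not resolved in this sketch.
            continue
        for name in names:
            parts = name.split(".")
            if parts[0] != package or len(parts) < 2:
                continue  # not an import of a project layer
            target = parts[1]
            if target in ALLOWED and target != layer and target not in ALLOWED[layer]:
                violations.append(name)
    return violations
```

Run over every file under `src/docflow/` in a test, with the layer derived from the file's top-level directory, a check like this fails the build as soon as a domain module imports from `docflow.adapters` or `docflow.application`.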
Project Structure
src/docflow/
├── domain/                  # Zero external deps
│   ├── models.py            # Document, Metadata, Enums
│   ├── events.py            # Domain events
│   └── ports.py             # ABC interfaces (5 ports)
├── application/             # Orchestration
│   ├── config.py            # pydantic-settings
│   ├── pipeline.py          # ExtractorRouter + ProcessingPipeline
│   └── service.py           # DocumentService
├── adapters/
│   ├── inbound/
│   │   ├── api/             # FastAPI (app, routes, schemas, DI)
│   │   └── cli.py           # CLI entry point
│   └── outbound/
│       ├── extractors/      # Tika, PyMuPDF
│       ├── ocr/             # Tesseract
│       ├── processors/      # Cleanup
│       ├── formatters/      # Markdown, JSON, Plain Text
│       └── storage/         # Local filesystem
Next Steps