DocFlow¶
End-to-End Document Text Extraction Pipeline — PDF, Office, Images → Markdown.
DocFlow provides a simple, pluggable pipeline that extracts text from documents in many formats and converts it into structured output (Markdown, JSON, or plain text). It is built for enterprises that need adaptable, technology-agnostic document processing.
Features¶
Universal Extraction¶
PDF, DOCX, XLSX, PPTX, HTML, images, and 1000+ formats via Apache Tika.
Pluggable Architecture¶
Swap extractors, OCR engines, and post-processors without code changes — just config.
OCR Built-In¶
Tesseract OCR for scanned documents and images. The pluggable OCR layer leaves room for Google Vision or Azure as alternative engines.
Docker-First¶
One command to start: docker compose -f docker/docker-compose.yml up -d. Production-ready with health checks.
REST API¶
FastAPI with OpenAPI docs, file upload, sync extraction, and format selection.
Markdown Output¶
Clean Markdown with YAML front matter, or JSON / plain text.
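To make "Markdown with YAML front matter" concrete, here is a minimal Python sketch of such a formatter. It is an illustration only, not DocFlow's actual output code: the function name `to_markdown` and the metadata keys (`source`, `pages`, `ocr`) are assumptions chosen for the example.

```python
import json


def to_markdown(text: str, metadata: dict[str, object]) -> str:
    """Render extracted text as Markdown preceded by a YAML front matter block."""
    front_matter = ["---"]
    for key, value in metadata.items():
        # JSON scalars are also valid YAML scalars, so json.dumps handles the
        # quoting of strings and the spelling of booleans and numbers.
        front_matter.append(f"{key}: {json.dumps(value, ensure_ascii=False)}")
    front_matter.append("---")
    return "\n".join(front_matter) + "\n\n" + text.strip() + "\n"


if __name__ == "__main__":
    # Hypothetical metadata keys; the real front matter fields may differ.
    print(to_markdown(
        "# Invoice 42\n\nTotal: 10 EUR",
        {"source": "invoice.pdf", "pages": 2, "ocr": False},
    ))
```

The front matter sits between two `---` lines at the top of the file, so downstream tools can read the metadata without parsing the document body.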
Quick Start¶
# Clone the repository
git clone https://github.com/stefanposs/doc-flow.git
cd doc-flow
# Start with Docker
docker compose -f docker/docker-compose.yml up -d
# Extract text from a PDF
curl -X POST http://localhost:8000/api/v1/extract \
  -F "file=@document.pdf" \
  -F "output_format=markdown"
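The same upload can be made from Python. The sketch below uses only the standard library and assumes the endpoint and the form fields `file` and `output_format` exactly as in the curl command above; the helper names `build_extract_request` and `extract` are hypothetical, and the response is returned as raw text because its schema is not described here.

```python
import mimetypes
import os
import urllib.request
import uuid

# Assumption: same URL and form field names as the curl command above.
EXTRACT_URL = "http://localhost:8000/api/v1/extract"


def build_extract_request(filename, content, output_format="markdown", url=EXTRACT_URL):
    """Build, but do not send, a multipart/form-data POST request."""
    boundary = uuid.uuid4().hex
    # Guess the file part's content type from the file name.
    content_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    file_part_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    format_part = (
        f"\r\n--{boundary}\r\n"
        'Content-Disposition: form-data; name="output_format"\r\n\r\n'
        f"{output_format}\r\n"
        f"--{boundary}--\r\n"
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=file_part_header + content + format_part,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


def extract(path, output_format="markdown"):
    """Upload a local file and return the response body as text."""
    with open(path, "rb") as handle:
        request = build_extract_request(
            os.path.basename(path), handle.read(), output_format
        )
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")
```

With the service running, `print(extract("document.pdf"))` would print whatever the endpoint returns for that file.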
Architecture at a Glance¶
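The routing step in this section's diagram can be read as a small dispatch on MIME type. The Python sketch below illustrates that idea only and is not DocFlow's implementation: the names `route` and `run_pipeline`, the extractor keys, and the whitespace-trimming post-processing step are all assumptions, and the extractors are passed in as plain callables so the example needs no third-party libraries.

```python
from typing import Callable

# An extractor takes raw file bytes and returns extracted text.
Extractor = Callable[[bytes], str]


def route(mime_type: str) -> str:
    """Pick an extractor key for a MIME type, following the diagram's branches."""
    mime = mime_type.split(";", 1)[0].strip().lower()
    if mime == "application/pdf":
        return "pymupdf"      # PDF branch
    if mime.startswith("image/"):
        return "tesseract"    # image branch (OCR)
    return "tika"             # Office and everything else


def run_pipeline(data: bytes, mime_type: str, extractors: dict[str, Extractor]) -> str:
    """Extract, then apply a stand-in post-processing step."""
    raw = extractors[route(mime_type)](data)
    # Hypothetical post-processing: trim trailing whitespace on each line.
    cleaned = "\n".join(line.rstrip() for line in raw.splitlines()).strip()
    return cleaned + "\n"


if __name__ == "__main__":
    # Stand-in extractor; a real one would call the OCR engine.
    fake = {"tesseract": lambda data: "hello  \nworld \n"}
    print(run_pipeline(b"...", "image/png", fake), end="")
```

Keeping the routing decision separate from the extractors is what allows a backend to be swapped by changing which callable is registered under a key.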
graph LR
A[Document Upload] --> B[Extractor Router]
B --> C{MIME Type?}
C -->|PDF| D[PyMuPDF]
C -->|Office/Other| E[Apache Tika]
C -->|Image| F[Tesseract OCR]
D --> G[Post-Processing]
E --> G
F --> G
G --> H[Output Formatter]
H --> I[Markdown / JSON / Text]
Documentation¶
- Installation — setup and dependencies
- Quick Start — first extraction in 5 minutes
- Architecture — hexagonal design and bounded contexts
- API Reference — REST endpoint documentation
Contributing¶
Contributions are welcome! See the Contributing Guide for development setup and workflow.
License¶
MIT — see LICENSE for details.