DocFlow

End-to-End Document Text Extraction Pipeline — PDF, Office, Images → Markdown.

DocFlow is a simple, pluggable pipeline that extracts text from PDFs, Office files, images, and other document formats and converts it to Markdown, JSON, or plain text. It is built for enterprises that need adaptable, technology-agnostic document processing.

Features

Universal Extraction

PDF, DOCX, XLSX, PPTX, HTML, images, and 1000+ formats via Apache Tika.

Pluggable Architecture

Swap extractors, OCR engines, and post-processors without code changes — just config.

OCR Built-In

Tesseract OCR for scanned documents and images. Google Vision / Azure ready.

Docker-First

One command to start: docker compose -f docker/docker-compose.yml up -d. Production-ready with health checks.

REST API

FastAPI service with OpenAPI docs, file upload, synchronous extraction, and output format selection.

Markdown Output

Clean Markdown with YAML front matter, or JSON / plain text.
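
To illustrate what Markdown with YAML front matter looks like, here is a minimal sketch. It is not DocFlow's formatter: the function name `render_markdown` and the metadata keys used in the usage note are hypothetical, and the real output fields may differ.

```python
import json


def render_markdown(text: str, metadata: dict[str, str]) -> str:
    """Prefix extracted text with a YAML front matter block (illustrative only)."""
    # json.dumps emits double-quoted strings, which are also valid YAML scalars.
    lines = [f"{key}: {json.dumps(value)}" for key, value in metadata.items()]
    # Front matter is delimited by '---' lines, followed by a blank line and the body.
    return "---\n" + "\n".join(lines) + "\n---\n\n" + text.strip() + "\n"
```

Calling `render_markdown("# Title\n\nBody", {"source": "document.pdf"})` yields a `---` delimited header containing `source: "document.pdf"`, followed by the Markdown body.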

Quick Start

# Clone the repository
git clone https://github.com/stefanposs/doc-flow.git
cd doc-flow

# Start with Docker
docker compose -f docker/docker-compose.yml up -d

# Extract text from a PDF
curl -X POST http://localhost:8000/api/v1/extract \
  -F "file=@document.pdf" \
  -F "output_format=markdown"
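
The same request can be built from Python. This is a minimal standard-library sketch that assumes only what the curl call shows: the `/api/v1/extract` endpoint, the `file` and `output_format` form fields, and a service on `localhost:8000`. The helper name `build_extract_request` is hypothetical, and nothing is assumed about the shape of the response.

```python
import mimetypes
import urllib.request
import uuid


def build_extract_request(
    filename: str,
    content: bytes,
    output_format: str = "markdown",
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build a multipart/form-data POST mirroring the curl call above."""
    boundary = uuid.uuid4().hex
    content_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"

    # Each form part starts with the boundary line and ends with CRLF.
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        b'Content-Disposition: form-data; name="output_format"\r\n\r\n',
        output_format.encode() + b"\r\n",
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        f"Content-Type: {content_type}\r\n\r\n".encode(),
        content + b"\r\n",
        f"--{boundary}--\r\n".encode(),  # closing boundary
    ])
    return urllib.request.Request(
        f"{base_url}/api/v1/extract",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

To send it, read the file with `pathlib.Path("document.pdf").read_bytes()`, pass the bytes to `build_extract_request`, and hand the result to `urllib.request.urlopen`. Building the request separately from sending it keeps the multipart encoding easy to inspect without a running service.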

Architecture at a Glance

graph LR
    A[Document Upload] --> B[Extractor Router]
    B --> C{MIME Type?}
    C -->|PDF| D[PyMuPDF]
    C -->|Office/Other| E[Apache Tika]
    C -->|Image| F[Tesseract OCR]
    D --> G[Post-Processing]
    E --> G
    F --> G
    G --> H[Output Formatter]
    H --> I[Markdown / JSON / Text]
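
The routing step in the diagram can be sketched as a small dispatch function. This is an illustration under stated assumptions, not DocFlow's actual router: the name `route_by_mime_type` is hypothetical, the backend labels are copied from the diagram, and the MIME type is guessed from the file extension rather than from the file contents.

```python
import mimetypes

# Backend labels as they appear in the diagram; the constant names are illustrative.
PDF_BACKEND = "PyMuPDF"
IMAGE_BACKEND = "Tesseract OCR"
FALLBACK_BACKEND = "Apache Tika"


def route_by_mime_type(filename: str) -> str:
    """Return the backend label the diagram associates with a file's MIME type."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type == "application/pdf":
        return PDF_BACKEND
    if mime_type is not None and mime_type.startswith("image/"):
        return IMAGE_BACKEND
    # Office formats and anything unrecognised take the "Office/Other" branch.
    return FALLBACK_BACKEND
```

Treating Apache Tika as the fallback matches the "Office/Other" branch: any type that is neither a PDF nor an image, including an unknown one, ends up there.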

Documentation

Installation — setup and dependencies
Quick Start — first extraction in 5 minutes
Architecture — hexagonal design and bounded contexts
API Reference — REST endpoint documentation

Contributing

Contributions are welcome! See the Contributing Guide for development setup and workflow.

License

MIT — see LICENSE for details.