DocFlow

End-to-End Document Text Extraction Pipeline — PDF, Office, Images → Markdown.

DocFlow is a simple, pluggable pipeline that extracts text from PDF, Office, HTML, and image files and converts it into Markdown, JSON, or plain text. It is built for enterprises that need adaptable, technology-agnostic document processing.


Features

Universal Extraction

PDF, DOCX, XLSX, PPTX, HTML, images, and 1000+ formats via Apache Tika.

Pluggable Architecture

Swap extractors, OCR engines, and post-processors through configuration alone, with no code changes.

OCR Built-In

Tesseract OCR for scanned documents and images. Google Vision / Azure ready.

Docker-First

One command to start: docker compose -f docker/docker-compose.yml up -d. Production-ready with health checks.

REST API

FastAPI service with OpenAPI docs, file upload, synchronous extraction, and output format selection.

Markdown Output

Clean Markdown with YAML front matter, or JSON / plain text.
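
To make the "Markdown with YAML front matter" output concrete, here is a minimal sketch of what such a document might look like and how it could be assembled. The function name `render_markdown` and the metadata keys `source` and `extractor` are assumptions for illustration; the fields DocFlow actually emits may differ.

```python
import json


def render_markdown(text: str, metadata: dict) -> str:
    """Prefix extracted text with a YAML front matter block (illustrative only)."""
    lines = ["---"]
    for key, value in metadata.items():
        # json.dumps produces a double-quoted string, which is also a valid YAML scalar.
        lines.append(f"{key}: {json.dumps(value)}")
    lines.append("---")
    return "\n".join(lines) + "\n\n" + text.strip() + "\n"


# Hypothetical metadata keys; not taken from DocFlow's real output.
doc = render_markdown(
    "# Report\n\nHello.",
    {"source": "document.pdf", "extractor": "pymupdf"},
)
print(doc)
```

Quoting every value through `json.dumps` keeps the front matter parseable even when a value contains colons or quotes.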


Quick Start

# Clone the repository
git clone https://github.com/stefanposs/doc-flow.git
cd doc-flow

# Start with Docker
docker compose -f docker/docker-compose.yml up -d

# Extract text from a PDF
curl -X POST http://localhost:8000/api/v1/extract \
  -F "file=@document.pdf" \
  -F "output_format=markdown"
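
The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl call above: the endpoint path and the `file` and `output_format` form fields come from that command, while the helper names `build_multipart` and `extract` are hypothetical. The response format is not specified here, so the function returns the raw response body as text.

```python
import io
import os
import urllib.request
import uuid


def build_multipart(fields: dict, file_field: str, filename: str, data: bytes):
    """Encode form fields plus one file as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()
    for name, value in fields.items():
        buf.write(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode()
        )
    buf.write(
        f'--{boundary}\r\nContent-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\nContent-Type: application/octet-stream'
        f"\r\n\r\n".encode()
    )
    buf.write(data)
    buf.write(f"\r\n--{boundary}--\r\n".encode())
    return buf.getvalue(), f"multipart/form-data; boundary={boundary}"


def extract(path: str, base_url: str = "http://localhost:8000") -> str:
    """POST a file to the extract endpoint and return the raw response text."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart(
            {"output_format": "markdown"}, "file", os.path.basename(path), fh.read()
        )
    request = urllib.request.Request(
        f"{base_url}/api/v1/extract",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Build a request body without touching the network, to show the encoding step.
body, content_type = build_multipart(
    {"output_format": "markdown"}, "file", "document.pdf", b"%PDF-1.7"
)
print(content_type.split(";")[0])
```

With the Docker stack running, `extract("document.pdf")` would perform the upload; it is not called here because it needs a live server.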

Architecture at a Glance

graph LR
    A[Document Upload] --> B[Extractor Router]
    B --> C{MIME Type?}
    C -->|PDF| D[PyMuPDF]
    C -->|Office/Other| E[Apache Tika]
    C -->|Image| F[Tesseract OCR]
    D --> G[Post-Processing]
    E --> G
    F --> G
    G --> H[Output Formatter]
    H --> I[Markdown / JSON / Text]
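
The routing step in the diagram can be sketched as a small dispatch on MIME type. This is an illustration of the idea under stated assumptions, not DocFlow's actual router: the function name `select_extractor` and the string labels it returns are hypothetical, and the real router may consult configuration or file contents rather than the file name alone.

```python
import mimetypes
from typing import Optional


def select_extractor(mime_type: Optional[str]) -> str:
    """Pick an extractor label for a MIME type, following the diagram's branches."""
    if mime_type == "application/pdf":
        return "pymupdf"
    if mime_type is not None and mime_type.startswith("image/"):
        return "tesseract"
    # Office documents and anything unrecognised fall through to Apache Tika.
    return "tika"


# Guess the MIME type from the file name, then route it.
for name in ("document.pdf", "scan.png", "report.docx"):
    guessed, _ = mimetypes.guess_type(name)
    print(name, "->", select_extractor(guessed))
```

Treating Tika as the fallback matches the "Office/Other" branch: any type that is neither PDF nor an image ends up there.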

Contributing

Contributions are welcome! See the Contributing Guide for development setup and workflow.


License

MIT — see LICENSE for details.