pdf-automation
Python, OCR, Automation, python-docx, AI / LLM

PDF Processing Pipeline

An end-to-end automated pipeline that converts scanned PDF pages into formatted Word documents, handling multi-column layouts, side-by-side Latin/English text, footnotes, and headings across hundreds of pages.

Overview

The client needed 500 pages of scanned religious text (two-column Latin/English format) converted into precisely formatted .docx files. Manual typing would have taken weeks. The pipeline automated the entire process: OCR extraction, markup tagging, layout parsing, and Word generation.
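As a rough illustration of how such stages might be chained in a batch run, the sketch below passes each page through a list of stage functions and records failures instead of stopping. The names `run_batch`, `BatchResult`, `fake_ocr`, and `fake_tagging` are hypothetical stand-ins, not the project's actual code, and the real stages (OCR, tagging, layout parsing, Word generation) are replaced by trivial string functions.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Iterable

# A stage takes one page's payload and returns the transformed payload.
Stage = Callable[[str], str]


@dataclass
class BatchResult:
    done: dict[int, str] = field(default_factory=dict)    # page number -> final output
    failed: dict[int, str] = field(default_factory=dict)  # page number -> error message


def run_batch(pages: Iterable[tuple[int, str]], stages: list[Stage]) -> BatchResult:
    """Run every page through the stages in order; record failures and keep going."""
    result = BatchResult()
    for number, payload in pages:
        try:
            for stage in stages:
                payload = stage(payload)
            result.done[number] = payload
        except Exception as exc:  # one bad page should not halt an unattended run
            result.failed[number] = f"{type(exc).__name__}: {exc}"
    return result


# Hypothetical stand-ins for the real stages.
def fake_ocr(scan: str) -> str:
    if not scan:
        raise ValueError("empty scan")
    return scan.strip()


def fake_tagging(text: str) -> str:
    return f"[H]{text}[/H]"


result = run_batch([(1, "  Title "), (2, "")], [fake_ocr, fake_tagging])
print(result.done)    # {1: '[H]Title[/H]'}
print(result.failed)  # {2: 'ValueError: empty scan'}
```

Collecting failed pages separately is one way to let a long run finish and leave a short list of pages for manual review afterwards.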

A custom markup language was designed to tag document structure (headings, page headers, footnotes, two-column sections, italic text, superscripts), and a formatting engine converts the tagged text into Word output that closely matches the source layout.
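The project's actual tag syntax is not shown here, so the following is only a minimal sketch of the idea, assuming hypothetical inline tags `[i]...[/i]` for italics and `[sup]...[/sup]` for superscripts. It splits one line of tagged text into styled spans that a formatting step could then turn into Word runs.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    text: str
    italic: bool = False
    superscript: bool = False


# Hypothetical inline tags; the real markup syntax is an assumption here.
TOKEN = re.compile(r"\[(/?)(i|sup)\]")


def parse_inline(line: str) -> list[Span]:
    """Split a tagged line into spans carrying italic/superscript flags."""
    spans: list[Span] = []
    active = {"i": False, "sup": False}
    pos = 0
    for match in TOKEN.finditer(line):
        # Emit the plain text that precedes this tag with the current styles.
        if match.start() > pos:
            spans.append(Span(line[pos:match.start()], active["i"], active["sup"]))
        closing, name = match.groups()
        active[name] = not closing  # "[i]" switches the style on, "[/i]" off
        pos = match.end()
    if pos < len(line):
        spans.append(Span(line[pos:], active["i"], active["sup"]))
    return spans


spans = parse_inline("In principio[sup]1[/sup] erat [i]Verbum[/i].")
for span in spans:
    print(span)
```

With python-docx, each span could then become a run via `paragraph.add_run(span.text)`, with `run.italic` and `run.font.superscript` set from the flags. Block-level tags such as headings, footnotes, and two-column sections would need a separate pass that this sketch does not cover.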

Key Features

Tech Stack

Python, python-docx, Gemini Vision API, Custom Parser, Batch Runner, QC Gate

Outcome

500 pages processed and delivered. The pipeline reduced what would have been weeks of manual work to a fully automated overnight run, with formatting quality exceeding the client's specifications.