This module provides layout-aware OCR as part of a larger media processing system.
abstract_ocr focuses on extraction:
- multi-engine OCR (Tesseract / EasyOCR / PaddleOCR)
- column detection and region segmentation
- structured, position-aware text output
Full system: https://github.com/AbstractEndeavors/abstract-media-intelligence
A structured OCR pipeline designed for layout-aware text extraction from complex documents, combining preprocessing, column detection, region classification, and ordered OCR assembly.
Built to handle:
- multi-column PDFs
- mixed-content layouts (text, figures, captions)
- noisy or scanned documents
- large-scale document ingestion pipelines
This is not a simple OCR wrapper. It is a typed, multi-stage processing pipeline that:
- transforms raw images into structured page representations
- detects document layout (columns, headers, regions)
- classifies content blocks (text, figures, captions)
- applies OCR at the region level
- reconstructs output in correct reading order
The system is designed for deterministic, reproducible extraction rather than heuristic text scraping.
PDF Input
↓
Slice / Decompose (images + text per page)
↓
OCR + Text Extraction (layout-aware engines)
↓
Metadata Generation
├─ summaries
├─ keywords
└─ descriptions
↓
Manifest Creation (per-page + per-document)
↓
HTML Generation
├─ PDF viewer pages
└─ gallery index pages
↓
Static Site Output (SEO-ready)
flowchart TD
A[Input Image / Page Image]
B[Preprocess\nDenoise + Binarize]
C[Layout Detection\nColumns + Header Cutoff]
D[Region Classification\nText / Figure / Caption]
E[Region OCR\nCrop + Tesseract]
F[Fallback OCR\nColumn-level OCR]
G[Reading Order Assembly]
H[Structured OCRResult\nBlocks + Raw Text + Layout]
A --> B --> C --> D --> E --> G --> H
D -->|No usable regions| F --> G
Layout Detection:
- Column detection via vertical projection valleys
- Header segmentation via density scanning
- Multi-column classification (single / dual / mixed)
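Column detection via projection valleys can be illustrated with a minimal sketch. This is not the module's actual implementation: the function names, the `min_gap` and `max_ink` parameters, and the list-of-lists image format (1 = ink, 0 = background) are assumptions made for the example.

```python
from typing import List, Tuple


def vertical_projection(binary: List[List[int]]) -> List[int]:
    """Count ink pixels (value 1) in each pixel column of a binarized page."""
    if not binary:
        return []
    width = len(binary[0])
    return [sum(row[x] for row in binary) for x in range(width)]


def find_columns(
    binary: List[List[int]],
    min_gap: int = 3,   # hypothetical threshold: minimum valley width in pixels
    max_ink: int = 0,   # hypothetical threshold: ink count still treated as blank
) -> List[Tuple[int, int]]:
    """Return (start, end) x-ranges of text columns, end exclusive.

    A valley is a run of at least `min_gap` consecutive pixel columns whose
    ink count is at most `max_ink`. Narrower blank runs stay inside a column.
    """
    profile = vertical_projection(binary)
    columns: List[Tuple[int, int]] = []
    start = None  # x where the current text column began
    gap = 0       # length of the blank run currently being scanned
    for x, ink in enumerate(profile):
        if ink > max_ink:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The blank run is wide enough to be a valley: close the column.
                columns.append((start, x - gap + 1))
                start, gap = None, 0
    if start is not None:
        # Close the last column, trimming any trailing blank pixels.
        columns.append((start, len(profile) - gap))
    return columns
```

On a 12-pixel-wide page with ink at x = 0..3 and x = 8..11, `find_columns` returns two ranges, one per text column. A production version would more likely operate on NumPy arrays and smooth the profile first.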
Region Classification:
- Connected-component analysis
- Density-based classification (text vs figure vs caption)
- Column-aware region assignment
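As a rough illustration of connected-component analysis followed by density-based classification, the sketch below labels 4-connected ink blobs and assigns each a region type. The thresholds (`figure_density`, `caption_max_height`) and the rule that short blocks count as captions are assumptions for the example, not values or rules taken from this module.

```python
from collections import deque
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), x1 and y1 exclusive


def connected_components(binary: List[List[int]]) -> List[Tuple[Box, int]]:
    """Return (bounding box, ink pixel count) for each 4-connected ink blob."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    blobs: List[Tuple[Box, int]] = []
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill starting from an unvisited ink pixel.
            queue = deque([(x, y)])
            seen[y][x] = True
            x0, y0, x1, y1, count = x, y, x, y, 0
            while queue:
                cx, cy = queue.popleft()
                count += 1
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and binary[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            blobs.append(((x0, y0, x1 + 1, y1 + 1), count))
    return blobs


def classify(
    box: Box,
    ink: int,
    figure_density: float = 0.6,   # hypothetical threshold
    caption_max_height: int = 2,   # hypothetical threshold, in pixels
) -> str:
    """Classify a region as "figure", "caption", or "text" (illustrative rules)."""
    x0, y0, x1, y1 = box
    block_height = y1 - y0
    density = ink / ((x1 - x0) * block_height)
    if block_height <= caption_max_height:
        return "caption"   # short blocks are treated as captions
    if density >= figure_density:
        return "figure"    # tall, densely inked blocks are treated as figures
    return "text"
```

In practice the binarized image would usually be dilated first so that characters merge into block-level blobs, and caption detection would also consider proximity to a figure.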
Region-Level OCR:
- OCR applied per detected block (not full-page)
- Adaptive Tesseract configuration by region type
- Automatic fallback to column-level OCR when detection fails
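The region-first, column-fallback behavior can be sketched as follows. The OCR engine is passed in as a callable so the example stays self-contained; `ocr_page`, `PSM_BY_REGION`, and `min_regions` are hypothetical names. The `--psm` values are real Tesseract page segmentation modes, but mapping them to region types this way is an assumption.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]    # (x0, y0, x1, y1)
OcrFn = Callable[[Box, str], str]  # (crop box, engine config) -> recognized text

# Assumed mapping from region type to a Tesseract page segmentation mode.
PSM_BY_REGION = {
    "text": "--psm 6",     # assume a single uniform block of text
    "caption": "--psm 7",  # treat the crop as a single text line
}
COLUMN_PSM = "--psm 4"     # assume a single column of text of variable sizes


def ocr_page(
    regions: Sequence[Tuple[Box, str]],
    columns: Sequence[Box],
    ocr: OcrFn,
    min_regions: int = 1,  # hypothetical threshold for "usable" detection
) -> Tuple[List[str], str]:
    """OCR each text-bearing region, or fall back to whole columns.

    Returns the recognized texts and the mode used ("regions" or "columns").
    """
    usable = [(box, kind) for box, kind in regions if kind in PSM_BY_REGION]
    if len(usable) >= min_regions:
        return [ocr(box, PSM_BY_REGION[kind]) for box, kind in usable], "regions"
    # Region detection produced nothing usable: OCR each column as one block.
    return [ocr(box, COLUMN_PSM) for box in columns], "columns"
```

Returning the mode alongside the text keeps the fallback visible to callers instead of hiding it.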
Reading Order Reconstruction:
- Column-aware ordering
- Top-to-bottom sequencing within columns
- Header/body/caption prioritization
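One way to express this ordering is a single sort key. The `Block` type and its fields are hypothetical, and the rule that headers always come first is an assumption about what "prioritization" means here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Block:
    text: str
    column: int          # 0-based column index, left to right
    top: int             # y coordinate of the block's top edge
    role: str = "body"   # "header", "body", or "caption"


def reading_order(blocks: List[Block]) -> List[Block]:
    """Sort blocks: headers first, then column by column, top to bottom."""
    def key(block: Block) -> Tuple[int, int, int]:
        if block.role == "header":
            # Headers span the page, so their column index is ignored.
            return (0, 0, block.top)
        return (1, block.column, block.top)
    return sorted(blocks, key=key)
```

Because the key sorts on column before vertical position, text from the right column is never interleaved with text from the left column, even when blocks share the same y coordinate.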
Typed Pipeline Execution:
- All steps validated via explicit input/output types
- Registry-driven execution model
- No implicit coupling between pipeline stages
The pipeline is built around a step registry + type-safe execution chain:
- Each step declares an input type and an output type
- The pipeline validates compatibility before execution
- Execution is explicit, deterministic, and observable
Example chain:
["preprocess", "detect_layout", "ocr_regions"]
Each step is independently replaceable and composable.
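A minimal sketch of a registry with up-front type validation might look like this. The step names and result type names come from this README; the `register`, `validate`, and `execute` helpers and the empty placeholder classes are assumptions made for illustration, not the module's actual API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


# Empty placeholders standing in for the real structured result types.
class PageImage:
    pass


class PreprocessedImage:
    pass


class LayoutDetection:
    pass


class OCRResult:
    pass


@dataclass(frozen=True)
class Step:
    name: str
    input_type: type
    output_type: type
    run: Callable[[Any], Any]


REGISTRY: Dict[str, Step] = {}


def register(name: str, input_type: type, output_type: type):
    """Decorator that records a step and its declared types in the registry."""
    def decorator(fn: Callable[[Any], Any]) -> Callable[[Any], Any]:
        REGISTRY[name] = Step(name, input_type, output_type, fn)
        return fn
    return decorator


@register("preprocess", PageImage, PreprocessedImage)
def preprocess(page: PageImage) -> PreprocessedImage:
    return PreprocessedImage()  # placeholder body


@register("detect_layout", PreprocessedImage, LayoutDetection)
def detect_layout(image: PreprocessedImage) -> LayoutDetection:
    return LayoutDetection()    # placeholder body


@register("ocr_regions", LayoutDetection, OCRResult)
def ocr_regions(layout: LayoutDetection) -> OCRResult:
    return OCRResult()          # placeholder body


def validate(chain: List[str]) -> List[Step]:
    """Check that each step's output type feeds the next step's input type."""
    steps = [REGISTRY[name] for name in chain]  # KeyError on an unknown step
    for prev, nxt in zip(steps, steps[1:]):
        if not issubclass(prev.output_type, nxt.input_type):
            raise TypeError(
                f"{prev.name} outputs {prev.output_type.__name__}, "
                f"but {nxt.name} expects {nxt.input_type.__name__}"
            )
    return steps


def execute(chain: List[str], value: Any) -> Any:
    """Validate the whole chain first, then run the steps in order."""
    for step in validate(chain):
        value = step.run(value)
    return value
```

With this shape, `execute(["preprocess", "detect_layout", "ocr_regions"], PageImage())` runs end to end, while a chain such as `["preprocess", "ocr_regions"]` is rejected with a `TypeError` before any step runs.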
All intermediate results are structured dataclasses:
- PageImage
- PreprocessedImage
- LayoutDetection
- OCRResult
Avoiding ad-hoc dictionaries ensures:
- traceability
- consistency
- debuggability
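To show what structured intermediate results could look like, here is a sketch using frozen dataclasses. The four main type names appear in this README, and `OCRResult` is described as carrying blocks, raw text, and layout; every field name below, and the `OCRBlock` helper type, is an assumption for illustration.

```python
from dataclasses import dataclass
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1)


@dataclass(frozen=True)
class PageImage:
    page_number: int
    width: int
    height: int


@dataclass(frozen=True)
class PreprocessedImage:
    source: PageImage   # link back to the input, for traceability
    denoised: bool
    binarized: bool


@dataclass(frozen=True)
class LayoutDetection:
    image: PreprocessedImage
    columns: Tuple[Box, ...]
    header_cutoff: int  # y coordinate where the header ends


@dataclass(frozen=True)
class OCRBlock:
    box: Box
    kind: str           # "text", "figure", or "caption"
    text: str


@dataclass(frozen=True)
class OCRResult:
    layout: LayoutDetection
    blocks: Tuple[OCRBlock, ...]

    @property
    def raw_text(self) -> str:
        """Plain text of all blocks, in the order they are stored."""
        return "\n".join(block.text for block in self.blocks)


# Usage: every result can be traced back to the page it came from.
page = PageImage(page_number=1, width=2480, height=3508)
layout = LayoutDetection(
    image=PreprocessedImage(source=page, denoised=True, binarized=True),
    columns=((0, 200, 1200, 3508), (1280, 200, 2480, 3508)),
    header_cutoff=200,
)
result = OCRResult(
    layout=layout,
    blocks=(
        OCRBlock((0, 0, 2480, 200), "text", "Title"),
        OCRBlock((0, 200, 1200, 900), "text", "Body"),
    ),
)
```

Freezing the dataclasses means a later stage cannot silently mutate an earlier stage's output, which supports the traceability goal.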
OCR is applied after structure is understood, not before.
This prevents:
- column interleaving
- incorrect reading order
- misclassification of content
If region detection fails:
- system falls back to column-level OCR
- ensures output is still usable
Deterministic behavior:
- explicit thresholds (config-driven)
- no hidden behavior
- reproducible results across runs
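Config-driven thresholds could be modeled as an immutable settings object. In this sketch the `OCRConfig` class, its field names, its default values, and the `fingerprint` helper are all hypothetical. It only illustrates keeping every threshold explicit and recording which configuration produced a given result.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class OCRConfig:
    # Every threshold is an explicit, named field (values are placeholders).
    min_column_gap_px: int = 20
    figure_density: float = 0.6
    min_usable_regions: int = 1

    def fingerprint(self) -> str:
        """Stable hash of the settings, suitable for storing with each result."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def should_fallback(usable_regions: int, config: OCRConfig) -> bool:
    """Decide on column-level fallback from the config alone, with no hidden state."""
    return usable_regions < config.min_usable_regions
```

Two runs with equal `OCRConfig` values share a fingerprint, so differing outputs can be traced to a changed input or a changed setting.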
Traditional OCR pipelines:
- ignore layout
- operate on full pages
- produce inconsistent reading order
- fail silently on complex documents
This system:
- understands document structure
- isolates regions before OCR
- enforces reading order
- produces structured outputs suitable for downstream systems
Use cases:
- PDF → structured text extraction
- research document ingestion pipelines
- financial filings parsing
- multi-column article extraction
- preprocessing for NLP / LLM pipelines
- search indexing and document analysis
This module is designed to plug into:
- document ingestion systems
- OCR + NLP pipelines (e.g. abstract_hugpy)
- search and indexing systems
- large-scale document processing workflows
Design principles:
- Structure before extraction
- Determinism over convenience
- Typed pipelines over implicit flows
- Fallback over failure