GitHub - AbstractEndeavors/abstract-intelligence: Umbrella platform for the abstract_ OCR, PDF, video, and NLP stack

Abstract Intelligence Platform

A unified system for transforming raw media — documents, images, and video — into structured, searchable, and SEO-optimized data.

This platform combines ingestion, extraction, enrichment, and publishing into a cohesive pipeline designed for real-world data workflows and decision systems.

What This System Does

The platform processes unstructured media and converts it into structured text, enriched metadata, searchable datasets, and web-ready content.

It operates across multiple media types:

PDFs → structured documents + web pages
Images → OCR + metadata
Video → transcription + frame analysis + SEO data

End-to-End Pipeline

Raw Media (PDF / Image / Video)
        ↓
Ingestion Layer
        ↓
Extraction Layer
    ├─ OCR (documents + images)
    ├─ Transcription (video/audio)
    └─ Frame analysis (video)
        ↓
Structuring Layer
    ├─ Page-level / segment-level decomposition
    └─ Typed metadata generation
        ↓
Enrichment Layer
    ├─ Summarization
    ├─ Keyword extraction
    └─ Title + SEO generation
        ↓
Persistence Layer
    ├─ Filesystem (structured assets)
    └─ Database (JSONB metadata)
        ↓
Output Layer
    ├─ Static HTML (galleries, viewers)
    ├─ Searchable datasets
    └─ API-ready content
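
The layering above can be sketched as a chain of small functions that each take and return one record. This is a minimal illustration only: `MediaRecord`, the stage functions, and `run_pipeline` are hypothetical names, not the platform's API, and the extraction step returns placeholder text instead of calling a real OCR or transcription engine.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class MediaRecord:
    """Hypothetical record handed from layer to layer (not the platform's schema)."""
    source: str
    media_type: str  # "pdf", "image", or "video"
    text: str = ""
    segments: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


Stage = Callable[[MediaRecord], MediaRecord]


def extract(record: MediaRecord) -> MediaRecord:
    # Placeholder: a real extraction layer would run OCR or transcription here.
    record.text = "Page one text. Page two text."
    return record


def structure(record: MediaRecord) -> MediaRecord:
    # Split the raw text into segments (a stand-in for page/segment decomposition).
    record.segments = [s.strip() for s in record.text.split(".") if s.strip()]
    return record


def enrich(record: MediaRecord) -> MediaRecord:
    # Derive simple metadata from the structured segments.
    record.metadata["title"] = record.segments[0][:60] if record.segments else ""
    record.metadata["segment_count"] = len(record.segments)
    return record


def run_pipeline(record: MediaRecord, stages: list) -> MediaRecord:
    # Each stage only sees the record, so stages can be swapped or reordered.
    for stage in stages:
        record = stage(record)
    return record


result = run_pipeline(MediaRecord("report.pdf", "pdf"), [extract, structure, enrich])
print(result.metadata)
```

Because every stage shares the same signature, a persistence or output stage could be appended to the list without touching the earlier ones.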

Platform Architecture

The platform is organized as a modular media pipeline:

abstract_hugpy — summarization, keyword extraction, metadata generation, and refinement
abstract_pdfs — PDF decomposition, manifests, and HTML generation
abstract_videos — video ingestion, transcription, frame extraction, and media metadata
abstract_ocr — layout-aware OCR and structured text extraction

flowchart LR

    A1[PDFs]
    A2[Images]
    A3[Videos]

    B1[abstract_pdfs\nDocument decomposition]
    B2[abstract_ocr\nLayout-aware OCR]
    B3[abstract_videos\nTranscription + frame extraction]

    C1[abstract_hugpy\nSummaries, keywords,\nmetadata, refinement]

    D1[Structured Filesystem\npages, images, text, manifests]
    D2[Database / JSONB\nmetadata, transcripts,\naggregated outputs]

    E1[Static HTML\nviewers + galleries]
    E2[Searchable Corpus]
    E3[API / SEO / LLM-ready Data]

    A1 --> B1
    A2 --> B2
    A3 --> B3

    B1 --> B2
    B1 --> C1
    B2 --> C1
    B3 --> C1

    B1 --> D1
    B2 --> D1
    B3 --> D2
    C1 --> D1
    C1 --> D2

    D1 --> E1
    D1 --> E2
    D2 --> E2
    D2 --> E3
    C1 --> E3

System Components

`abstract_pdfs` — Document Pipeline

Transforms PDFs into structured, SEO-ready content.

Page-level decomposition (text + images)
Metadata + manifest generation
Static HTML generation (viewer + gallery)
SEO tagging and keyword extraction

Output: searchable document corpus
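
As a rough illustration of manifest generation, the sketch below scans a directory of per-page text files and records one entry per page. The `page_NNNN.txt` naming scheme, the manifest fields, and both function names are assumptions made for this example; the actual `abstract_pdfs` manifest format is not described here.

```python
import json
from pathlib import Path


def build_manifest(doc_id: str, pages_dir) -> dict:
    """Collect one manifest entry per page text file (assumed name: page_NNNN.txt)."""
    pages = []
    for text_file in sorted(Path(pages_dir).glob("page_*.txt")):
        text = text_file.read_text(encoding="utf-8")
        pages.append({
            "page": int(text_file.stem.split("_")[1]),
            "text_file": text_file.name,
            "word_count": len(text.split()),
        })
    return {"doc_id": doc_id, "page_count": len(pages), "pages": pages}


def write_manifest(manifest: dict, out_path) -> None:
    # Store the manifest next to the pages as plain JSON.
    Path(out_path).write_text(json.dumps(manifest, indent=2), encoding="utf-8")
```

A viewer or gallery generator could then read the manifest instead of re-scanning the page directory.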

`abstract_ocr` — Extraction Engine

Multi-engine OCR system with layout awareness.

Column detection and region segmentation
Multi-engine fallback (Tesseract / EasyOCR / PaddleOCR)
Structured text with positional metadata

Output: reliable text extraction across layouts
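
One way to picture multi-engine fallback is a loop that tries engines in order and accepts the first result above a confidence threshold. The sketch below assumes each engine is a callable returning `(text, confidence)`; the function name, the threshold, and the stub engines are hypothetical, and no real Tesseract, EasyOCR, or PaddleOCR calls are made.

```python
from typing import Callable, Optional

# Assumption: an engine takes an image path and returns (text, confidence in 0..1).
OcrEngine = Callable[[str], tuple]


def ocr_with_fallback(image_path: str, engines: list, min_confidence: float = 0.6) -> Optional[dict]:
    """Try each (name, engine) pair in order; fall back when one fails or is unsure."""
    best = None
    for name, engine in engines:
        try:
            text, confidence = engine(image_path)
        except Exception:
            # A crashing engine should not stop the pipeline; try the next one.
            continue
        candidate = {"engine": name, "text": text, "confidence": confidence}
        if text.strip() and confidence >= min_confidence:
            return candidate
        # Remember the most confident weak result in case nothing passes.
        if best is None or confidence > best["confidence"]:
            best = candidate
    return best


# Stub engines standing in for real OCR backends.
def failing_engine(path: str):
    raise RuntimeError("engine not installed")


def low_confidence_engine(path: str):
    return ("g4rbl3d", 0.3)


def accurate_engine(path: str):
    return ("Invoice 2024", 0.92)


engines = [
    ("stub_failing", failing_engine),
    ("stub_low", low_confidence_engine),
    ("stub_accurate", accurate_engine),
]
result = ocr_with_fallback("scan.png", engines)
print(result)
```

Returning the best weak result, rather than nothing, keeps downstream stages supplied with text while still recording which engine produced it.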

`abstract_videos` — Video Pipeline

Multimodal processing for video content.

Video ingestion + metadata registry
Whisper transcription + frame OCR
NLP enrichment (titles, keywords, summaries)
Structured persistence (JSONB + filesystem)

Output: searchable, enriched video data

`abstract_hugpy` — NLP / ML Layer

Content understanding and enrichment.

Summarization pipelines (chunked + consolidated)
Keyword extraction and refinement
Metadata generation and scoring

Output: semantic understanding of content
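
A chunked-then-consolidated summarization pass can be sketched as follows: split long text into chunks, summarize each chunk, then summarize the joined partial summaries. The `summarize` argument is any text-to-text callable; `first_sentence` is only a stand-in for a real model, and all names and the word-based chunking are assumptions for illustration.

```python
from typing import Callable


def chunk_text(text: str, max_words: int = 50) -> list:
    """Split text into chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_chunked(text: str, summarize: Callable[[str], str], max_words: int = 50) -> str:
    """Summarize each chunk, then consolidate the partial summaries in one final pass."""
    chunks = chunk_text(text, max_words)
    if not chunks:
        return ""
    partials = [summarize(chunk) for chunk in chunks]
    if len(partials) == 1:
        # Short input: a consolidation pass would add nothing.
        return partials[0]
    return summarize(" ".join(partials))


def first_sentence(text: str) -> str:
    # Stand-in summarizer: keeps only the first sentence of its input.
    return text.split(".")[0].strip() + "."


summary = summarize_chunked(
    "Alpha one. Beta two. Gamma three. Delta four.", first_sentence, max_words=4
)
print(summary)
```

Passing the summarizer in as a callable keeps the chunking logic independent of whichever model performs the actual summarization.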

Architecture Principles

1. Layered Processing: Each stage (ingestion, extraction, enrichment, persistence) is isolated, with no tight coupling between layers.

2. Structured Over Raw: Every output is stored as JSON with typed metadata and normalized fields, not as raw blobs.

3. Deterministic Pipelines: Processing is idempotent, execution is resumable, and state is tracked explicitly.

4. Local-First, Cloud-Optional: Runs entirely on local infrastructure with no dependency on managed services. External APIs are optional enhancements.

5. Multimodal Convergence: Combines text (OCR + transcription), images (frame analysis), and documents (PDF parsing) into a single unified data model.
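
The third principle (idempotent, resumable processing with explicit state) can be illustrated with a small state file keyed by content hash. This is a sketch under stated assumptions: `StateStore`, `process`, and the JSON layout are hypothetical and do not describe how the platform actually tracks state.

```python
import hashlib
import json
from pathlib import Path


def content_key(path: Path) -> str:
    # Key work by file content, so renaming a file does not trigger reprocessing.
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


class StateStore:
    """Records which stages have completed for each content key in a JSON file."""

    def __init__(self, state_path):
        self.state_path = Path(state_path)
        if self.state_path.exists():
            self.state = json.loads(self.state_path.read_text(encoding="utf-8"))
        else:
            self.state = {}

    def is_done(self, key: str, stage: str) -> bool:
        return stage in self.state.get(key, [])

    def mark_done(self, key: str, stage: str) -> None:
        self.state.setdefault(key, []).append(stage)
        # Persist after every stage so an interrupted run can resume.
        self.state_path.write_text(json.dumps(self.state, indent=2), encoding="utf-8")


def process(path, stages: list, store: StateStore) -> list:
    """Run only the (name, function) stages not yet recorded; return the names that ran."""
    key = content_key(path)
    ran = []
    for name, fn in stages:
        if store.is_done(key, name):
            continue
        fn(path)
        store.mark_done(key, name)
        ran.append(name)
    return ran
```

Running `process` twice on the same file performs the work once; if a stage raises, the stages already marked done are skipped on the next run.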

Persistence Model

The filesystem stores media assets (images, thumbnails, audio) and page-level text in structured directories.

The database stores metadata, transcripts, keywords, and aggregated outputs as JSONB.
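
The split between filesystem and database might look like the sketch below, which writes page text to disk and stores the metadata as JSON in a table row. Note the assumptions: SQLite with a JSON text column is used here only as a self-contained stand-in for a JSONB column, and the table name, directory layout, and function names are hypothetical.

```python
import json
import sqlite3
from pathlib import Path


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    # The metadata column holds serialized JSON (stand-in for JSONB).
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page_metadata ("
        "doc_id TEXT, page_number INTEGER, text_path TEXT, metadata TEXT, "
        "PRIMARY KEY (doc_id, page_number))"
    )
    return conn


def persist_page(root, conn, doc_id: str, page_number: int, text: str, metadata: dict) -> Path:
    """Write page text under root/doc_id/pages and record its metadata in the database."""
    page_dir = Path(root) / doc_id / "pages"
    page_dir.mkdir(parents=True, exist_ok=True)
    text_path = page_dir / f"page_{page_number:04d}.txt"
    text_path.write_text(text, encoding="utf-8")
    # INSERT OR REPLACE keeps the write idempotent for a given (doc_id, page_number).
    conn.execute(
        "INSERT OR REPLACE INTO page_metadata (doc_id, page_number, text_path, metadata) "
        "VALUES (?, ?, ?, ?)",
        (doc_id, page_number, str(text_path), json.dumps(metadata)),
    )
    conn.commit()
    return text_path
```

Keeping large assets on disk and only paths plus metadata in the database keeps rows small while leaving the files directly servable.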

What This Enables

Searchable media archives
SEO-driven content platforms
Document + video knowledge bases
LLM-ready datasets
Automated content pipelines
