martin-devido/syndicate_pdf_table_extractor

Syndicate PDF Table Extractor

Open-sourced by AutonCorp - Advanced engineering automation

A table extraction system for PDF datasheets. It extracts real data tables while filtering out schematics, pin diagrams, and other false positives.

Built for engineers who need clean, readable data from complex technical documents.

Features

  • Smart Table Detection: Uses geometric analysis to identify real data tables
  • Beautiful ASCII Rendering: Renders tables as box-drawn ASCII art with aligned columns
  • False Positive Filtering: Automatically rejects circuit schematics, corrupted content, and artifacts
  • Multiple Output Formats: ASCII art, Markdown, and raw data
  • Manufacturer Agnostic: Works with datasheets from any manufacturer

Quick Start

Installation

pip install -r requirements.txt

Usage

# Extract tables from first page
python demo.py your_datasheet.pdf

# Extract from specific page (1-indexed)
python demo.py your_datasheet.pdf 3
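
Because the page argument is 1-indexed while PyMuPDF pages are 0-indexed, the conversion has to happen somewhere in the command-line layer. The sketch below shows one way to do it. `parse_args` is a hypothetical helper written for illustration, and the actual `demo.py` may handle its arguments differently.

```python
def parse_args(argv):
    """Return (pdf_path, zero_based_page_index) from a demo.py-style argument list.

    Hypothetical helper for illustration only; not taken from demo.py.
    """
    if not 1 <= len(argv) <= 2:
        raise ValueError("usage: demo.py <pdf_path> [page_number]")
    pdf_path = argv[0]
    # The page number is 1-indexed on the command line and defaults to page 1.
    page_number = int(argv[1]) if len(argv) == 2 else 1
    if page_number < 1:
        raise ValueError("page_number must be 1 or greater")
    # PyMuPDF indexes pages from 0, so shift by one.
    return pdf_path, page_number - 1


pdf_path, page_index = parse_args(["your_datasheet.pdf", "3"])
print(pdf_path, page_index)
```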

Programmatic Usage

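import fitz  # PyMuPDF

from smart_table_extractor import SmartTableExtractor
from true_geometric_renderer import TrueGeometricRenderer

# Open the PDF and select the first page (PyMuPDF pages are 0-indexed)
doc = fitz.open("datasheet.pdf")
try:
    page = doc[0]

    # Extract validated tables plus any zones flagged as corrupted
    extractor = SmartTableExtractor()
    tables, corrupted_zones = extractor.extract_tables_from_page(page)

    # Render from PyMuPDF's own table objects
    renderer = TrueGeometricRenderer()
    pymupdf_tables = page.find_tables()

    # Tables are paired by index, so skip any index PyMuPDF did not report
    for i, table_data in enumerate(tables):
        if i < len(pymupdf_tables.tables):
            ascii_art = renderer.render(pymupdf_tables.tables[i], page)
            print(ascii_art)
finally:
    # Close the document even if extraction or rendering fails
    doc.close()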

What Makes It Smart?

Geometric Intelligence

  • Validates table structure (minimum rows/columns, reasonable dimensions)
  • Detects oversized cells, which usually indicate a graph rather than a table
  • Identifies over-segmentation (too many empty cells)
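
The geometric checks above can be sketched as a small set of heuristics. This is a minimal illustration, not the code in `smart_table_extractor.py`: the `CandidateTable` class, the `geometric_issues` function, and every threshold value are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF points


@dataclass
class CandidateTable:
    """Hypothetical container for a detected table candidate."""
    bbox: BBox
    rows: List[List[Optional[str]]]  # cell text; None or "" marks an empty cell


def geometric_issues(table, page_area, min_rows=2, min_cols=2,
                     max_cell_fraction=0.1, max_empty_ratio=0.6):
    """Return a list of reasons to reject the candidate (empty list = keep).

    All thresholds are illustrative assumptions.
    """
    n_rows = len(table.rows)
    n_cols = max((len(row) for row in table.rows), default=0)
    if n_rows < min_rows or n_cols < min_cols:
        # Too small to be a data table; the other checks are not meaningful.
        return ["too_few_rows_or_columns"]

    issues = []
    x0, y0, x1, y1 = table.bbox
    avg_cell_area = (x1 - x0) * (y1 - y0) / (n_rows * n_cols)
    if avg_cell_area > max_cell_fraction * page_area:
        # Very large cells suggest a graph or figure frame.
        issues.append("oversized_cells")

    cells = [cell for row in table.rows for cell in row]
    empty = sum(1 for cell in cells if not (cell or "").strip())
    if empty / len(cells) > max_empty_ratio:
        # Mostly empty cells suggest the grid was over-segmented.
        issues.append("over_segmented")
    return issues


letter_page_area = 612 * 792  # US Letter in points
ratings = CandidateTable(
    bbox=(50.2, 120.5, 300.8, 180.2),
    rows=[["Parameter", "Symbol", "Value"], ["Supply Voltage", "VCC", "16V"]],
)
print(geometric_issues(ratings, letter_page_area))
```

Returning a list of reasons rather than a bare boolean makes it easier to see why a candidate was rejected when tuning thresholds.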

Content Analysis

  • Curve Detection: Rejects candidate tables containing curves or circles, which typically indicate circuit schematics
  • Text Corruption Detection: Identifies garbled text caused by encoding issues
  • Content Density: Ensures tables aren't mostly empty
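
A rough sketch of how these three content checks could work is shown below. It assumes the vector drawings arrive in the structure returned by PyMuPDF's `page.get_drawings()`, where each path is a dict with a `"rect"` bounding box and an `"items"` list whose Bézier curve entries start with `"c"`. The function names and the definition of a "corrupted" character are assumptions for illustration, not the project's actual implementation.

```python
import unicodedata


def rects_overlap(a, b):
    """True if two (x0, y0, x1, y1) rectangles touch or overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def count_curves(drawings, table_bbox):
    """Count curve segments in drawing paths that overlap the table area.

    `drawings` is assumed to follow the page.get_drawings() structure.
    """
    count = 0
    for path in drawings:
        if rects_overlap(tuple(path["rect"]), table_bbox):
            count += sum(1 for item in path["items"] if item[0] == "c")
    return count


def corruption_ratio(text):
    """Fraction of non-space characters that look like encoding damage.

    Counts U+FFFD plus control, private-use, and unassigned code points.
    """
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    bad = sum(
        1 for ch in chars
        if ch == "\ufffd" or unicodedata.category(ch) in ("Cc", "Co", "Cn")
    )
    return bad / len(chars)


def content_density(rows):
    """Fraction of cells that contain non-blank text."""
    cells = [cell for row in rows for cell in row]
    if not cells:
        return 0.0
    filled = sum(1 for cell in cells if cell and str(cell).strip())
    return filled / len(cells)
```

With a real page, the drawings would come from `page.get_drawings()` and the bounding box from the candidate table. A caller would then compare each result against a threshold of its choosing.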

Table Classification

Automatically classifies tables as:

  • maximum_ratings
  • electrical_characteristics
  • pin_configuration
  • device_information
  • operating_conditions
  • And more...
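
One simple way to produce labels like these is keyword matching on the table's title or header text. The sketch below is only an illustration of that idea: the keyword lists, the `classify_table` function, and the `"unknown"` fallback label are assumptions, and the project's own classifier may work differently.

```python
# Category names come from the list above; the keywords are illustrative guesses.
CATEGORY_KEYWORDS = {
    "maximum_ratings": ("absolute maximum", "maximum rating"),
    "electrical_characteristics": ("electrical characteristic", "dc characteristic",
                                   "ac characteristic"),
    "pin_configuration": ("pin configuration", "pin description", "pinout"),
    "device_information": ("device information", "ordering information"),
    "operating_conditions": ("recommended operating", "operating condition"),
}


def classify_table(header_text):
    """Return the category whose keywords match the text most often.

    Falls back to "unknown" (a hypothetical label) when nothing matches.
    """
    text = header_text.lower()
    best_category, best_hits = "unknown", 0
    for category, keywords in CATEGORY_KEYWORDS.items():
        hits = sum(1 for keyword in keywords if keyword in text)
        if hits > best_hits:
            best_category, best_hits = category, hits
    return best_category


print(classify_table("Absolute Maximum Ratings"))
```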

Example Output

Command Line Demo

[Image: Terminal Demo]

Beautiful ASCII Tables

TABLE 1: Maximum Ratings
Size: 4 rows × 3 columns
Position: (50.2, 120.5, 300.8, 180.2)

Beautiful ASCII Rendering:
┌─────────────────────┬────────┬─────────┐
│ Parameter           │ Symbol │ Value   │
├─────────────────────┼────────┼─────────┤
│ Supply Voltage      │ VCC    │ 16V     │
│ Input Voltage       │ VIN    │ VCC     │
│ Operating Temp      │ TA     │ 70°C    │
└─────────────────────┴────────┴─────────┘
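
For readers curious how a box like this can be produced, here is a minimal text-only sketch. It sizes each column by its longest cell and treats the first row as the header. `render_ascii` is a hypothetical function and deliberately simpler than `TrueGeometricRenderer`, which takes the PyMuPDF table and page as input.

```python
def render_ascii(rows):
    """Render rows of cell text as a box-drawn table; the first row is the header."""
    n_cols = max(len(row) for row in rows)
    # Pad short rows so every row has the same number of cells.
    padded = [[str(c) for c in row] + [""] * (n_cols - len(row)) for row in rows]
    # Each column is as wide as its longest cell.
    widths = [max(len(row[col]) for row in padded) for col in range(n_cols)]

    def rule(left, mid, right):
        return left + mid.join("─" * (w + 2) for w in widths) + right

    def line(row):
        cells = (f" {cell:<{w}} " for cell, w in zip(row, widths))
        return "│" + "│".join(cells) + "│"

    out = [rule("┌", "┬", "┐"), line(padded[0]), rule("├", "┼", "┤")]
    out += [line(row) for row in padded[1:]]
    out.append(rule("└", "┴", "┘"))
    return "\n".join(out)


print(render_ascii([
    ["Parameter", "Symbol", "Value"],
    ["Supply Voltage", "VCC", "16V"],
    ["Operating Temp", "TA", "70°C"],
]))
```

The sketch measures width with `len()`, so it assumes every character occupies one terminal column. Wide characters such as CJK text would need a display-width calculation instead.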

Before vs After

Original PDF Table:

[Image: Original PDF]

↓ Transformed into clean ASCII ↓

[Image: Terminal Output]

Components

  • smart_table_extractor.py: Core table detection and validation
  • true_geometric_renderer.py: Beautiful ASCII table rendering
  • demo.py: Example usage and command-line interface

Requirements

  • Python 3.7+
  • PyMuPDF (fitz) 1.23.0+ (the release that introduced Page.find_tables())

About AutonCorp

AutonCorp specializes in advanced engineering automation and productivity tools.

This table extractor emerged from our need to efficiently parse technical documentation and extract structured data from complex PDFs.

License

Open Source - Share with friends and build amazing things!

Part of AutonCorp's mission to democratize advanced engineering tools.

About

PDF table extractor built on PyMuPDF that extracts vectorized tables and renders them as ASCII and Markdown for agentic use.
