
📊 Data Summarizer for LLMs

Generate compact, context-rich dataset summaries for LLM context injection. Optimized for the context windows of Gemini, ChatGPT, and Claude.



Why this tool?

🔒 Privacy First: your data never leaves your machine

No API calls, no cloud uploads. The container runs entirely locally and processes your files in memory. Nothing is sent anywhere.

📂 Batch Processing: one folder, all your files

Drop all your datasets (CSV, Excel, JSON, Parquet) into a single folder and run once. Every file gets its own summary.

⚡ Blazing Fast: powered by Polars (Rust)

Analysis is handled by Polars, a Rust-based DataFrame engine, so even large files are processed in seconds.


Architecture Schema


🚀 Quick Start

No installation required. Docker only.

Step 1 β€” Create your working folders

mkdir -p input output

That's it: input/ for your files, output/ for the summaries.

⚠️ Do not skip this step, especially on Linux / macOS. If these folders don't exist before you run the container, Docker creates them as root and the container won't be able to write your results. See Troubleshooting if you hit a PermissionError.

Step 2 β€” Drop your files

Copy any .csv, .xlsx, .xls, .json, or .parquet files into input/.

Step 3 β€” Run

Linux / macOS:

docker run --rm \
  -v "$(pwd)/input:/app/data/input" \
  -v "$(pwd)/output:/app/data/output" \
  abguven/data-summarizer:latest

Windows (PowerShell):

docker run --rm `
  -v "${PWD}/input:/app/data/input" `
  -v "${PWD}/output:/app/data/output" `
  abguven/data-summarizer:latest

Want to keep execution logs? Add -v "$(pwd)/logs:/app/logs" to the command (create the logs/ folder first with mkdir logs).

That's it. A SUMMARY_<filename>.md file is generated in output/ for each file processed.
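To make the naming convention concrete, here is a minimal stdlib sketch (illustrative, not the project's code) of how each input file maps to its summary file:

```python
from pathlib import Path

SUPPORTED = {".csv", ".xlsx", ".xls", ".json", ".parquet"}

def summary_name(path: Path) -> str:
    """employees.csv -> SUMMARY_employees.csv.md (the full filename is kept)."""
    return f"SUMMARY_{path.name}.md"

input_dir, output_dir = Path("input"), Path("output")
if input_dir.exists():
    for f in sorted(input_dir.iterdir()):
        if f.suffix.lower() in SUPPORTED:
            print(output_dir / summary_name(f))
```

Keeping the original extension inside the summary name means two files like `sales.csv` and `sales.json` never collide in `output/`.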


📄 Output Example

Given a file employees.csv, the tool generates SUMMARY_employees.csv.md:

# 📊 Dataset Summary: employees.csv
- **Rows:** 1000
- **Columns:** 5

## 🧱 Column Details
| Column    | Type    | Missing | Unique | Stats / Distribution                | Examples               |
|-----------|---------|---------|--------|-------------------------------------|------------------------|
| name      | String  | 0.0%    | 1000   |                                     | Alice, Bob, Charlie    |
| age       | Int64   | 2.0%    | 45     | Min:18 Max:75 Avg:42 `▂▃▅█▅▃▂`     | 25, 30, 35             |
| city      | String  | 0.5%    | 23     |                                     | Paris, Lyon, Marseille |
| salary    | Float64 | 0.0%    | 850    | Min:2000 Max:9500 Avg:4800 `▂▃▄▅▆` | 3200.0, 4500.0         |
| is_active | Boolean | 0.0%    | 2      |                                     | true, false            |

Paste this Markdown directly into your LLM prompt: no file upload needed, no tokens wasted.
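The distribution sparklines in the Stats column can be generated in a few lines of Python; this is an illustrative sketch, not the tool's implementation:

```python
BARS = "▁▂▃▄▅▆▇█"

def sparkline(counts: list[int]) -> str:
    """Map histogram bin counts onto eight Unicode block characters."""
    peak = max(counts) or 1  # avoid division by zero on all-empty bins
    return "".join(BARS[round((len(BARS) - 1) * c / peak)] for c in counts)

print(sparkline([1, 2, 5, 9, 5, 2, 1]))  # prints ▂▃▅█▅▃▂
```

A single row of block characters conveys the shape of a distribution in a handful of tokens, which is exactly what you want inside a context window.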


📦 Technical Specs

| Feature           | Detail                                                              |
|-------------------|---------------------------------------------------------------------|
| Base Image        | python:3.x-slim (Debian, auto-patched weekly; see Release Pipeline) |
| User              | appuser (UID 1000 / GID 1000), non-root                             |
| Supported Formats | .csv, .parquet, .json, .xlsx, .xls                                  |
| Engine            | Polars (Rust-based)                                                 |
| Image Size        | ~90 MB compressed (multi-stage build)                               |

πŸ” Troubleshooting

PermissionError on Linux / macOS

Symptom:

PermissionError: [Errno 13] Permission denied: '/app/data/output/...'

Cause: On Linux and macOS, if the output/ folder doesn't exist before you run the container, Docker creates it automatically, owned by root:root. The container runs as a non-root user (appuser) and cannot write to it.

Fix: Always create the folders yourself before running the container:

mkdir -p input output

πŸ‘©β€πŸ’» For Developers

This section is for contributors who want to modify the source code.

Setup

git clone https://github.com/abguven/data-summarizer-llm.git
cd data-summarizer-llm

# Create the local data folders
mkdir -p data/input data/output

Linux / macOS:

# Ubuntu/Debian: install venv if missing
sudo apt install python3-venv

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python src/summarize_dataset.py

Windows (PowerShell):

python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
python src/summarize_dataset.py

Drop your test files into data/input/; summaries will appear in data/output/.

Makefile commands (Linux / macOS only)

The Makefile uses bash and the Docker CLI, so it is not compatible with Windows PowerShell. Windows developers should run the docker build and docker run commands directly.

| Command      | Description                                               |
|--------------|-----------------------------------------------------------|
| `make build` | Build the Docker image locally (data-summarizer:local)    |
| `make demo`  | Copy sample data into data/input/ and run the local image |
| `make test`  | Run the functional test suite against the local image     |
| `make help`  | List all available commands                               |

Workflow (Linux / macOS)

# 1. Build your local image after making changes
make build

# 2. Smoke test with sample data
make demo

# 3. Run the full test suite
make test

Release Pipeline

| Trigger                   | Workflow     | Effect                                                                  |
|---------------------------|--------------|-------------------------------------------------------------------------|
| Pull request → main       | CI           | Build image + run functional tests                                      |
| Dependabot merge to main  | Auto Release | Build, test, push a patch version (e.g. v1.4.1) + latest to Docker Hub  |
| Manual git tag v*         | Release      | Build, test, push a versioned image + latest to Docker Hub              |

Security patches to the base OS are handled automatically: Dependabot opens a PR weekly, CI validates it, and a new patch version is published to Docker Hub without any manual action.

Contributing

Feel free to open issues or submit PRs. Ideas welcome:

  • SQL database support
  • Additional output formats (JSON, HTML)
  • More advanced statistics (percentiles, correlation)

Maintained by abguven.