
📊 Data Summarizer for LLMs

Generate compact, context-rich dataset summaries for LLM context injection. Optimized for the context windows of Gemini, ChatGPT, and Claude.



Why this tool?

🔒 Privacy First: your data never leaves your machine

No API calls, no cloud uploads. The container runs entirely locally and processes your files in memory. Nothing is sent anywhere.

📂 Batch Processing: one folder, all your files

Drop all your datasets (CSV, Excel, JSON, Parquet) into a single folder and run once. Every file gets its own summary.

⚡ Blazing Fast: powered by Polars (Rust)

Analysis is handled by Polars, a Rust-based DataFrame engine, so even large files are processed in seconds.


Architecture Schema


🚀 Quick Start

No installation required. Docker only.

Step 1 β€” Create your working folders

mkdir -p input output

That's it: input/ for your files, output/ for the summaries.

⚠️ Do not skip this step, especially on Linux / macOS. If these folders don't exist before you run the container, Docker creates them as root and the container won't be able to write your results. See Troubleshooting if you hit a PermissionError.

Step 2 β€” Drop your files

Copy any .csv, .xlsx, .xls, .json, or .parquet files into input/.

Step 3 β€” Run

Linux / macOS:

docker run --rm \
  -v "$(pwd)/input:/app/data/input" \
  -v "$(pwd)/output:/app/data/output" \
  abguven/data-summarizer:latest

Windows (PowerShell):

docker run --rm `
  -v "${PWD}/input:/app/data/input" `
  -v "${PWD}/output:/app/data/output" `
  abguven/data-summarizer:latest

Want to keep execution logs? Add -v "$(pwd)/logs:/app/logs" to the command (create the logs/ folder first with mkdir logs).

That's it. A SUMMARY_<filename>.md file is generated in output/ for each file processed.
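To make the naming convention concrete, here is a minimal stdlib sketch (illustrative, not the project's code) of how each input file maps to its summary file:

```python
from pathlib import Path

SUPPORTED = {".csv", ".xlsx", ".xls", ".json", ".parquet"}

def summary_name(path: Path) -> str:
    """employees.csv -> SUMMARY_employees.csv.md (the full filename is kept)."""
    return f"SUMMARY_{path.name}.md"

input_dir, output_dir = Path("input"), Path("output")
if input_dir.exists():
    for f in sorted(input_dir.iterdir()):
        if f.suffix.lower() in SUPPORTED:
            print(output_dir / summary_name(f))
```

Keeping the original extension inside the summary name means two files like `sales.csv` and `sales.json` never collide in `output/`.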


📄 Output Example

Given a file employees.csv, the tool generates SUMMARY_employees.csv.md:

# 📊 Dataset Summary: employees.csv
- **Rows:** 1000
- **Columns:** 5

## 🧱 Column Details
| Column    | Type    | Missing | Unique | Stats / Distribution                | Examples               |
|-----------|---------|---------|--------|-------------------------------------|------------------------|
| name      | String  | 0.0%    | 1000   |                                     | Alice, Bob, Charlie    |
| age       | Int64   | 2.0%    | 45     | Min:18 Max:75 Avg:42 `▂▃▅█▅▃▂`     | 25, 30, 35             |
| city      | String  | 0.5%    | 23     |                                     | Paris, Lyon, Marseille |
| salary    | Float64 | 0.0%    | 850    | Min:2000 Max:9500 Avg:4800 `▂▃▄▅▆` | 3200.0, 4500.0         |
| is_active | Boolean | 0.0%    | 2      |                                     | true, false            |

Paste this Markdown directly into your LLM prompt: no file upload needed, no tokens wasted.
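The distribution sparklines in the Stats column can be generated in a few lines of Python; this is an illustrative sketch, not the tool's implementation:

```python
BARS = "▁▂▃▄▅▆▇█"

def sparkline(counts: list[int]) -> str:
    """Map histogram bin counts onto eight Unicode block characters."""
    peak = max(counts) or 1  # avoid division by zero on all-empty bins
    return "".join(BARS[round((len(BARS) - 1) * c / peak)] for c in counts)

print(sparkline([1, 2, 5, 9, 5, 2, 1]))  # prints ▂▃▅█▅▃▂
```

A single row of block characters conveys the shape of a distribution in a handful of tokens, which is exactly what you want inside a context window.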


📦 Technical Specs

| Feature           | Detail                                                              |
|-------------------|---------------------------------------------------------------------|
| Base Image        | python:3.x-slim (Debian, auto-patched weekly; see Release Pipeline) |
| User              | appuser (UID 1000 / GID 1000), non-root                             |
| Supported Formats | .csv, .parquet, .json, .xlsx, .xls                                  |
| Engine            | Polars (Rust-based)                                                 |
| Image Size        | ~90 MB compressed (multi-stage build)                               |

πŸ” Troubleshooting

PermissionError on Linux / macOS

Symptom:

PermissionError: [Errno 13] Permission denied: '/app/data/output/...'

Cause: On Linux and macOS, if the output/ folder doesn't exist before you run the container, Docker creates it automatically, owned by root:root. The container runs as a non-root user (appuser) and cannot write to it.

Fix: Always create the folders yourself before running the container:

mkdir -p input output

πŸ‘©β€πŸ’» For Developers

This section is for contributors who want to modify the source code.

Setup

git clone https://github.com/abguven/data-summarizer-llm.git
cd data-summarizer-llm

# Create the local data folders
mkdir -p data/input data/output

Linux / macOS:

# Ubuntu/Debian: install venv if missing
sudo apt install python3-venv

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python src/summarize_dataset.py

Windows (PowerShell):

python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
python src/summarize_dataset.py

Drop your test files into data/input/; summaries will appear in data/output/.

Makefile commands (Linux / macOS only)

The Makefile uses bash and the Docker CLI, so it is not compatible with Windows PowerShell. Windows developers should run the docker build and docker run commands directly.

| Command      | Description                                               |
|--------------|-----------------------------------------------------------|
| `make build` | Build the Docker image locally (data-summarizer:local)    |
| `make demo`  | Copy sample data into data/input/ and run the local image |
| `make test`  | Run the functional test suite against the local image     |
| `make help`  | List all available commands                               |

Workflow (Linux / macOS)

# 1. Build your local image after making changes
make build

# 2. Smoke test with sample data
make demo

# 3. Run the full test suite
make test

Release Pipeline

| Trigger                   | Workflow     | Effect                                                                  |
|---------------------------|--------------|-------------------------------------------------------------------------|
| Pull request → main       | CI           | Build image + run functional tests                                      |
| Dependabot merge to main  | Auto Release | Build, test, push a patch version (e.g. v1.4.1) + latest to Docker Hub  |
| Manual git tag v*         | Release      | Build, test, push a versioned image + latest to Docker Hub              |

Security patches to the base OS are handled automatically: Dependabot opens a PR weekly, CI validates it, and a new patch version is published to Docker Hub without any manual action.

Contributing

Feel free to open issues or submit PRs. Ideas welcome:

  • SQL database support
  • Additional output formats (JSON, HTML)
  • More advanced statistics (percentiles, correlation)

Maintained by abguven.