Generate compact, context-rich dataset summaries for LLM Context Injection. Optimized for Gemini, ChatGPT, and Claude context windows.
No API calls, no cloud uploads. The container runs entirely locally and processes your files in-memory. Nothing is sent anywhere.
Drop all your datasets (CSV, Excel, JSON, Parquet) into a single folder and run once. Every file gets its own summary.
Analysis is handled by Polars, a Rust-based DataFrame engine. Even large files are processed in seconds.
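To illustrate the kind of per-column statistics involved (missing %, unique count, min/max/mean), here is a minimal standard-library sketch. This is not the tool's actual code, which uses Polars and handles far larger files:

```python
# Illustrative sketch of per-column summary statistics
# (the real tool computes these with Polars, not the csv module).
import csv, io

def summarize(csv_text):
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    summary = {}
    for col in rows[0]:
        values = [r[col] for r in rows]
        present = [v for v in values if v != ""]
        stats = {
            "missing_pct": 100 * (len(values) - len(present)) / len(values),
            "unique": len(set(present)),
        }
        try:  # numeric columns also get min / max / mean
            nums = [float(v) for v in present]
            stats.update(min=min(nums), max=max(nums), avg=sum(nums) / len(nums))
        except ValueError:
            pass
        summary[col] = stats
    return summary

print(summarize("age,city\n25,Paris\n30,Lyon\n,Paris\n"))
```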
No installation required. Docker only.
```bash
mkdir -p input output
```

That's it: `input/` for your files, `output/` for the summaries.
⚠️ **Do not skip this step**, especially on Linux / macOS. If these folders don't exist before running the container, Docker creates them as `root` and the container won't be able to write your results. See Troubleshooting if you hit a `PermissionError`.
Copy any `.csv`, `.xlsx`, `.xls`, `.json`, or `.parquet` files into `input/`.
Linux / macOS:

```bash
docker run --rm \
  -v "$(pwd)/input:/app/data/input" \
  -v "$(pwd)/output:/app/data/output" \
  abguven/data-summarizer:latest
```

Windows (PowerShell):

```powershell
docker run --rm `
  -v "${PWD}/input:/app/data/input" `
  -v "${PWD}/output:/app/data/output" `
  abguven/data-summarizer:latest
```

Want to keep execution logs? Add `-v "$(pwd)/logs:/app/logs"` to the command (create the `logs/` folder first with `mkdir logs`).
That's it. A `SUMMARY_<filename>.md` file is generated in `output/` for each file processed.
Given a file `employees.csv`, the tool generates `SUMMARY_employees.csv.md`:
```markdown
# 📊 Dataset Summary: employees.csv

- **Rows:** 1000
- **Columns:** 5

## 🧱 Column Details

| Column    | Type    | Missing | Unique | Stats / Distribution                | Examples               |
|-----------|---------|---------|--------|-------------------------------------|------------------------|
| name      | String  | 0.0%    | 1000   |                                     | Alice, Bob, Charlie    |
| age       | Int64   | 2.0%    | 45     | Min:18 Max:75 Avg:42 `▁▃▅▇▅▃▁`      | 25, 30, 35             |
| city      | String  | 0.5%    | 23     |                                     | Paris, Lyon, Marseille |
| salary    | Float64 | 0.0%    | 850    | Min:2000 Max:9500 Avg:4800 `▂▅▇▅▂`  | 3200.0, 4500.0         |
| is_active | Boolean | 0.0%    | 2      |                                     | true, false            |
```

Paste this Markdown directly into your LLM prompt: no file upload needed, no tokens wasted.
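The histogram sparklines in the "Stats / Distribution" column can be produced by binning values and mapping each bin's count onto Unicode block characters. Here is a hypothetical sketch of that technique (not the tool's actual implementation; function name and bin count are illustrative):

```python
# Hypothetical sparkline generator: bin numeric values, then scale each
# bin count to one of eight Unicode block characters.
BLOCKS = "▁▂▃▄▅▆▇█"

def sparkline(values, bins=8):
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1          # avoid zero-width bins
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    peak = max(counts)
    return "".join(BLOCKS[int(c / peak * (len(BLOCKS) - 1))] for c in counts)

print(sparkline([18, 22, 25, 30, 35, 42, 42, 50, 61, 75]))
```

Each character's height is proportional to how many values fall in that bin, which is enough to convey skew and spread inside a single table cell.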
| Feature | Detail |
|---|---|
| Base Image | `python:3.x-slim` (Debian, auto-patched weekly; see Release Pipeline) |
| User | `appuser` (UID 1000 / GID 1000), non-root |
| Supported Formats | `.csv`, `.parquet`, `.json`, `.xlsx`, `.xls` |
| Engine | Polars (Rust-based) |
| Image Size | ~90 MB compressed (multi-stage build) |
**Symptom:**

```
PermissionError: [Errno 13] Permission denied: '/app/data/output/...'
```
**Cause:** On Linux and macOS, if the `output/` folder doesn't exist before running the container, Docker creates it automatically as `root:root`. The container runs as a non-root user (`appuser`) and cannot write to it.
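If you want to sanity-check this from the host before re-running, a small standard-library sketch (illustrative, not part of the tool):

```python
# Illustrative pre-flight check: make sure input/ and output/ exist and are
# writable by the current user before starting the container.
import os

for d in ("input", "output"):
    os.makedirs(d, exist_ok=True)   # no-op if the folder already exists
    if not os.access(d, os.W_OK):
        raise SystemExit(f"'{d}' is not writable; was it created by root?")
print("folders OK")
```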
**Fix:** Always create the folders yourself before running the container:
```bash
mkdir -p input output
```

This section is for contributors who want to modify the source code.
```bash
git clone https://github.com/abguven/data-summarizer-llm.git
cd data-summarizer-llm

# Create the local data folders
mkdir -p data/input data/output
```

Linux / macOS:

```bash
# Ubuntu/Debian: install venv if missing
sudo apt install python3-venv

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python src/summarize_dataset.py
```

Windows (PowerShell):

```powershell
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
python src/summarize_dataset.py
```

Drop your test files into `data/input/`; summaries will appear in `data/output/`.
The `Makefile` uses bash and the Docker CLI; it is not compatible with Windows PowerShell. Windows developers should use the `docker build` and `docker run` commands directly.
| Command | Description |
|---|---|
| `make build` | Build the Docker image locally (`data-summarizer:local`) |
| `make demo` | Copy sample data into `data/input/` and run the local image |
| `make test` | Run the functional test suite against the local image |
| `make help` | List all available commands |
```bash
# 1. Build your local image after making changes
make build

# 2. Smoke test with sample data
make demo

# 3. Run the full test suite
make test
```

| Trigger | Workflow | Effect |
|---|---|---|
| Pull request → `main` | CI | Build image + run functional tests |
| Dependabot merges to `main` | Auto Release | Build, test, push patch version (e.g. `v1.4.1`) + `latest` to Docker Hub |
| Manual git tag `v*` | Release | Build, test, push versioned image + `latest` to Docker Hub |
Security patches to the base OS are handled automatically: Dependabot opens a PR weekly, CI validates it, and a new patch version is published on Docker Hub without any manual action.
Feel free to open issues or submit PRs. Ideas welcome:
- SQL database support
- Additional output formats (JSON, HTML)
- More advanced statistics (percentiles, correlation)
Maintained by abguven.
