Welcome to Crump

Examines and syncs CSV, Parquet, and CDF files into PostgreSQL or SQLite databases in batches, using easy-to-edit configuration files.

Overview

crump is a command-line tool and Python library for syncing CSV, Parquet, and CDF files to PostgreSQL or SQLite databases, and for extracting data from CDF files. It provides a declarative, configuration-based approach to data synchronization with automatic schema management.

Key Features

Data File Support

CSV Support: Read and sync standard CSV files
Native CDF Processing: Built-in support for Common Data Format (CDF) science files
Automatic Extraction: Extracts CDF variables to CSV, Parquet, or directly to database
Array Variable Handling: Automatically expands multi-dimensional array variables
Apache Parquet Support: Built-in support for Apache Parquet files, which can be synced directly to the database
Extract to Parquet: Convert CDF files to Parquet format with --parquet flag
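
To illustrate what expanding an array variable into flat columns can look like, here is a minimal sketch in plain Python. It is not crump's implementation: the function name `expand_array_columns` and the `name_index` column naming are assumptions made for this example only.

```python
def expand_array_columns(record):
    """Flatten list values into numbered scalar columns (naming is hypothetical)."""
    flat = {}
    for name, value in record.items():
        if isinstance(value, (list, tuple)):
            for index, item in enumerate(value):
                # Recurse so nested (multi-dimensional) arrays are flattened too
                flat.update(expand_array_columns({f"{name}_{index}": item}))
        else:
            flat[name] = value
    return flat


# One record with a scalar variable and a three-element array variable
record = {"epoch": "2024-01-15T00:00:00", "b_field": [1.0, 2.0, 3.0]}
expanded = expand_array_columns(record)
print(expanded)
```

Each array element becomes its own scalar column, which is the shape a CSV file or database table needs.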

Data Synchronization

Configuration-Based: Examines your CSV files with the prepare command, and defines sync jobs in YAML with sensible column mappings
Column Mapping: Sync all columns, rename them, or only sync a subset
Automatic Table Creation: Creates target tables if they don't exist
Schema Evolution: Automatically adds new columns as needed and never deletes existing ones. Optionally records data changes in a history table.
Index Management: Suggests and creates database indexes based on column types
Dual Interface: Use as a CLI tool or import as a Python library
Filename-Based Extraction: Extract values from filenames (dates, versions, etc.) and store in database columns
Automatic Cleanup: Delete stale records based on extracted filename values
Compound Primary Keys: Support for multi-column primary keys
Dry-Run Mode: Preview all changes without modifying the database
Idempotent Operations: Safe to run multiple times because rows are upserted
Rich Output: Beautiful terminal output with the Rich library
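
The upsert behaviour behind idempotent syncing can be sketched with the standard-library `sqlite3` module. This is an illustration of the general technique, not crump's own code: the helper `upsert_rows` and the `users` table are hypothetical, and the `ON CONFLICT ... DO UPDATE` syntax assumes SQLite 3.24 or newer.

```python
import sqlite3


def upsert_rows(conn, table, key, rows):
    """Insert rows, updating existing ones that share the same primary key."""
    columns = list(rows[0].keys())
    placeholders = ", ".join("?" for _ in columns)
    updates = ", ".join(f"{c} = excluded.{c}" for c in columns if c != key)
    sql = (
        f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders}) "
        f"ON CONFLICT({key}) DO UPDATE SET {updates}"
    )
    conn.executemany(sql, [tuple(r[c] for c in columns) for r in rows])
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

rows = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
upsert_rows(conn, "users", "id", rows)
upsert_rows(conn, "users", "id", rows)  # re-running creates no duplicates
upsert_rows(conn, "users", "id", [{"id": 2, "name": "Grace Hopper"}])  # update in place

result = conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()
print(result)
```

Running the same sync twice leaves the table unchanged, and a changed row replaces the old values for its key rather than adding a second row.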

Quick Example

uv pip install crump # or pip install crump if you prefer

# Create a configuration file
crump prepare users.csv --config crump_config.yml --job users_sync

# Look at the mapping it generated for you in crump_config.yml and edit as needed.
# Crump has mapped your columns and suggested keys and indexes

# Get ready to sync - your database must be available
export DATABASE_URL="sqlite:///test.db"
# Or for Postgres
# export DATABASE_URL="postgresql://user:pass@localhost:5432/mydb"

# preview changes first (requires --db-url or DATABASE_URL)
crump sync users.csv --config crump_config.yml --job users_sync --dry-run

# Sync the file to database
crump sync users.csv --config crump_config.yml --job users_sync

# Later that day, v2 of the file arrives.
# Sync the new file: old records from v1 are removed automatically,
# and updates are applied to rows that match on the primary key
crump sync users_v2.csv --config crump_config.yml --job users_sync

Example Configuration

jobs:
  daily_sales:
    target_table: sales
    id_mapping:
      sale_id: id
    filename_to_column:
      template: "sales_[date].csv"
      columns:
        date:
          db_column: sync_date
          type: date
          use_to_delete_old_rows: true
    columns:
      product_id: product_id
      amount: amount

This configuration:

Syncs sales_YYYY-MM-DD.csv files to the sales table
Extracts the date from the filename and stores it in the sync_date column
Automatically deletes stale records for the same date after each sync
Maps CSV columns to database columns
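
As a rough sketch of the idea behind `filename_to_column`, the snippet below turns a template such as `sales_[date].csv` into a regular expression, extracts the date, and then removes rows for that date that were not in the synced file. It is not crump's implementation: the helpers `template_to_regex` and `extract_values` are hypothetical, and the cleanup rule (delete rows with the same `sync_date` whose ids were not in the file) is an assumed reading of `use_to_delete_old_rows`.

```python
import re
import sqlite3
from datetime import date


def template_to_regex(template):
    """Turn 'sales_[date].csv' into a regex with one named group per [placeholder]."""
    # re.split with a capture group alternates: literal, name, literal, ...
    parts = re.split(r"\[(\w+)\]", template)
    pattern = "".join(
        f"(?P<{part}>.+?)" if i % 2 else re.escape(part)
        for i, part in enumerate(parts)
    )
    return re.compile(f"^{pattern}$")


def extract_values(template, filename):
    """Return the placeholder values found in filename, keyed by placeholder name."""
    match = template_to_regex(template).match(filename)
    if match is None:
        raise ValueError(f"{filename!r} does not match template {template!r}")
    return match.groupdict()


values = extract_values("sales_[date].csv", "sales_2024-01-15.csv")
sync_date = date.fromisoformat(values["date"])

# Assumed cleanup rule: after syncing, drop rows for the same date
# whose ids were not present in the file that was just synced.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL, sync_date TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, 10.0, "2024-01-15"), (2, 20.0, "2024-01-15"), (3, 5.0, "2024-01-14")],
)
synced_ids = [1]  # ids contained in the newly synced file
marks = ", ".join("?" for _ in synced_ids)
conn.execute(
    f"DELETE FROM sales WHERE sync_date = ? AND id NOT IN ({marks})",
    [sync_date.isoformat(), *synced_ids],
)
remaining = [row[0] for row in conn.execute("SELECT id FROM sales ORDER BY id")]
print(values, remaining)
```

Rows belonging to other dates are left alone, so only data that the new file supersedes is removed.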

Documentation

📚 Read the full documentation

Installation Guide - Install crump
Quick Start - Get started in 5 minutes
Configuration - YAML configuration reference
CLI Reference - Command-line documentation
Features - Detailed feature documentation
API Reference - Python API documentation
Development - Contributing guide

Programmatic Usage

from pathlib import Path
from crump import sync_csv_to_db, CrumpConfig

# Load configuration
config = CrumpConfig.from_yaml(Path("crump_config.yml"))
job = config.get_job("my_job")

# Sync CSV to database (PostgreSQL or SQLite)
rows_synced = sync_csv_to_db(
    csv_path=Path("data.csv"),
    job=job,
    db_connection_string="postgresql://localhost/mydb"
)
print(f"Synced {rows_synced} rows")

Development

# Clone repository
git clone https://github.com/alastairtree/crump.git
cd crump

# Install with development dependencies
uv sync --all-extras

# Run tests
uv run pytest -v

# Generate documentation locally
./generate-docs.sh

See the Development Guide for detailed instructions.

Contributing

Contributions are welcome! Please see the Contributing Guide for details.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

Acknowledgments

Built with Click, Rich, psycopg3, and pytest.
