
Metadata-Driven CDC Framework

A zero-code-per-table Change Data Capture framework built on Snowflake native features (Streams, Tasks, Stored Procedures). Add a new source table to config/tables.yaml and the orchestrator auto-provisions the entire pipeline.

Objective

Replace hardcoded per-table ETL pipelines with a single metadata-driven framework that:

  • Captures inserts, updates, and deletes in near real-time
  • Scales to any number of source tables without new SQL code per table
  • Provides full audit logging and pipeline health monitoring

Results

  • 50% reduction in pipeline latency (batch → near real-time)
  • 20+ source tables onboarded with zero manual SQL per table
  • Idempotent merges — safe to re-run on failure
  • Full audit trail of every CDC run

Architecture

SAP S/4HANA Source Tables (SAP_RAW schema)
        ↓
Snowflake Streams (capture INSERT / UPDATE / DELETE)
        ↓
Snowflake Tasks (scheduled every 1 minute)
  WHEN SYSTEM$STREAM_HAS_DATA → trigger only when changes exist
        ↓
Merge Stored Procedure (idempotent MERGE INTO target)
        ↓
Analytics-Ready Target Tables (ANALYTICS schema)
        ↓
Audit Log + Pipeline Health View (CDC_META schema)
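To make the Stream → Task → MERGE chain above concrete, here is a sketch of the kind of per-table DDL the orchestrator could render from metadata. The object names, schemas, and template shape are illustrative assumptions, not the contents of sql/03_stream_task_template.sql.

```python
# Illustrative only: renders Stream / MERGE / Task statements for one table
# entry. Names like CDC_META.<name>_STREAM are assumptions, not the
# framework's actual template.

def render_cdc_ddl(table: dict) -> dict:
    """Render Stream, idempotent MERGE, and Task statements for one table."""
    name, pk, cols = table["name"], table["primary_key"], table["columns"]
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in cols)
    col_list = ", ".join(cols)
    src_list = ", ".join(f"s.{c}" for c in cols)
    return {
        # Stream captures row-level changes on the raw source table.
        "stream": f"CREATE OR REPLACE STREAM CDC_META.{name}_STREAM "
                  f"ON TABLE SAP_RAW.{name};",
        # MERGE keyed on the primary key makes re-runs idempotent.
        "merge": (
            f"MERGE INTO ANALYTICS.{name} t "
            f"USING CDC_META.{name}_STREAM s ON t.{pk} = s.{pk} "
            f"WHEN MATCHED AND s.METADATA$ACTION = 'DELETE' THEN DELETE "
            f"WHEN MATCHED THEN UPDATE SET {set_clause} "
            f"WHEN NOT MATCHED THEN INSERT ({col_list}) VALUES ({src_list});"
        ),
        # Task only fires when the stream actually has data.
        "task": (
            f"CREATE OR REPLACE TASK CDC_META.{name}_TASK "
            f"WAREHOUSE = COMPUTE_WH SCHEDULE = '1 MINUTE' "
            f"WHEN SYSTEM$STREAM_HAS_DATA('CDC_META.{name}_STREAM') "
            f"AS CALL CDC_META.MERGE_{name}();"
        ),
    }
```

The `WHEN SYSTEM$STREAM_HAS_DATA` guard is what keeps the one-minute schedule cheap: the Task is skipped (and the warehouse stays idle) whenever the stream is empty.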

How It Works

  1. Define your tables in config/tables.yaml
  2. Run python src/orchestrator.py
  3. The orchestrator reads the config, generates SQL from the template, and provisions:
    • A Stream on each source table
    • A Stored Procedure with the merge logic
    • A Task that fires every minute when the stream has data
  4. Monitor with python src/monitor.py
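The provisioning loop in steps 1–3 can be sketched as below, assuming tables.yaml parses to a list of dicts with the keys shown under "Adding a New Table". The `execute` stub stands in for a Snowflake cursor; the statement text is abbreviated and hypothetical.

```python
# Minimal orchestrator sketch (not src/orchestrator.py itself). Config is
# shown inline; in the real framework it comes from config/tables.yaml.

config = [
    {"name": "PURCHASE_ORDERS", "primary_key": "PO_ID",
     "columns": ["PO_ID", "VENDOR_ID", "AMOUNT"], "scd_type": 1, "enabled": True},
    {"name": "LEGACY_ORDERS", "primary_key": "ID",
     "columns": ["ID"], "scd_type": 1, "enabled": False},
]

executed = []

def execute(sql: str) -> None:
    # Placeholder for cursor.execute(sql) via snowflake-connector-python.
    executed.append(sql)

def provision(table: dict) -> None:
    # One Stream, one merge procedure, one Task per enabled table.
    name = table["name"]
    execute(f"CREATE OR REPLACE STREAM {name}_STREAM ON TABLE SAP_RAW.{name};")
    execute(f"CREATE OR REPLACE PROCEDURE MERGE_{name}() ...;")  # from template
    execute(f"CREATE OR REPLACE TASK {name}_TASK ... AS CALL MERGE_{name}();")

for table in config:
    if table["enabled"]:  # disabled entries are skipped, not dropped
        provision(table)
```

Because every statement is `CREATE OR REPLACE`, re-running the orchestrator after editing the config is safe: existing objects are replaced in place rather than duplicated.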

Setup

Prerequisites

  • Snowflake account with SYSADMIN or equivalent role
  • Python 3.9+

Environment Variables

export SNOWFLAKE_USER=your_user
export SNOWFLAKE_PASSWORD=your_password
export SNOWFLAKE_ACCOUNT=your_account
export SNOWFLAKE_WAREHOUSE=COMPUTE_WH
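A script reading these variables might fail fast on a missing one rather than surface a cryptic connection error later; a minimal sketch (the `snowflake_params` helper is hypothetical, not part of src/):

```python
# Hypothetical helper: collect the four required variables into connection
# kwargs, raising early if any are unset.
import os

REQUIRED = ["SNOWFLAKE_USER", "SNOWFLAKE_PASSWORD",
            "SNOWFLAKE_ACCOUNT", "SNOWFLAKE_WAREHOUSE"]

def snowflake_params() -> dict:
    missing = [v for v in REQUIRED if not os.environ.get(v)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    # SNOWFLAKE_USER -> user, SNOWFLAKE_WAREHOUSE -> warehouse, etc.
    return {v.split("_", 1)[1].lower(): os.environ[v] for v in REQUIRED}
```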

Run

pip install -r requirements.txt

# 1. Create schemas and metadata tables
# Run sql/01_setup.sql in Snowflake worksheet

# 2. Load mock source data
# Run sql/02_mock_data.sql in Snowflake worksheet

# 3. Provision all CDC pipelines from config
python src/orchestrator.py

# 4. Monitor pipeline health
python src/monitor.py
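The kind of health check the monitor performs can be sketched as a pure function over audit-log rows. The column names (`table_name`, `run_ts`, `status`) are assumptions about the CDC_META audit schema, not its actual definition.

```python
# Illustrative health check: flag tables whose last successful CDC run is
# older than a freshness threshold.
from datetime import datetime, timedelta

def stale_pipelines(rows, now, max_age=timedelta(minutes=5)):
    """Return table names whose most recent SUCCESS run is older than max_age."""
    last_ok = {}
    for r in rows:
        if r["status"] == "SUCCESS":
            prev = last_ok.get(r["table_name"])
            if prev is None or r["run_ts"] > prev:
                last_ok[r["table_name"]] = r["run_ts"]
    return sorted(t for t, ts in last_ok.items() if now - ts > max_age)
```

Note that a table with no successful run at all would not appear here; a production monitor would also want to flag never-succeeded pipelines separately.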

Project Structure

cdc-framework/
├── config/
│   └── tables.yaml                 # Add new tables here
├── sql/
│   ├── 01_setup.sql                # Schemas + metadata tables
│   ├── 02_mock_data.sql            # Sample SAP source data
│   └── 03_stream_task_template.sql # Template used by orchestrator
├── src/
│   ├── orchestrator.py             # Auto-provisions Streams + Tasks
│   └── monitor.py                  # Pipeline health dashboard
└── requirements.txt

Adding a New Table

Simply add an entry to config/tables.yaml:

- name: PURCHASE_ORDERS
  primary_key: PO_ID
  columns: [PO_ID, VENDOR_ID, AMOUNT, STATUS, CREATED_AT]
  scd_type: 1
  enabled: true

Then re-run python src/orchestrator.py. That's it.
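A pre-flight check on new entries can catch typos before anything hits Snowflake. This validator is a hypothetical sketch; the required keys simply mirror the example entry above.

```python
# Hypothetical tables.yaml entry validator (not part of src/).

REQUIRED_KEYS = {"name", "primary_key", "columns", "scd_type", "enabled"}

def validate_entry(entry: dict) -> list:
    """Return a list of problems; an empty list means the entry looks sane."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - entry.keys())]
    if {"primary_key", "columns"} <= entry.keys() \
            and entry["primary_key"] not in entry["columns"]:
        problems.append("primary_key must appear in columns")
    return problems
```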
