DataWagon


Automated loading of YouTube Analytics CSV files into Google Cloud Storage buckets with BigQuery external table management.

Overview

DataWagon processes compressed YouTube Analytics files (.csv.gz, .csv.zip), validates filenames against regex patterns, extracts metadata, and uploads them to Google Cloud Storage with automatic partitioning by report date. It also manages BigQuery external tables that reference the uploaded files, enabling direct SQL queries without data duplication.
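
For example, the typical end-to-end flow (each command is detailed under Usage below) can run as a single chained invocation:

# Scan local files, compare with the bucket, upload, then create missing tables
datawagon files-in-local-fs compare-local-to-bucket upload-to-gcs create-bigquery-tables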

Background

This project was built to replace an existing process that used a bash script to load uncompressed files into a PostgreSQL database. The original script could not handle compressed files, required manually moving files in and out of an import directory, had no duplicate-file detection, and needed code changes plus manual table creation to add new types of files.

Key Features

  1. Automatic compressed file handling (.csv.gz, .csv.zip)
  2. Duplicate file detection and prevention
  3. Configurable file type patterns via TOML configuration
  4. GCS bucket integration with automatic partitioning
  5. Metadata extraction from filenames using regex
  6. Version-based folder organization for BigQuery external table mapping
  7. BigQuery external table management - create and list tables that reference GCS files
  8. Automatic schema detection from CSV headers
  9. Hive partitioning support for efficient date-based queries
  10. Comprehensive validation and user feedback

Prerequisites

  • Python 3.9 or higher (3.12 recommended)
  • Google Cloud Platform account with Storage and BigQuery access
  • Google Cloud credentials configured locally

Optional:

  • Poetry 2.0 or higher (for dependency management)
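
A quick sanity check of these prerequisites before setup (assuming python3 and the gcloud CLI are on your PATH):

# Verify Python version and Google Cloud SDK credentials
python3 --version   # should report 3.9 or higher
gcloud --version
gcloud auth list    # shows the currently authenticated account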

First-Time Setup

Option 1: Quick Setup (Without Poetry)

For users who want to use DataWagon without Poetry, choose your operating system:

Windows

git clone https://github.com/jtmcn/datawagon.git
cd datawagon
setup-venv.bat
.venv\Scripts\activate.bat
datawagon --help

This creates a standard Python virtual environment and installs DataWagon from source.

macOS / Linux

git clone https://github.com/jtmcn/datawagon.git
cd datawagon
./setup-venv.sh
source .venv/bin/activate
datawagon --help

This creates a standard Python virtual environment and installs DataWagon from source.

Option 2: Development Setup (With Poetry)

Poetry is recommended if you plan to modify dependencies or contribute to development.

1. Clone Repository

git clone https://github.com/jtmcn/datawagon.git
cd datawagon

2. Install Poetry

If you don't have Poetry installed:

# macOS/Linux
curl -sSL https://install.python-poetry.org | python3 -

# Or via pipx
pipx install poetry

Verify installation:

poetry --version

3. Run Setup

make setup-poetry

This will:

  • Check Poetry installation
  • Install required Poetry plugins (poetry-plugin-export)
  • Create .env from .env.example
  • Install all dependencies
  • Create virtual environment in .venv/

Installation Methods Compared

| Feature                                         | Poetry (Development)       | Standard venv (Runtime) |
|-------------------------------------------------|----------------------------|-------------------------|
| Platforms                                       | Windows, macOS, Linux      | Windows, macOS, Linux   |
| Install DataWagon                               | Yes                        | Yes                     |
| Run commands                                    | Yes                        | Yes                     |
| Modify application code                         | Yes                        | Yes                     |
| Development tools (mypy, black, flake8, pytest) | Yes                        | No                      |
| Add/modify dependencies                         | Yes                        | Request via issue       |
| Lock dependencies                               | Yes (poetry.lock)          | Use requirements.txt    |
| Installation time                               | ~2 min                     | ~30 sec                 |
| Disk space                                      | ~500MB                     | ~200MB                  |
| Use case                                        | Development, contributions | Running the app only    |

Recommendation:

  • Use standard venv if you just want to use DataWagon (simpler, faster, smaller)
  • Use Poetry if you're developing, contributing, or managing dependencies

Configuration (Both Options)

Edit .env with your settings (if it doesn't exist, copy it from .env.example):

# Local CSV source directory
DW_CSV_SOURCE_DIR=/path/to/your/csv/files

# Source configuration (keep default unless customizing)
DW_CSV_SOURCE_TOML=./datawagon-config.toml

# Google Cloud Storage settings
DW_GCS_PROJECT_ID=your-gcp-project-id
DW_GCS_BUCKET=your-bucket-name

# BigQuery settings (can also be set in datawagon-config.toml [bigquery] section)
# Environment variables take precedence over TOML configuration
DW_BQ_DATASET=your_dataset_name
# DW_BQ_STORAGE_PREFIX=caravan-versioned  # Optional: defaults to "caravan-versioned"

Alternatively, you can configure BigQuery settings in datawagon-config.toml:

# BigQuery configuration
[bigquery]
dataset = "your_dataset_name"
storage_prefix = "caravan-versioned"  # Optional: defaults to "caravan-versioned"

Configuration precedence for BigQuery settings: CLI flag > Environment variable > TOML config
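
For example, a sketch of how this precedence plays out for the dataset setting (the dataset names here are placeholders):

# datawagon-config.toml sets dataset = "dev_ds"
export DW_BQ_DATASET=prod_ds                          # env var overrides TOML
datawagon list-bigquery-tables                        # operates on prod_ds
datawagon drop-bigquery-tables --dataset staging_ds   # CLI flag overrides both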

Configure Google Cloud Credentials

# Authenticate with GCP
gcloud auth application-default login

# Or set service account key
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json

Verify Installation

source .venv/bin/activate
datawagon --help

Updating Existing Installation

With Poetry

To pull latest changes and update dependencies:

./update.sh

Without Poetry (Standard venv)

Windows

To pull latest changes and update dependencies:

update-venv.bat

Or manually:

git pull
.venv\Scripts\activate.bat
pip install -e . --upgrade

macOS / Linux

To pull latest changes and update dependencies:

./update-venv.sh

Or manually:

git pull
source .venv/bin/activate
pip install -e . --upgrade

Switching Installation Methods

From Poetry to Standard venv

rm -rf .venv
./setup-venv.sh
source .venv/bin/activate
datawagon --help

What changes:

  • Smaller: ~200MB vs ~500MB
  • No dev tools (mypy, black, flake8, pytest)
  • Faster setup: ~30 sec vs ~2 min
  • Cannot modify dependencies

From Standard venv to Poetry

curl -sSL https://install.python-poetry.org | python3 -
rm -rf .venv
make setup-poetry
source .venv/bin/activate
make test

What you gain:

  • Full development tooling
  • Ability to modify dependencies
  • Run pre-commit checks locally

Verifying Installation Type

make verify-install  # Shows installation type

Usage

Activate Virtual Environment

source .venv/bin/activate

Or run commands via Poetry:

poetry run datawagon --help

Available Commands

File Management:

# List files in local source directory
datawagon files-in-local-fs

# List files already in GCS bucket
datawagon files-in-storage

# Compare local files to bucket
datawagon compare-local-to-bucket

# Upload new files to GCS
datawagon upload-to-gcs

# Convert .csv.zip files to gzip (.csv.gz) format
datawagon file-zip-to-gzip

BigQuery External Tables:

# List existing external tables in BigQuery dataset
datawagon list-bigquery-tables

# Create external tables for GCS folders without tables
datawagon create-bigquery-tables

# Drop external tables (dry-run by default)
datawagon drop-bigquery-tables

# Actually execute deletion (requires confirmation)
datawagon drop-bigquery-tables --execute

BigQuery External Tables

After uploading files to GCS, you can create BigQuery external tables that reference the files directly. This allows you to query the CSV data using SQL without loading it into BigQuery storage.

List existing tables:

datawagon list-bigquery-tables

This displays:

  • Table names (with version suffixes, e.g., claim_raw_v1_1)
  • Creation timestamps
  • Partitioning status
  • Source URI patterns

Create missing tables:

datawagon create-bigquery-tables

This will:

  1. Scan GCS bucket for folders under the configured prefix (default: caravan-versioned)
  2. List existing BigQuery external tables
  3. Identify folders without corresponding tables
  4. Show a summary of tables to be created
  5. Prompt for confirmation
  6. Create external tables with:
    • Auto-detected schema from CSV headers
    • Hive partitioning on report_date column (for partitioned folders)
    • GZIP compression support
    • CSV format configuration
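
For reference, step 6 corresponds roughly to the following manual bq commands (a sketch; DataWagon automates this internally, and the bucket and table names here are placeholders):

# Build an external table definition with autodetected schema and Hive partitioning
bq mkdef --source_format=CSV --autodetect \
  --hive_partitioning_mode=AUTO \
  --hive_partitioning_source_uri_prefix=gs://your-bucket/caravan-versioned/claim_raw_v1-1 \
  "gs://your-bucket/caravan-versioned/claim_raw_v1-1/*" > table_def.json

# Create the external table from that definition
bq mk --external_table_definition=table_def.json your_dataset_name.claim_raw_v1_1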

Example table creation:

Storage folder: caravan-versioned/claim_raw_v1-1
→ BigQuery table: claim_raw_v1_1
→ Source URI: gs://bucket/caravan-versioned/claim_raw_v1-1/*
→ Partitioning: Yes (report_date)
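
Once created, the table definition (source URIs, partitioning, compression) can be inspected with the bq CLI, if installed:

# Show the external table configuration as JSON
bq show --format=prettyjson your_dataset_name.claim_raw_v1_1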

Querying external tables:

Once created, query the tables in BigQuery:

-- Query with partition filtering for efficiency
SELECT * FROM `project.dataset.claim_raw_v1_1`
WHERE report_date = '2023-06-30'
LIMIT 10;

-- Aggregate across all partitions
SELECT report_date, COUNT(*) as claim_count
FROM `project.dataset.claim_raw_v1_1`
GROUP BY report_date
ORDER BY report_date DESC;

Key benefits:

  • No data duplication (files stay in GCS, BigQuery reads directly)
  • Automatic schema detection (no manual schema definition)
  • Partition pruning for fast date-filtered queries
  • Multiple versions can coexist as separate tables

Configuration:

Control which folders are scanned with the DW_BQ_STORAGE_PREFIX setting:

# Only scan caravan-versioned folders (default)
DW_BQ_STORAGE_PREFIX=caravan-versioned

# Scan all folders (legacy behavior)
DW_BQ_STORAGE_PREFIX=""

# Scan a different prefix
DW_BQ_STORAGE_PREFIX=my-custom-folder

Drop tables:

# Dry-run: Show what would be deleted (safe, default)
datawagon drop-bigquery-tables

# Drop all tables in dataset (requires confirmation)
datawagon drop-bigquery-tables --execute

# Drop specific table
datawagon drop-bigquery-tables --execute --table-name claim_raw_v1_1

# Drop from specific dataset
datawagon drop-bigquery-tables --execute --dataset my_dataset

This command:

  • Defaults to dry-run mode (safe preview)
  • Requires --execute flag to actually delete
  • Shows all tables to be dropped before deletion
  • Requires explicit user confirmation
  • Only deletes table metadata (CSV files in GCS remain untouched)
  • Provides detailed progress and error reporting

Safety features:

  • Dry-run mode by default prevents accidental deletions
  • User must explicitly confirm before any deletion
  • Individual table errors don't stop batch operations
  • Comprehensive logging for audit trail

Chaining Commands

Commands can be chained together:

datawagon files-in-local-fs compare-local-to-bucket upload-to-gcs

Typical Workflow

When you have new files to upload:

cd /path/to/datawagon
source .venv/bin/activate
datawagon files-in-local-fs compare-local-to-bucket upload-to-gcs

This will:

  1. Scan local directory for valid CSV files
  2. Check for duplicates and invalid filenames
  3. Compare with files already in GCS bucket
  4. Show summary of new files to upload
  5. Prompt for confirmation
  6. Upload files with partitioning by report date

Complete workflow with BigQuery table creation:

datawagon upload-to-gcs create-bigquery-tables

This uploads new files and automatically creates any missing BigQuery external tables in one command chain.


Configuration

File Type Configuration

Edit datawagon-config.toml to define file types and patterns:

[file.youtube_claims]
select_file_name_base = "YouTube_*_claim_*"
exclude_file_name_base = "YouTube_*_claim_historical_*"
regex_pattern = "YouTube_(.+)_M_(\\d{8}|\\d{6})"
regex_group_names = ["content_owner", "file_date_key"]
storage_folder_name = "youtube_claims"
table_name = "claims"
table_append_or_replace = "append"

Configuration Fields:

  • select_file_name_base: Glob pattern to match files
  • exclude_file_name_base: Glob pattern to exclude files
  • regex_pattern: Regex to extract metadata from filename
  • regex_group_names: Named groups from regex (e.g., ["content_owner", "file_date_key"])
  • storage_folder_name: GCS destination folder
  • table_name: Table identifier for tracking
  • table_append_or_replace: Upload strategy (append or replace)

Special Metadata Fields

  • file_date_key: Auto-converts YYYYMMDD or YYYYMM to YYYY-MM-DD format for partitioning
  • content_owner: Typically extracted from filename for organization
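
As a worked example, applying the [file.youtube_claims] pattern above to a hypothetical filename:

# Filename: YouTube_BrandName_M_20240115_claim_raw_v1-1.csv.gz
# regex_pattern "YouTube_(.+)_M_(\d{8}|\d{6})" extracts:
#   content_owner = "BrandName"
#   file_date_key = "20240115"  ->  partition report_date=2024-01-15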

GCS Upload Structure

Files are uploaded with this structure:

gs://your-bucket/storage_folder_name/report_date=YYYY-MM-DD/filename.csv.gz

For versioned files, the version is automatically appended to the folder name:

gs://your-bucket/storage_folder_name_v1-1/report_date=YYYY-MM-DD/filename.csv.gz

Examples:

# Non-versioned file
gs://my-bucket/caravan-versioned/some_file/report_date=2024-01-15/file.csv.gz

# Versioned file (version automatically extracted from filename)
gs://my-bucket/caravan-versioned/claim_raw_v1-1/report_date=2024-01-15/YouTube_BrandName_M_20240115_claim_raw_v1-1.csv.gz

This ensures each file version has a stable folder path for BigQuery external table mapping.
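
To verify the layout after an upload, one option is listing the bucket with gsutil (if installed):

# Recursively list uploaded folders and their date partitions
gsutil ls -r gs://your-bucket/caravan-versioned/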


Development

Setup Development Environment

make setup
source .venv/bin/activate

Code Quality Tools

make type       # Type check with mypy
make isort      # Sort imports
make format     # Format code with black
make lint       # Lint with flake8
make test       # Run tests with pytest

Run All Checks

make pre-commit      # Run all quality checks
make pre-commit-fast # Run faster checks (type, lint, test only)

Update Dependencies

# Update all dependencies
make update

# Add new dependency
poetry add <package-name>

# Add dev dependency
poetry add --group dev <package-name>

# Regenerate requirements.txt
make requirements

Testing

# Run all tests
make test

# Run with coverage report
make test-cov

# Run specific test
poetry run pytest tests/file_utils_test.py -k test_group_by_base_name

Pre-commit Hooks

Pre-commit hooks run automatically on git commit. They will:

  • Run all code quality checks (type, isort, format, lint, test)
  • Verify requirements.txt is in sync with poetry.lock

To skip during development:

git commit --no-verify

Build Distribution

make build-app

Builds wheel and source distribution in dist/ directory.


Project Structure

datawagon/
├── datawagon/
│   ├── __main__.py              # CLI entry point
│   ├── main.py                  # Click CLI setup
│   ├── commands/                # Command implementations
│   │   ├── files_in_local_fs.py
│   │   ├── files_in_storage.py
│   │   ├── compare.py
│   │   ├── upload_to_storage.py
│   │   ├── file_zip_to_gzip.py
│   │   ├── list_bigquery_tables.py
│   │   └── create_bigquery_tables.py
│   ├── objects/                 # Core data models
│   │   ├── app_config.py
│   │   ├── source_config.py
│   │   ├── managed_file_metadata.py
│   │   ├── managed_file_scanner.py
│   │   └── bigquery_table_metadata.py
│   └── bucket/                  # GCS and BigQuery integration
│       ├── gcs_manager.py
│       └── bigquery_manager.py
├── tests/                       # Test suite
│   ├── test_bigquery_manager.py
│   ├── test_bigquery_table_metadata.py
│   ├── test_create_bigquery_tables.py
│   └── ...
├── pyproject.toml               # Poetry configuration
├── poetry.lock                  # Locked dependencies
├── datawagon-config.toml        # File type configuration
├── .env.example                 # Environment template
├── Makefile                     # Development commands
└── update.sh                    # Update script

Supported File Extensions

  • .csv - Uncompressed CSV
  • .csv.gz - Gzip compressed CSV
  • .csv.zip - Zip compressed CSV

Files starting with .~lock (e.g., LibreOffice lock files) are automatically excluded.


Troubleshooting

Poetry export command not found

Install the export plugin:

poetry self add poetry-plugin-export

Virtual environment not found

Recreate the environment:

rm -rf .venv
make setup

Requirements.txt out of sync

Regenerate from poetry.lock:

make requirements

Command not found after setup

If datawagon command is not found after running setup:

# Verify virtual environment activation
which python  # Should show /path/to/datawagon/.venv/bin/python

# If not in venv, activate it
source .venv/bin/activate

# Verify installation
datawagon --help

If still not working:

# Reinstall from scratch
rm -rf .venv
./setup-venv.sh
source .venv/bin/activate

Why don't I have pytest/mypy/black?

These are development tools; they are not included in the standard venv installation, which is runtime-only.

If you need development tools, use Poetry:

rm -rf .venv
make setup-poetry

If you just want to run tests occasionally, install manually:

source .venv/bin/activate
pip install pytest mypy black flake8

Can I develop without Poetry?

Yes, but it's not recommended:

  • Code changes: Edit files normally in either installation
  • Run application: Works the same in both
  • Run tests: Manually install dev dependencies (see above)
  • Modify dependencies: Not possible - must request via GitHub issue

For serious development or contributions, Poetry is strongly recommended.

How do I know which installation I have?

make verify-install

This will show:

  • Installation type (Poetry vs Standard venv)
  • Virtual environment location
  • Whether installation is healthy

Python version mismatch

Symptom: setup-venv.sh fails with "Python 3.9+ required".

Solution:

# Check your Python version
python3 --version

# If Python 3.8 or earlier, upgrade Python first
# On macOS with Homebrew:
brew install python@3.11

# On Ubuntu/Debian:
sudo apt update
sudo apt install python3.11

# Verify new version
python3.11 --version

# Create venv with specific Python version
python3.11 -m venv .venv
source .venv/bin/activate
pip install -e .

Common causes:

  • System Python is outdated
  • Multiple Python versions installed (using wrong one)
  • Virtual environment created with old Python

pip installation fails with "externally-managed-environment"

Symptom: pip install fails with:

error: externally-managed-environment

Solution: This is expected on some modern Linux distros (Debian 12+, Ubuntu 23.04+). The virtual environment approach handles this:

# Virtual environment bypasses external management
python3 -m venv .venv
source .venv/bin/activate
pip install -e .  # Now works inside venv

Don't: avoid pip --break-system-packages; overriding the system package manager can break your OS Python installation.

Shell script fails with "command not found: poetry"

Symptom: update.sh fails because Poetry not found.

This is expected:

  • update.sh is for Poetry users only
  • For non-Poetry users, use update-venv.sh instead

Solution:

# Non-Poetry users:
./update-venv.sh

# Poetry users - install Poetry first:
curl -sSL https://install.python-poetry.org | python3 -

Installation succeeds but datawagon command not found

Symptom: After ./setup-venv.sh completes, datawagon command not found.

Cause: Virtual environment not activated.

Solution:

# Activate virtual environment
source .venv/bin/activate

# Verify activation (should show .venv path)
which python

# Now datawagon should work
datawagon --help

To avoid: Add activation to your shell profile:

# Add to ~/.bashrc or ~/.zshrc
alias dw='cd /path/to/datawagon && source .venv/bin/activate'

setup-venv.sh hangs on "Installing dependencies"

Symptom: Script appears frozen during pip install -e ..

This is usually normal: First installation downloads ~100MB of packages. Can take 2-5 minutes on slow connections.

To monitor progress:

# Run with verbose output
rm -rf .venv
python3 -m venv .venv
source .venv/bin/activate
pip install -e . -v  # Verbose mode shows progress

If truly stuck (>10 minutes):

# Ctrl+C to cancel
# Check disk space
df -h .

# Check network
curl -I https://pypi.org/simple/

# Try with timeout
pip install -e . --timeout 60

Requirements file has merge conflict

Symptom: After git merge, requirements.txt has conflict markers:

<<<<<<< HEAD
package==1.0.0
=======
package==2.0.0
>>>>>>> branch

Solution: Never manually edit requirements files. Regenerate from poetry.lock:

# Don't manually resolve conflict
# Instead, resolve poetry.lock conflict first

git checkout --theirs poetry.lock  # Or --ours, depending on which you want

# Then regenerate requirements
make requirements

# Stage the regenerated files
git add requirements.txt requirements-dev.txt poetry.lock

Tests pass locally but fail in CI

Symptom: make test passes locally, but GitHub Actions fails.

Common causes:

  1. Missing dependency in poetry.lock: CI installs from lock file only

    # Add missing dependency
    poetry add missing-package
    poetry lock
    make requirements
    git add pyproject.toml poetry.lock requirements*.txt
    git commit
  2. Local vs CI Python version: CI tests on 3.9, 3.10, 3.11

    # Test with specific Python version locally (requires pyenv)
    pyenv install 3.9.18
    pyenv local 3.9.18
    rm -rf .venv
    make setup
    make test
  3. Environment variable differences: CI doesn't have your local .env

    # Check which env vars your code uses
    grep -r "os.environ" datawagon/
    
    # Ensure tests mock external dependencies
    # (GCS, BigQuery, etc.)

Google Cloud authentication errors

Ensure credentials are configured:

# Option 1: Application default credentials
gcloud auth application-default login

# Option 2: Service account key
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json

Import errors or missing dependencies

Ensure dependencies are installed:

poetry install

BigQuery dataset not found

Create the dataset before running BigQuery commands:

bq mk --dataset your-project-id:your_dataset_name

BigQuery authentication errors

Ensure you have BigQuery permissions:

# Authenticate with GCP
gcloud auth application-default login

# Verify you have BigQuery access
bq ls --project_id=your-project-id

Required IAM roles:

  • roles/bigquery.dataEditor - Create and manage tables
  • roles/storage.objectViewer - Read CSV files from GCS
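
Granting these roles to a service account can be done with gcloud; a sketch using a hypothetical service-account name:

# Grant table-management rights (repeat with roles/storage.objectViewer)
gcloud projects add-iam-policy-binding your-gcp-project-id \
  --member="serviceAccount:datawagon-sa@your-gcp-project-id.iam.gserviceaccount.com" \
  --role="roles/bigquery.dataEditor"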

Windows: Batch script fails with "Python is not recognized"

Symptom: setup-venv.bat fails with "'python' is not recognized as an internal or external command".

Solution: Python is not in your PATH.

REM Check if Python is installed
python --version

REM If command not found, add Python to PATH:
REM 1. Find Python installation (usually C:\Python39 or C:\Users\<user>\AppData\Local\Programs\Python\Python39)
REM 2. Add to PATH via System Properties > Environment Variables
REM 3. Or use Python Launcher (py command):

py --version
py -3.9 -m venv .venv

Common causes:

  • Python installed but not added to PATH during installation
  • Using wrong Python launcher (python vs py)
  • Multiple Python versions conflicting

Windows: Activate.bat fails with "not recognized"

Symptom: After setup, .venv\Scripts\activate.bat command not found.

Solution:

REM Check if .venv was created
dir .venv\Scripts

REM Activate using full path
.venv\Scripts\activate.bat

REM Or navigate first
cd .venv\Scripts
activate.bat
cd ..\..

Windows: PowerShell execution policy error

Symptom: Using PowerShell instead of cmd.exe, get "cannot be loaded because running scripts is disabled".

Solution: Our scripts are .bat files (cmd.exe), not .ps1 (PowerShell).

REM Use cmd.exe (Command Prompt), not PowerShell
REM Or in PowerShell, run cmd first:
cmd
setup-venv.bat

If you need PowerShell support, request it via GitHub issue.

Windows: Git not found in batch script

Symptom: update-venv.bat fails with "'git' is not recognized".

Solution: Install Git for Windows or use GitHub Desktop.

REM Install Git: https://git-scm.com/download/win
REM Verify installation:
git --version

REM Or use GitHub Desktop GUI instead of update script

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make changes
  4. Run make pre-commit to verify quality
  5. Commit changes
  6. Push and create pull request

Changelog

See CHANGELOG.md for a detailed history of changes and version releases.


License

MIT License - see LICENSE file for details


Support

For issues and questions, please open an issue on the GitHub repository.
