Automated loading of YouTube Analytics CSV files into Google Cloud Storage buckets with BigQuery external table management.
DataWagon processes compressed YouTube Analytics files (.csv.gz, .csv.zip), validates filenames against regex patterns, extracts metadata, and uploads them to Google Cloud Storage with automatic partitioning by report date. It also manages BigQuery external tables that reference the uploaded files, enabling direct SQL queries without data duplication.
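The regex-based validation and metadata extraction can be sketched as follows. This is an illustrative sketch, not DataWagon's actual API: the pattern and group names shown are of the kind defined in datawagon-config.toml, mapped positionally to the regex capture groups.

```python
import re

# Illustrative pattern of the kind defined in datawagon-config.toml;
# group names are mapped positionally to regex capture groups.
PATTERN = re.compile(r"YouTube_(.+)_M_(\d{8}|\d{6})")
GROUP_NAMES = ["content_owner", "file_date_key"]

def extract_metadata(filename: str) -> dict:
    """Validate a filename against the pattern and extract metadata."""
    match = PATTERN.search(filename)
    if match is None:
        raise ValueError(f"Filename does not match pattern: {filename}")
    return dict(zip(GROUP_NAMES, match.groups()))

print(extract_metadata("YouTube_BrandName_M_20240115_claim_raw_v1-1.csv.gz"))
# {'content_owner': 'BrandName', 'file_date_key': '20240115'}
```

Files that fail validation are reported to the user rather than uploaded.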
This project was built to replace an existing process which used a bash script to load uncompressed files into a PostgreSQL database. The original script could not handle compressed files, required manually moving files in and out of an import directory, had no mechanism to check for duplicate files, and required code modification and manual table creation to add new types of files.
- Automatic compressed file handling (.csv.gz, .csv.zip)
- Duplicate file detection and prevention
- Configurable file type patterns via TOML configuration
- GCS bucket integration with automatic partitioning
- Metadata extraction from filenames using regex
- Version-based folder organization for BigQuery external table mapping
- BigQuery external table management - create and list tables that reference GCS files
- Automatic schema detection from CSV headers
- Hive partitioning support for efficient date-based queries
- Comprehensive validation and user feedback
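Schema detection starts from the header row of the compressed CSV. A minimal stdlib sketch of reading that header from a .csv.gz (illustrative only; for external tables DataWagon delegates the actual schema autodetection to BigQuery):

```python
import csv
import gzip
import io

def read_csv_header(source) -> list[str]:
    """Return column names from the first row of a gzip-compressed CSV.

    `source` may be a file path or an open binary file object.
    """
    with gzip.open(source, mode="rt", newline="") as handle:
        return next(csv.reader(handle))

# Demo with an in-memory buffer standing in for a real .csv.gz file.
buffer = io.BytesIO()
with gzip.open(buffer, mode="wt", newline="") as out:
    out.write("report_date,video_id,views\n2024-01-15,abc123,42\n")
buffer.seek(0)
print(read_csv_header(buffer))  # ['report_date', 'video_id', 'views']
```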
- Python 3.9 or higher (3.12 recommended)
- Google Cloud Platform account with Storage and BigQuery access
- Google Cloud credentials configured locally
Optional:
- Poetry 2.0 or higher (for dependency management)
For users who want to use DataWagon without Poetry, choose your operating system:
git clone https://github.com/jtmcn/datawagon.git
cd datawagon
setup-venv.bat
.venv\Scripts\activate.bat
datawagon --help
This creates a standard Python virtual environment and installs DataWagon from source.
git clone https://github.com/jtmcn/datawagon.git
cd datawagon
./setup-venv.sh
source .venv/bin/activate
datawagon --help
This creates a standard Python virtual environment and installs DataWagon from source.
Poetry is recommended if you plan to modify dependencies or contribute to development.
git clone https://github.com/jtmcn/datawagon.git
cd datawagon
If you don't have Poetry installed:
# macOS/Linux
curl -sSL https://install.python-poetry.org | python3 -
# Or via pipx
pipx install poetry
Verify installation:
poetry --version
Then run the setup target:
make setup-poetry
This will:
- Check Poetry installation
- Install required Poetry plugins (poetry-plugin-export)
- Create .env from .env.example
- Install all dependencies
- Create virtual environment in .venv/
| Feature | Poetry (Development) | Standard venv (Runtime) |
|---|---|---|
| Platforms | Windows, macOS, Linux | Windows, macOS, Linux |
| Install DataWagon | ✓ | ✓ |
| Run commands | ✓ | ✓ |
| Modify application code | ✓ | ✓ |
| Development tools (mypy, black, flake8, pytest) | ✓ | ✗ |
| Add/modify dependencies | ✓ | Request via issue |
| Lock dependencies | ✓ | Use requirements.txt |
| Installation time | ~2 min | ~30 sec |
| Disk space | ~500MB | ~200MB |
| Use case | Development, contributions | Running the app only |
Recommendation:
- Use standard venv if you just want to use DataWagon (simpler, faster, smaller)
- Use Poetry if you're developing, contributing, or managing dependencies
Edit .env with your settings:
# Local CSV source directory
DW_CSV_SOURCE_DIR=/path/to/your/csv/files
# Source configuration (keep default unless customizing)
DW_CSV_SOURCE_TOML=./datawagon-config.toml
# Google Cloud Storage settings
DW_GCS_PROJECT_ID=your-gcp-project-id
DW_GCS_BUCKET=your-bucket-name
# BigQuery settings (can also be set in datawagon-config.toml [bigquery] section)
# Environment variables take precedence over TOML configuration
DW_BQ_DATASET=your_dataset_name
# DW_BQ_STORAGE_PREFIX=caravan-versioned # Optional: defaults to "caravan-versioned"
Alternatively, you can configure BigQuery settings in datawagon-config.toml:
# BigQuery configuration
[bigquery]
dataset = "your_dataset_name"
storage_prefix = "caravan-versioned" # Optional: defaults to "caravan-versioned"
Configuration precedence for BigQuery settings: CLI flag > Environment variable > TOML config.
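That precedence rule can be sketched as a small resolver. The helper name and signature below are illustrative, not DataWagon's actual API:

```python
import os

def resolve_setting(cli_value, env_var: str, toml_value, default=None):
    """Pick a setting using the documented precedence:
    CLI flag > environment variable > TOML config > default."""
    if cli_value is not None:
        return cli_value
    env_value = os.environ.get(env_var)
    if env_value:
        return env_value
    if toml_value is not None:
        return toml_value
    return default

os.environ["DW_BQ_DATASET"] = "env_dataset"
toml_config = {"bigquery": {"dataset": "toml_dataset"}}

# No CLI flag given, so the environment variable wins over TOML.
print(resolve_setting(None, "DW_BQ_DATASET", toml_config["bigquery"]["dataset"]))
# env_dataset
```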
# Authenticate with GCP
gcloud auth application-default login
# Or set service account key
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json
source .venv/bin/activate
datawagon --help
To pull latest changes and update dependencies:
./update.sh
To pull latest changes and update dependencies:
update-venv.bat
Or manually:
git pull
.venv\Scripts\activate.bat
pip install -e . --upgrade
To pull latest changes and update dependencies:
./update-venv.sh
Or manually:
git pull
source .venv/bin/activate
pip install -e . --upgrade
To switch to a standard venv installation:
rm -rf .venv
./setup-venv.sh
source .venv/bin/activate
datawagon --help
What changes:
- Smaller: ~200MB vs ~500MB
- No dev tools (mypy, black, flake8, pytest)
- Faster setup: ~30 sec vs ~2 min
- Cannot modify dependencies
To switch to a Poetry installation:
curl -sSL https://install.python-poetry.org | python3 -
rm -rf .venv
make setup-poetry
source .venv/bin/activate
make test
What you gain:
- Full development tooling
- Ability to modify dependencies
- Run pre-commit checks locally
make verify-install # Shows installation type
Activate the virtual environment:
source .venv/bin/activate
Or run commands via Poetry:
poetry run datawagon --help
File Management:
# List files in local source directory
datawagon files-in-local-fs
# List files already in GCS bucket
datawagon files-in-storage
# Compare local files to bucket
datawagon compare-local-to-bucket
# Upload new files to GCS
datawagon upload-to-gcs
# Convert .zip files to .gzip format
datawagon file-zip-to-gzip
BigQuery External Tables:
# List existing external tables in BigQuery dataset
datawagon list-bigquery-tables
# Create external tables for GCS folders without tables
datawagon create-bigquery-tables
# Drop external tables (dry-run by default)
datawagon drop-bigquery-tables
# Actually execute deletion (requires confirmation)
datawagon drop-bigquery-tables --execute
After uploading files to GCS, you can create BigQuery external tables that reference the files directly. This allows you to query the CSV data using SQL without loading it into BigQuery storage.
List existing tables:
datawagon list-bigquery-tables
This displays:
- Table names (with version suffixes, e.g., claim_raw_v1_1)
- Creation timestamps
- Partitioning status
- Source URI patterns
Create missing tables:
datawagon create-bigquery-tables
This will:
- Scan GCS bucket for folders under the configured prefix (default: caravan-versioned)
- List existing BigQuery external tables
- Identify folders without corresponding tables
- Show a summary of tables to be created
- Prompt for confirmation
- Create external tables with:
  - Auto-detected schema from CSV headers
  - Hive partitioning on the report_date column (for partitioned folders)
  - GZIP compression support
  - CSV format configuration
Example table creation:
Storage folder: caravan-versioned/claim_raw_v1-1
→ BigQuery table: claim_raw_v1_1
→ Source URI: gs://bucket/caravan-versioned/claim_raw_v1-1/*
→ Partitioning: Yes (report_date)
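The folder-to-table mapping in this example can be sketched as below. BigQuery table names may not contain hyphens, which is presumably why claim_raw_v1-1 becomes claim_raw_v1_1; the function name here is illustrative, not DataWagon's actual API:

```python
def folder_to_table_name(folder_path: str) -> str:
    """Derive a BigQuery-legal table name from a GCS folder path.

    BigQuery table names may not contain hyphens, so the "v1-1"
    version suffix becomes "v1_1". (Illustrative sketch; the actual
    mapping is implemented in DataWagon's BigQuery manager.)
    """
    return folder_path.rstrip("/").split("/")[-1].replace("-", "_")

folder = "caravan-versioned/claim_raw_v1-1"
print(folder_to_table_name(folder))  # claim_raw_v1_1
print(f"gs://bucket/{folder}/*")     # gs://bucket/caravan-versioned/claim_raw_v1-1/*
```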
Querying external tables:
Once created, query the tables in BigQuery:
-- Query with partition filtering for efficiency
SELECT * FROM `project.dataset.claim_raw_v1_1`
WHERE report_date = '2023-06-30'
LIMIT 10;
-- Aggregate across all partitions
SELECT report_date, COUNT(*) as claim_count
FROM `project.dataset.claim_raw_v1_1`
GROUP BY report_date
ORDER BY report_date DESC;
Key benefits:
- No data duplication (files stay in GCS, BigQuery reads directly)
- Automatic schema detection (no manual schema definition)
- Partition pruning for fast date-filtered queries
- Multiple versions can coexist as separate tables
Configuration:
Control which folders are scanned with the DW_BQ_STORAGE_PREFIX setting:
# Only scan caravan-versioned folders (default)
DW_BQ_STORAGE_PREFIX=caravan-versioned
# Scan all folders (legacy behavior)
DW_BQ_STORAGE_PREFIX=""
# Scan a different prefix
DW_BQ_STORAGE_PREFIX=my-custom-folder
Drop tables:
# Dry-run: Show what would be deleted (safe, default)
datawagon drop-bigquery-tables
# Drop all tables in dataset (requires confirmation)
datawagon drop-bigquery-tables --execute
# Drop specific table
datawagon drop-bigquery-tables --execute --table-name claim_raw_v1_1
# Drop from specific dataset
datawagon drop-bigquery-tables --execute --dataset my_dataset
This command:
- Defaults to dry-run mode (safe preview)
- Requires the --execute flag to actually delete
- Shows all tables to be dropped before deletion
- Requires explicit user confirmation
- Only deletes table metadata (CSV files in GCS remain untouched)
- Provides detailed progress and error reporting
Safety features:
- Dry-run mode by default prevents accidental deletions
- User must explicitly confirm before any deletion
- Individual table errors don't stop batch operations
- Comprehensive logging for audit trail
Commands can be chained together:
datawagon files-in-local-fs compare-local-to-bucket upload-to-gcs
When you have new files to upload:
cd ~/Code/datawagon
source .venv/bin/activate
datawagon files-in-local-fs compare-local-to-bucket upload-to-gcs
This will:
- Scan local directory for valid CSV files
- Check for duplicates and invalid filenames
- Compare with files already in GCS bucket
- Show summary of new files to upload
- Prompt for confirmation
- Upload files with partitioning by report date
Complete workflow with BigQuery table creation:
datawagon upload-to-gcs create-bigquery-tables
This uploads new files and automatically creates any missing BigQuery external tables in one command chain.
Edit datawagon-config.toml to define file types and patterns:
[file.youtube_claims]
select_file_name_base = "YouTube_*_claim_*"
exclude_file_name_base = "YouTube_*_claim_historical_*"
regex_pattern = "YouTube_(.+)_M_(\\d{8}|\\d{6})"
regex_group_names = ["content_owner", "file_date_key"]
storage_folder_name = "youtube_claims"
table_name = "claims"
table_append_or_replace = "append"
Configuration Fields:
- select_file_name_base: Glob pattern to match files
- exclude_file_name_base: Glob pattern to exclude files
- regex_pattern: Regex to extract metadata from filename
- regex_group_names: Named groups from regex (e.g., ["content_owner", "file_date_key"])
- storage_folder_name: GCS destination folder
- table_name: Table identifier for tracking
- table_append_or_replace: Upload strategy (append or replace)
Special group names:
- file_date_key: Auto-converts YYYYMMDD or YYYYMM to YYYY-MM-DD format for partitioning
- content_owner: Typically extracted from filename for organization
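The file_date_key conversion can be sketched as below. Note the YYYYMM case: mapping it to the month's last day is an assumption made here to match the month-end report_date values in the query examples, not a rule confirmed by the source.

```python
import calendar
from datetime import datetime

def file_date_key_to_report_date(key: str) -> str:
    """Convert a filename date key to a YYYY-MM-DD partition value.

    YYYYMM keys are mapped here to the last day of the month; this
    is an assumption, not DataWagon's documented behavior.
    """
    if len(key) == 8:
        return datetime.strptime(key, "%Y%m%d").strftime("%Y-%m-%d")
    if len(key) == 6:
        date = datetime.strptime(key, "%Y%m")
        last_day = calendar.monthrange(date.year, date.month)[1]
        return f"{date.year:04d}-{date.month:02d}-{last_day:02d}"
    raise ValueError(f"Unexpected file_date_key: {key}")

print(file_date_key_to_report_date("20240115"))  # 2024-01-15
print(file_date_key_to_report_date("202306"))    # 2023-06-30
```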
Files are uploaded with this structure:
gs://your-bucket/storage_folder_name/report_date=YYYY-MM-DD/filename.csv.gz
For versioned files, the version is automatically appended to the folder name:
gs://your-bucket/storage_folder_name_v1-1/report_date=YYYY-MM-DD/filename.csv.gz
Examples:
# Non-versioned file
gs://my-bucket/caravan-versioned/some_file/report_date=2024-01-15/file.csv.gz
# Versioned file (version automatically extracted from filename)
gs://my-bucket/caravan-versioned/claim_raw_v1-1/report_date=2024-01-15/YouTube_BrandName_M_20240115_claim_raw_v1-1.csv.gz
This ensures each file version has a stable folder path for BigQuery external table mapping.
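The path construction described above can be sketched as follows. This is an illustrative helper, not DataWagon's actual upload code; the version parameter stands for the suffix parsed from the filename.

```python
from typing import Optional

def build_object_path(storage_folder: str, report_date: str,
                      filename: str, version: Optional[str] = None) -> str:
    """Assemble the partitioned GCS object path shown above.

    Illustrative sketch only; the real logic lives in DataWagon's
    upload code. `version` is the suffix parsed from the filename.
    """
    folder = f"{storage_folder}_{version}" if version else storage_folder
    return f"{folder}/report_date={report_date}/{filename}"

name = "YouTube_BrandName_M_20240115_claim_raw_v1-1.csv.gz"
print(build_object_path("claim_raw", "2024-01-15", name, version="v1-1"))
# claim_raw_v1-1/report_date=2024-01-15/YouTube_BrandName_M_20240115_claim_raw_v1-1.csv.gz
```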
make setup
source .venv/bin/activate
make type # Type check with mypy
make isort # Sort imports
make format # Format code with black
make lint # Lint with flake8
make test # Run tests with pytest
make pre-commit # Run all quality checks
make pre-commit-fast # Run faster checks (type, lint, test only)
# Update all dependencies
make update
# Add new dependency
poetry add <package-name>
# Add dev dependency
poetry add --group dev <package-name>
# Regenerate requirements.txt
make requirements
# Run all tests
make test
# Run with coverage report
make test-cov
# Run specific test
poetry run pytest tests/file_utils_test.py -k test_group_by_base_name
Pre-commit hooks run automatically on git commit. They will:
- Run all code quality checks (type, isort, format, lint, test)
- Verify requirements.txt is in sync with poetry.lock
To skip during development:
git commit --no-verify
To build the application:
make build-app
Builds wheel and source distribution in the dist/ directory.
datawagon/
├── datawagon/
│ ├── __main__.py # CLI entry point
│ ├── main.py # Click CLI setup
│ ├── commands/ # Command implementations
│ │ ├── files_in_local_fs.py
│ │ ├── files_in_storage.py
│ │ ├── compare.py
│ │ ├── upload_to_storage.py
│ │ ├── file_zip_to_gzip.py
│ │ ├── list_bigquery_tables.py
│ │ └── create_bigquery_tables.py
│ ├── objects/ # Core data models
│ │ ├── app_config.py
│ │ ├── source_config.py
│ │ ├── managed_file_metadata.py
│ │ ├── managed_file_scanner.py
│ │ └── bigquery_table_metadata.py
│ └── bucket/ # GCS and BigQuery integration
│ ├── gcs_manager.py
│ └── bigquery_manager.py
├── tests/ # Test suite
│ ├── test_bigquery_manager.py
│ ├── test_bigquery_table_metadata.py
│ ├── test_create_bigquery_tables.py
│ └── ...
├── pyproject.toml # Poetry configuration
├── poetry.lock # Locked dependencies
├── datawagon-config.toml # File type configuration
├── .env.example # Environment template
├── Makefile # Development commands
└── update.sh # Update script
- .csv - Uncompressed CSV
- .csv.gz - Gzip compressed CSV
- .csv.zip - Zip compressed CSV
Files starting with .~lock are automatically excluded.
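The filtering rules described here (supported extensions, lock-file exclusion, and the select/exclude globs from datawagon-config.toml) can be sketched with stdlib fnmatch. The function name is illustrative, not DataWagon's actual scanner API:

```python
import fnmatch

SUPPORTED_SUFFIXES = (".csv", ".csv.gz", ".csv.zip")

def is_candidate_file(name: str, select: str, exclude: str = "") -> bool:
    """Apply the documented filtering rules: supported extension,
    not a lock file, matches the select glob, and does not match
    the exclude glob. (Illustrative sketch.)"""
    if name.startswith(".~lock"):
        return False
    if not name.endswith(SUPPORTED_SUFFIXES):
        return False
    if exclude and fnmatch.fnmatch(name, exclude):
        return False
    return fnmatch.fnmatch(name, select)

select = "YouTube_*_claim_*"
exclude = "YouTube_*_claim_historical_*"
print(is_candidate_file("YouTube_Brand_M_20240115_claim_raw.csv.gz", select, exclude))            # True
print(is_candidate_file("YouTube_Brand_M_20240115_claim_historical_v1.csv.gz", select, exclude))  # False
print(is_candidate_file(".~lock.YouTube_Brand_claim.csv#", select, exclude))                      # False
```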
Install the export plugin:
poetry self add poetry-plugin-export
Recreate the environment:
rm -rf .venv
make setup
Regenerate from poetry.lock:
make requirements
If the datawagon command is not found after running setup:
# Verify virtual environment activation
which python # Should show /path/to/datawagon/.venv/bin/python
# If not in venv, activate it
source .venv/bin/activate
# Verify installation
datawagon --help
If still not working:
# Reinstall from scratch
rm -rf .venv
./setup-venv.sh
source .venv/bin/activate
These are development tools, not included in the standard venv installation (runtime-only).
If you need development tools, use Poetry:
rm -rf .venv
make setup-poetry
If you just want to run tests occasionally, install the tools manually:
source .venv/bin/activate
pip install pytest mypy black flake8
Developing with a standard venv is possible, but not recommended:
- Code changes: Edit files normally in either installation
- Run application: Works the same in both
- Run tests: Manually install dev dependencies (see above)
- Modify dependencies: Not possible - must request via GitHub issue
For serious development or contributions, Poetry is strongly recommended.
make verify-install
This will show:
- Installation type (Poetry vs Standard venv)
- Virtual environment location
- Whether installation is healthy
Symptom: setup-venv.sh fails with "Python 3.9+ required".
Solution:
# Check your Python version
python3 --version
# If Python 3.8 or earlier, upgrade Python first
# On macOS with Homebrew:
brew install python@3.11
# On Ubuntu/Debian:
sudo apt update
sudo apt install python3.11
# Verify new version
python3.11 --version
# Create venv with specific Python version
python3.11 -m venv .venv
source .venv/bin/activate
pip install -e .
Common causes:
- System Python is outdated
- Multiple Python versions installed (using wrong one)
- Virtual environment created with old Python
Symptom: pip install fails with:
error: externally-managed-environment
Solution: This is expected on some modern Linux distros (Debian 12+, Ubuntu 23.04+). The virtual environment approach handles this:
# Virtual environment bypasses external management
python3 -m venv .venv
source .venv/bin/activate
pip install -e . # Now works inside venv
Don't: Never use pip --break-system-packages - this is dangerous.
Symptom: update.sh fails because Poetry not found.
This is expected:
- update.sh is for Poetry users only
- For non-Poetry users, use update-venv.sh instead
Solution:
# Non-Poetry users:
./update-venv.sh
# Poetry users - install Poetry first:
curl -sSL https://install.python-poetry.org | python3 -
Symptom: After ./setup-venv.sh completes, the datawagon command is not found.
Cause: Virtual environment not activated.
Solution:
# Activate virtual environment
source .venv/bin/activate
# Verify activation (should show .venv path)
which python
# Now datawagon should work
datawagon --help
To avoid this, add an activation alias to your shell profile:
# Add to ~/.bashrc or ~/.zshrc
alias dw='cd /path/to/datawagon && source .venv/bin/activate'
Symptom: Script appears frozen during pip install -e .
This is usually normal: First installation downloads ~100MB of packages. Can take 2-5 minutes on slow connections.
To monitor progress:
# Run with verbose output
rm -rf .venv
python3 -m venv .venv
source .venv/bin/activate
pip install -e . -v # Verbose mode shows progress
If truly stuck (>10 minutes):
# Ctrl+C to cancel
# Check disk space
df -h .
# Check network
curl -I https://pypi.org/simple/
# Try with timeout
pip install -e . --timeout 60
Symptom: After a git merge, requirements.txt has conflict markers:
<<<<<<< HEAD
package==1.0.0
=======
package==2.0.0
>>>>>>> branch
Solution: Never manually edit requirements files. Regenerate from poetry.lock:
# Don't manually resolve conflict
# Instead, resolve poetry.lock conflict first
git checkout --theirs poetry.lock # Or --ours, depending on which you want
# Then regenerate requirements
make requirements
# Stage the regenerated files
git add requirements.txt requirements-dev.txt poetry.lock
Symptom: make test passes locally, but GitHub Actions fails.
Common causes:
- Missing dependency in poetry.lock: CI installs from the lock file only
  # Add the missing dependency
  poetry add missing-package
  poetry lock
  make requirements
  git add pyproject.toml poetry.lock requirements*.txt
  git commit
- Local vs CI Python version: CI tests on 3.9, 3.10, 3.11
  # Test with a specific Python version locally (requires pyenv)
  pyenv install 3.9.18
  pyenv local 3.9.18
  rm -rf .venv
  make setup
  make test
- Environment variable differences: CI doesn't have your local .env
  # Check which env vars your code uses
  grep -r "os.environ" datawagon/
  # Ensure tests mock external dependencies (GCS, BigQuery, etc.)
Ensure credentials are configured:
# Option 1: Application default credentials
gcloud auth application-default login
# Option 2: Service account key
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json
Ensure dependencies are installed:
poetry install
Create the dataset before running BigQuery commands:
bq mk --dataset your-project-id:your_dataset_name
Ensure you have BigQuery permissions:
# Authenticate with GCP
gcloud auth application-default login
# Verify you have BigQuery access
bq ls --project_id=your-project-id
Required IAM roles:
- roles/bigquery.dataEditor - Create and manage tables
- roles/storage.objectViewer - Read CSV files from GCS
Symptom: setup-venv.bat fails with "'python' is not recognized as an internal or external command".
Solution: Python is not in your PATH.
REM Check if Python is installed
python --version
REM If command not found, add Python to PATH:
REM 1. Find Python installation (usually C:\Python39 or C:\Users\<user>\AppData\Local\Programs\Python\Python39)
REM 2. Add to PATH via System Properties > Environment Variables
REM 3. Or use Python Launcher (py command):
py --version
py -3.9 -m venv .venv
Common causes:
- Python installed but not added to PATH during installation
- Using wrong Python launcher (python vs py)
- Multiple Python versions conflicting
Symptom: After setup, .venv\Scripts\activate.bat command not found.
Solution:
REM Check if .venv was created
dir .venv\Scripts
REM Activate using full path
.venv\Scripts\activate.bat
REM Or navigate first
cd .venv\Scripts
activate.bat
cd ..\..
Symptom: Using PowerShell instead of cmd.exe, you get "cannot be loaded because running scripts is disabled".
Solution: Our scripts are .bat files (cmd.exe), not .ps1 (PowerShell).
REM Use cmd.exe (Command Prompt), not PowerShell
REM Or in PowerShell, run cmd first:
cmd
setup-venv.bat
If you need PowerShell support, request it via GitHub issue.
Symptom: update-venv.bat fails with "'git' is not recognized".
Solution: Install Git for Windows or use GitHub Desktop.
REM Install Git: https://git-scm.com/download/win
REM Verify installation:
git --version
REM Or use GitHub Desktop GUI instead of the update script
To contribute:
- Fork the repository
- Create a feature branch
- Make changes
- Run make pre-commit to verify quality
- Commit changes
- Push and create pull request
See CHANGELOG.md for a detailed history of changes and version releases.
MIT License - see LICENSE file for details
For issues and questions:
- GitHub Issues: https://github.com/jtmcn/datawagon/issues
- Email: jtmcn.dev@gmail.com