This project implements an OLAP (Online Analytical Processing) database system for voyla.world. The system extracts data from an OLTP (Online Transaction Processing) source and transforms it into a star schema optimized for analytical queries. The final deliverable includes an interactive dashboard showing user clusters based on content, place, and property appearance.
Key Components:
- OLAP Schema: Star schema with a central fact table and multiple dimension tables
- ETL Pipeline: Extract data from OLTP source and load into OLAP database using psycopg2
- Dashboard: Visualization of user clusters across content/place/property dimensions
- Python Version: 3.13+
- Package Manager:
uv(useuv add,uv syncinstead of pip) - Linting & Formatting:
ruff(useruff checkandruff format) - Testing:
pytestwith standard project layout - Database Driver:
psycopg2for PostgreSQL connections - Data Processing: pandas, numpy
```
215-project-part-2/
├── pipeline/              # ETL pipeline code
│   ├── extract.py         # Data extraction from OLTP source
│   ├── transform.py       # Data transformation logic
│   ├── load.py            # Data loading into OLAP database
│   └── __init__.py
├── sql/
│   ├── schema.sql         # OLAP schema definition (dimension & fact tables)
│   └── OLAP schema.jpg    # Visual schema diagram
├── tests/                 # Test files (pytest)
│   ├── test_pipeline.py
│   ├── test_transform.py
│   └── __init__.py
├── main.py                # Entry point / orchestrator
├── pyproject.toml         # Project dependencies and metadata
├── README.md              # User-facing documentation
└── AGENTS.md              # This file
```
Dimension Tables:
- `users`: User dimension with user_id
- `content`: Content dimension with content_id, likes, upload_time, comments, social_media_data
- `places`: Place dimension with place_id (from Google Maps/GenAI)
- `property`: Property dimension with property_id
Fact Table:
- Central fact table containing: user_id, content_id, place_id, property_id
- Links users, content, places, and properties together for analytical queries
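To illustrate how the fact table links the dimensions for analytical queries, here is a small pandas sketch; the sample rows and column names (`user_name`, `place_name`, `appearances`) are hypothetical, not part of the real schema:

```python
import pandas as pd

# Hypothetical sample data mirroring the star schema
fact = pd.DataFrame({
    "user_id": [1, 1, 2],
    "content_id": [10, 11, 10],
    "place_id": [100, 101, 100],
    "property_id": [1000, 1000, 1001],
})
users = pd.DataFrame({"user_id": [1, 2], "user_name": ["ana", "ben"]})
places = pd.DataFrame({"place_id": [100, 101], "place_name": ["Lisbon", "Porto"]})

# Join the fact table to two dimensions, then aggregate per user and place
enriched = fact.merge(users, on="user_id").merge(places, on="place_id")
counts = (
    enriched.groupby(["user_name", "place_name"])
    .size()
    .reset_index(name="appearances")
)
```

In the real OLAP database the equivalent joins run in SQL; the pandas version is only meant to show why every dimension key lives on the fact table.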
- Code Formatting: Run `ruff format` before committing:

  ```shell
  ruff format .
  ```

- Linting: Run `ruff check` to identify issues:

  ```shell
  ruff check .
  ruff check --fix .  # Auto-fix common issues
  ```
- Style Guidelines:
- Follow PEP 8 conventions (ruff enforces this)
- Use descriptive variable and function names
- Add docstrings to all functions and classes
- Maximum line length: 88 characters (ruff default)
- Use type hints for function arguments and return values
- Import Organization:
- Standard library imports first
- Third-party imports second (pandas, psycopg2, numpy)
- Local imports last
- Separate groups with blank lines
- Database Functions:
- All database connections should use context managers
- Always close cursors after use
- Include error handling for connection failures
- Example:

  ```python
  def get_data_from_source(connection_params: dict) -> pd.DataFrame:
      """Fetch data from OLTP source."""
      try:
          with psycopg2.connect(**connection_params) as conn:
              with conn.cursor() as cur:
                  cur.execute("SELECT * FROM source_table")
                  # Preserve column names so the DataFrame is usable downstream
                  columns = [desc[0] for desc in cur.description]
                  return pd.DataFrame(cur.fetchall(), columns=columns)
      except psycopg2.Error as e:
          raise ValueError(f"Database error: {e}") from e
  ```
- ETL Functions:
- Separate extract, transform, and load logic into distinct modules
- Each function should have a single responsibility
- Pass DataFrames between functions (not raw database connections)
- Use pandas operations for data transformation
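A minimal sketch of a single-responsibility transform that takes and returns a DataFrame; the column names and the derived `like_rate` feature are hypothetical examples, not the project's actual transformation logic:

```python
import pandas as pd

def transform_content(raw: pd.DataFrame) -> pd.DataFrame:
    """Normalize raw content rows into the content dimension shape."""
    out = raw.copy()
    # Parse timestamps and derive a simple engagement feature
    out["upload_time"] = pd.to_datetime(out["upload_time"])
    out["like_rate"] = out["likes"] / out["likes"].sum()
    return out[["content_id", "upload_time", "likes", "like_rate"]]

# Hypothetical input: what extract.py might hand over
raw = pd.DataFrame({
    "content_id": [10, 11],
    "upload_time": ["2024-01-01", "2024-01-02"],
    "likes": [30, 10],
})
content_dim = transform_content(raw)
```

Because the function only touches DataFrames, it can be unit-tested with in-memory sample data and needs no database connection.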
- Error Handling:
- Catch specific exceptions (not generic `Exception`)
- Log errors with context before raising
- Provide meaningful error messages
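A small sketch of this pattern (specific exception, contextual log before raising); the `require_columns` helper and its column names are hypothetical:

```python
import logging

logger = logging.getLogger("pipeline")

def require_columns(df_columns: list[str], needed: set[str]) -> None:
    """Validate that a DataFrame has the columns a transform expects."""
    missing = needed - set(df_columns)
    if missing:
        # Log context before raising so pipeline failures are traceable
        logger.error("Transform aborted; missing columns: %s", sorted(missing))
        raise KeyError(f"Missing required columns: {sorted(missing)}")

try:
    require_columns(["user_id"], {"user_id", "content_id"})
except KeyError as e:  # specific exception, never bare Exception
    message = str(e)
```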
- Test Structure:
- Place tests in the `tests/` directory with a `test_` prefix
- Use pytest fixtures for setup/teardown
- Test file names should mirror module names (e.g., `test_transform.py` for `pipeline/transform.py`)
- Test Coverage:
- Test extract functions with mocked database connections
- Test transform functions with sample data
- Test load functions with test database
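A sketch of testing an extract function against a mocked connection; the `extract_rows` function and its query are illustrative stand-ins, with `unittest.mock.MagicMock` replacing a live PostgreSQL connection:

```python
from unittest.mock import MagicMock

import pandas as pd

def extract_rows(conn) -> pd.DataFrame:
    """Fetch rows using an injected connection (easy to mock in tests)."""
    with conn.cursor() as cur:
        cur.execute("SELECT user_id FROM users")
        return pd.DataFrame(cur.fetchall(), columns=["user_id"])

def test_extract_rows_with_mocked_connection():
    conn = MagicMock()
    # conn.cursor() is used as a context manager, so stub __enter__
    cursor = conn.cursor.return_value.__enter__.return_value
    cursor.fetchall.return_value = [(1,), (2,)]
    df = extract_rows(conn)
    assert df["user_id"].tolist() == [1, 2]

test_extract_rows_with_mocked_connection()
```

Injecting the connection (rather than creating it inside the function) is what makes this mockable without patching module globals.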
- Running Tests:

  ```shell
  pytest                          # Run all tests
  pytest -v                       # Verbose output
  pytest tests/test_transform.py  # Run specific test file
  ```
- Module Docstrings:

  ```python
  """Extract module for ETL pipeline.

  Handles connection to OLTP source and retrieves raw data for transformation.
  """
  ```
- Function Docstrings:

  ```python
  def extract_users(connection_params: dict) -> pd.DataFrame:
      """Extract user data from OLTP source.

      Args:
          connection_params: Dictionary with host, user, password, dbname

      Returns:
          DataFrame with columns: user_id, [user columns]

      Raises:
          ValueError: If database connection fails
      """
  ```
- Comments:
- Use comments to explain why, not what
- Keep comments updated with code changes
```shell
uv add <package>  # Add new dependency
uv sync           # Install dependencies from lock file
uv lock           # Update lock file
```

```shell
ruff format .       # Format all Python files
ruff check . --fix  # Lint and auto-fix
```

```shell
pytest        # Run all tests
pytest -v     # Verbose mode
pytest --cov  # Generate coverage report (if pytest-cov installed)
```

```shell
python main.py  # Run full ETL pipeline
```

- Create a branch for each feature/fix: `git checkout -b feature/descriptive-name`
- Commit with clear messages: `git commit -m "Add ETL extraction logic"`
- Before pushing, ensure:
  - Code is formatted: `ruff format .`
  - Code passes linting: `ruff check .`
  - All tests pass: `pytest`
- Push and create pull request for team review
Connection parameters should be stored in environment variables or a configuration file:

```python
import os

DB_PARAMS = {
    "host": os.getenv("DB_HOST", "localhost"),
    "user": os.getenv("DB_USER", "postgres"),
    "password": os.getenv("DB_PASSWORD"),
    "dbname": os.getenv("DB_NAME", "olap_db"),
}
```

- Use pandas `head()`, `info()`, and `describe()` to inspect DataFrames during development
- Test transformations on small subsets before running on full datasets
- Log progress during long-running pipeline operations
- Keep source data separate from OLAP data (never modify source directly)
Refer to the project documentation in README.md, or check the schema diagram in `sql/OLAP schema.jpg`.