An OLAP (Online Analytical Processing) database implementation for voyla.world that enables analytical queries across user behavior, content, places, and properties.
This project implements a complete data warehouse solution that:
- Extracts data from an OLTP (Online Transaction Processing) source database
- Transforms the data into a star schema optimized for analytics
- Loads the processed data into an OLAP PostgreSQL database
- Visualizes insights via a dashboard showing user clusters based on content, place, and property dimensions
The OLAP schema uses a star schema design with:
- Fact Table: Central table linking users, content, places, and properties
- Dimension Tables:
  - `users`: User information
  - `content`: Post/content data (likes, upload time, comments, social media metrics)
  - `places`: Geographic locations (generated via GenAI and the Google Maps API)
  - `property`: Property/listing information
For a visual overview, see sql/OLAP schema.jpg
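As a rough illustration of the star-schema shape (the authoritative definition is `sql/schema.sql`; all table and column names below are simplified assumptions, exercised here against SQLite purely for demonstration):

```python
import sqlite3

# Hypothetical, simplified version of the star schema: four dimension
# tables plus a central fact table referencing each of them. The real
# schema in sql/schema.sql will differ in names, types, and columns.
DDL = """
CREATE TABLE users    (user_id     INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE content  (content_id  INTEGER PRIMARY KEY, likes INTEGER, upload_time TEXT);
CREATE TABLE places   (place_id    INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE property (property_id INTEGER PRIMARY KEY, listing_name TEXT);
CREATE TABLE fact_interactions (
    fact_id     INTEGER PRIMARY KEY,
    user_id     INTEGER REFERENCES users(user_id),
    content_id  INTEGER REFERENCES content(content_id),
    place_id    INTEGER REFERENCES places(place_id),
    property_id INTEGER REFERENCES property(property_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
```

The key property of the design is that every analytical query starts from the fact table and joins out to at most one level of dimensions.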
- Python 3.13 or higher
- PostgreSQL (for OLAP database)
- Git
- uv for Python dependency management
- Clone the repository:

  ```shell
  git clone https://github.com/sanchitram1/215-project-part-2.git
  cd 215-project-part-2
  ```

- Set up the Python environment with uv:

  ```shell
  uv sync
  ```

- Configure the database connection by creating a `.env` file in the project root (or setting the environment variables directly):

  ```shell
  cp .env.example .env
  ```

  Then set the environment variable values in `.env`.
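Once `.env` is populated, the pipeline reads its settings from the environment. A minimal sketch of that lookup (`OLAP_DATABASE_URL` is the variable the project's xc tasks use; the error message wording is an assumption):

```python
import os

def get_database_url() -> str:
    """Read the OLAP connection string from the environment.

    OLAP_DATABASE_URL is the variable referenced by the xc tasks at the
    bottom of this README; how the real pipeline loads it may differ.
    """
    url = os.environ.get("OLAP_DATABASE_URL")
    if not url:
        raise RuntimeError(
            "OLAP_DATABASE_URL is not set; create .env from .env.example"
        )
    return url
```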
You also need a local instance of Postgres running on your computer. Using xc trivializes this: you can just run `xc db-start`, or any of the other xc tasks at the bottom of this README.
Execute the complete ETL pipeline:

```shell
uv run pipeline/main.py
```

This will:
- Extract data from the OLTP source
- Transform and validate the data
- Load into the OLAP database
- Display progress and completion status
Code Formatting:

```shell
ruff format .
```

Code Quality Checks:

```shell
ruff check . --fix
```

Running Tests:

```shell
pytest .
```

Adding Dependencies:

```shell
uv add package_name
```

```
├── pipeline/            # ETL pipeline modules
│   ├── extract.py       # Data extraction from OLTP
│   ├── transform.py     # Data transformation logic
│   ├── load.py          # Data loading to OLAP
│   └── __init__.py
├── sql/
│   ├── schema.sql       # OLAP database schema
│   └── OLAP schema.jpg  # Visual schema diagram
├── tests/               # Unit and integration tests
│   ├── test_pipeline.py
│   ├── test_transform.py
│   └── __init__.py
├── main.py              # Entry point - orchestrates ETL
├── pyproject.toml       # Python project configuration
├── README.md            # This file
└── AGENTS.md            # Developer guide and coding standards
```
- Language: Python 3.13+
- Package Manager: `uv` (modern, fast Python package manager)
- Linting & Formatting: `ruff` (fast Python linter and formatter)
- Testing: `pytest` (Python testing framework)
- Database Driver: `psycopg2` (PostgreSQL adapter)
- Data Processing: pandas, numpy
extract.py handles the connection to the OLTP source database and retrieves raw data, returning pandas DataFrames with unmodified source data.
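The extraction step can be sketched as a function that takes an open DB-API connection and pulls a table into a DataFrame unchanged. The demo below runs against an in-memory SQLite database purely for illustration; the real source is the OLTP Postgres, and the actual module's API may differ:

```python
import sqlite3

import pandas as pd

def extract_table(conn, table: str) -> pd.DataFrame:
    # Pull the whole table unmodified; all cleaning happens in transform.
    return pd.read_sql(f"SELECT * FROM {table}", conn)

# Demo source (hypothetical table and rows, standing in for the OLTP DB).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ada"), (2, "bob")])
users = extract_table(conn, "users")
```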
transform.py applies business logic and data transformation:
- Data cleaning and validation
- Deduplication
- Aggregation for dimension tables
- Schema alignment with OLAP model
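The cleaning, validation, and deduplication steps above can be sketched with pandas. Column names here (`user_id`, `name`) are assumptions for illustration, not the real schema:

```python
import pandas as pd

def transform_users(raw: pd.DataFrame) -> pd.DataFrame:
    """Clean and deduplicate the users dimension (columns hypothetical)."""
    out = (
        raw.dropna(subset=["user_id"])           # validation: key must exist
           .drop_duplicates(subset=["user_id"])  # deduplication
           .copy()
    )
    out["name"] = out["name"].str.strip()        # basic cleaning
    return out.reset_index(drop=True)
```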
load.py loads transformed data into the OLAP database:
- Inserts into dimension tables (users, content, places, property)
- Inserts into fact table
- Handles conflicts and duplicate prevention
- Logs success/failure status
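Duplicate prevention is typically done with an upsert. The sketch below uses `ON CONFLICT ... DO NOTHING`, which works in both PostgreSQL and SQLite (the demo runs on SQLite; note the real module with psycopg2 would use `%s` placeholders instead of `?`, and the table and columns here are hypothetical):

```python
import sqlite3

def load_users(conn, rows):
    # Insert user rows, silently skipping primary-key duplicates.
    cur = conn.cursor()
    cur.executemany(
        "INSERT INTO users (user_id, name) VALUES (?, ?) "
        "ON CONFLICT (user_id) DO NOTHING",
        rows,
    )
    conn.commit()

# Demo target database; the real target is the OLAP Postgres.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT)")
load_users(conn, [(1, "ada"), (1, "ada-duplicate"), (2, "bob")])
```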
main.py orchestrates the complete pipeline:
- Chains extract → transform → load
- Handles configuration and environment setup
- Reports overall status and metrics
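The extract → transform → load chaining reduces to a small orchestration function. This is a sketch of the control flow only (the real entry point also handles configuration and logging):

```python
def run_pipeline(extract, transform, load):
    """Chain the three ETL stages; each argument is a callable."""
    raw = extract()
    clean = transform(raw)
    load(clean)
    # Report a simple metric so callers can log completion status.
    return {"rows_loaded": len(clean)}
```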
Tests are organized by module and use pytest:

```shell
# Run all tests with coverage
pytest --cov=pipeline tests/

# Run a specific test file
pytest tests/test_transform.py -v

# Run a specific test function
pytest tests/test_transform.py::test_transform_users -v
```

See AGENTS.md for detailed coding standards including:
- Python style guidelines (PEP 8 with ruff enforcement)
- Function and class design patterns
- Database interaction best practices
- Testing requirements
- Documentation expectations
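As an illustration of the expected pytest style, a self-contained example (the helper function and its behavior are hypothetical; real tests in `tests/test_transform.py` import the actual pipeline functions):

```python
import pandas as pd

# Hypothetical stand-in for a real transform helper.
def dedupe_users(df: pd.DataFrame) -> pd.DataFrame:
    return df.drop_duplicates(subset=["user_id"]).reset_index(drop=True)

def test_dedupe_users():
    raw = pd.DataFrame({"user_id": [1, 1, 2]})
    assert len(dedupe_users(raw)) == 2
```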
Database Connection Error:

- Verify PostgreSQL is running
- Check that the `.env` file has the correct credentials
- Ensure the OLAP database is created: `psql -U postgres -c "CREATE DATABASE postgres"`

Import Errors:

- Run `uv sync` to ensure dependencies are installed
- Check the Python version: `python --version` (should be 3.13+)
Test Failures:

- Ensure the test database is accessible
- Run `ruff format .` and `ruff check .` to fix style issues
- Check test logs for detailed error messages
- Create a feature branch: `git checkout -b feature/your-feature-name`
- Make changes following the coding standards in AGENTS.md
- Format code: `ruff format .`
- Run tests: `pytest`
- Commit with clear messages
- Push and open a pull request
The dashboard provides visualization of user clusters based on:
- Content: Posts by user engagement metrics
- Place: Geographic distribution of interactions
- Property: Association with specific properties
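One plausible first step toward those clusters is a per-user feature matrix across the three dimensions, built from the fact table. All column names below are assumptions, and the clustering algorithm itself is left to the dashboard module:

```python
import pandas as pd

def user_feature_matrix(fact: pd.DataFrame) -> pd.DataFrame:
    # One row per user, one feature per dimension: how many distinct
    # content items, places, and properties the user interacted with.
    return fact.groupby("user_id").agg(
        content=("content_id", "nunique"),
        place=("place_id", "nunique"),
        property=("property_id", "nunique"),
    )
```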
Run the dashboard (implementation details in the dashboard module):

```shell
python dashboard.py
```

Project for UC Berkeley IND ENG 215
For questions about coding standards or development workflow, refer to AGENTS.md. For database schema details, see `sql/OLAP schema.jpg`.
The xc tasks referenced above:

db-create:

```shell
mkdir -p data/db_files
initdb -D data/db_files
```

db-start:

```shell
pg_ctl -D data/db_files -l data/logfile start
```

db-stop:

```shell
pg_ctl -D data/db_files stop
```

db-kill (Requires: db-stop):

```shell
rm -rf data/db_files
```

Schema load (Requires: db-kill, db-create, db-start):

```shell
psql $OLAP_DATABASE_URL -f sql/OLAP_schema.sql
```

Formatting and linting:

```shell
ruff format .
ruff check . --fix
```