Skip to content

Add GA4 data platform skeleton#6

Merged
RimSM merged 1 commit into
mainfrom
chore/data-platform-skeleton
May 31, 2026
Merged

Add GA4 data platform skeleton#6
RimSM merged 1 commit into
mainfrom
chore/data-platform-skeleton

Conversation

@RimSM

@RimSM RimSM commented May 31, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Add Docker Compose skeleton for Airflow, dbt, Superset, and Postgres
  • Add dbt project structure for GA4 BigQuery modeling
  • Add GA4 source/staging/bronze/silver/marts model skeleton
  • Add Airflow hourly dbt DAG
  • Add operation and architecture docs

Validation

  • docker compose config --quiet

Summary by CodeRabbit

  • New Features

    • Added hourly GA4 event processing pipeline with automated dbt transformations.
    • Created analytics data marts for daily conversions, events, and pageviews aggregations.
    • Built session and user-level analytics models for deeper insights.
  • Documentation

    • Added comprehensive README with project overview and quick-start guide.
    • Documented system architecture and operational runbooks.
  • Chores

    • Set up Docker-based multi-service stack (Airflow, PostgreSQL, Superset).
    • Configured environment variables and database initialization.
    • Added utility scripts for backups and dbt operations.

@coderabbitai

coderabbitai Bot commented May 31, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@RimSM, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 25 minutes and 14 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 592768d3-588f-4d5e-b85f-d73747905819

📥 Commits

Reviewing files that changed from the base of the PR and between 9a5dd0a and 66be308.

📒 Files selected for processing (30)
  • .env.example
  • .gitignore
  • AGENTS.md
  • README.md
  • airflow/dags/.gitkeep
  • airflow/plugins/.gitkeep
  • dbt/koin/.gitkeep
  • dbt/koin/analyses/.gitkeep
  • dbt/koin/macros/.gitkeep
  • dbt/koin/models/bronze/.gitkeep
  • dbt/koin/models/marts/.gitkeep
  • dbt/koin/models/silver/.gitkeep
  • dbt/koin/models/sources/.gitkeep
  • dbt/koin/models/staging/.gitkeep
  • dbt/koin/seeds/.gitkeep
  • dbt/koin/snapshots/.gitkeep
  • dbt/koin/tests/.gitkeep
  • docker-compose.yml
  • docker/airflow/Dockerfile
  • docker/airflow/requirements.txt
  • docker/postgres/init-databases.sh
  • docker/superset/Dockerfile
  • docker/superset/requirements.txt
  • docker/superset/superset_config.py
  • docs/.gitkeep
  • docs/architecture.md
  • docs/operations.md
  • scripts/.gitkeep
  • scripts/backup_postgres.sh
  • scripts/init_airflow.sh
📝 Walkthrough

Walkthrough

This PR establishes a complete GA4 data platform monorepo. It adds dbt project configuration, a multi-layer data transformation pipeline (staging→bronze→silver→marts), Docker Compose infrastructure for Airflow and Superset, and operational scripts. The stack enables hourly GA4 event ingestion from BigQuery through incremental dbt models into analytics-ready tables.

Changes

GA4 Data Platform Setup

Layer / File(s) Summary
Project configuration & environment
.env.example, .gitignore, README.md, AGENTS.md, dbt/bcsd_data/dbt_project.yml, dbt/bcsd_data/packages.yml, dbt/bcsd_data/profiles.yml.example, docs/architecture.md, docs/operations.md
dbt project metadata (layering, incremental settings, lookback_hours variable), environment variable templates for Airflow/Superset/GCP, repository conventions and documentation for GA4 data stack architecture and operations.
dbt macros & staging layer
dbt/bcsd_data/macros/ga4_events_source.sql, dbt/bcsd_data/macros/generate_schema_name.sql, dbt/bcsd_data/models/staging/ga4/stg_ga4__events.sql, dbt/bcsd_data/models/staging/ga4/schema.yml, dbt/bcsd_data/models/sources/ga4_sources.yml
Reusable macros for GA4 BigQuery table filtering and schema name mapping; staging model that normalizes raw GA4 events (timestamp casting, field mapping, not_null tests).
Bronze layer GA4 events
dbt/bcsd_data/models/bronze/ga4/bronze_ga4__events.sql, dbt/bcsd_data/models/bronze/ga4/schema.yml
Deduplicated event records with computed MD5 event_key, parsed date, load timestamp, partitioned by date, clustered by event/user, incremental by lookback_hours with uniqueness and not_null tests.
Silver layer dimensional models
dbt/bcsd_data/models/silver/ga4/silver_ga4__events.sql, dbt/bcsd_data/models/silver/ga4/silver_ga4__event_params.sql, dbt/bcsd_data/models/silver/ga4/silver_ga4__pageviews.sql, dbt/bcsd_data/models/silver/ga4/silver_ga4__sessions.sql, dbt/bcsd_data/models/silver/ga4/silver_ga4__users.sql, dbt/bcsd_data/models/silver/ga4/silver_ga4__conversions.sql, dbt/bcsd_data/models/silver/ga4/schema.yml
Dimensional tables: events (incremental), flattened event parameters with aggregated values, page-view events (left-join params for page fields), sessions (grouped by ga_session_id), user aggregates (first/last seen, event count, active days), and conversion filters. All support incremental processing with lookback windows.
Analytics marts (gold layer)
dbt/bcsd_data/models/marts/marketing/mart_ga4__daily_events.sql, dbt/bcsd_data/models/marts/product/mart_ga4__daily_pageviews.sql, dbt/bcsd_data/models/marts/ecommerce/mart_ga4__daily_conversions.sql, dbt/bcsd_data/models/marts/schema.yml
Daily aggregated tables grouped by event date and domain-specific dimensions (traffic source/medium/campaign, page location, page title, event name), computing event counts and distinct user counts.
Docker infrastructure
docker-compose.yml, docker/airflow/Dockerfile, docker/airflow/requirements.txt, docker/postgres/init-databases.sh, docker/superset/Dockerfile, docker/superset/requirements.txt, docker/superset/superset_config.py
Multi-container stack: Postgres 16 for metadata with initialization script, Airflow 2.9.3 with dbt/Google Cloud providers, Superset 4.0.2 with BigQuery driver. All services share Postgres metadata store and mount dbt/secrets/logs volumes.
Airflow orchestration & scripts
airflow/dags/ga4_hourly_dbt.py, scripts/init_airflow.sh, scripts/run_dbt.sh, scripts/backup_postgres.sh
Hourly DAG with sequential dbt deps and dbt build tag:ga4 tasks, plus helper scripts: Airflow initialization, dbt command runner, Postgres backup utility.
Directory placeholders
airflow/plugins/.gitkeep, dbt/bcsd_data/analyses/.gitkeep, dbt/bcsd_data/macros/.gitkeep, dbt/bcsd_data/seeds/.gitkeep, dbt/bcsd_data/snapshots/.gitkeep, dbt/bcsd_data/tests/.gitkeep, docs/.gitkeep, scripts/.gitkeep
Empty .gitkeep files preserving directory structure in version control.

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~60 minutes


🐰 A platform born from GA4's endless stream,
Through dbt layers we transform the dream—
Bronze to silver, then gold so bright,
Airflow keeps the schedule tight,
Superset shows what's in the night! ✨📊

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Add GA4 data platform skeleton' directly summarizes the main change: introducing a complete GA4 data platform infrastructure with Docker, dbt, Airflow, and Superset components.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch chore/data-platform-skeleton

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@STY0978 STY0978 self-requested a review May 31, 2026 12:26

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 10

🧹 Nitpick comments (7)
dbt/bcsd_data/models/sources/ga4_sources.yml (1)

8-14: ⚖️ Poor tradeoff

Declared source is not actually referenced; lineage is bypassed.

The ga4 source is defined here, but ga4_events_source_sql() queries `{{ project }}.{{ dataset }}.events_*` directly rather than via {{ source('ga4', 'events') }}. As a result dbt's lineage graph, source freshness checks, and the Downstream models should not query raw GA4 export tables directly guideline are not enforced for this path. Consider routing the wildcard scan through the declared source so the dependency is tracked.

As per coding guidelines: "Downstream models should not query raw GA4 export tables directly."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dbt/bcsd_data/models/sources/ga4_sources.yml` around lines 8 - 14, The ga4
source is declared but ga4_events_source_sql() currently queries `{{ project
}}.{{ dataset }}.events_*` directly, bypassing dbt lineage and freshness; update
ga4_events_source_sql() to reference the declared source by using {{
source('ga4', 'events') }} (and similarly handle intraday if applicable) and
construct the wildcard scan from that source reference instead of hardcoding
project/dataset/table patterns so dbt tracks the dependency and enforces source
checks.
dbt/bcsd_data/macros/ga4_events_source.sql (1)

4-10: 💤 Low value

Lookback conversion drops sub-day granularity; intended as a buffer?

(lookback_hours | int) // 24) + 3 floors any value under 24h to 0, so the effective table-pruning window is always at least 3 days regardless of lookback_hours. This is fine as a partition-pruning safety margin (the precise hour filter happens downstream in bronze), but the relationship is non-obvious. A short comment would help future maintainers.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dbt/bcsd_data/macros/ga4_events_source.sql` around lines 4 - 10, The
lookback_days calculation (variables lookback_hours and lookback_days)
intentionally floors lookback_hours to days via ((lookback_hours | int) // 24) +
3 which causes any <24h value to become 3 days as a safety margin; add a short
clarifying comment above these lines explaining that this behavior is deliberate
(to ensure partition-pruning safety) and that finer hourly filtering is applied
later in the bronze pipeline, so maintainers understand the apparent mismatch
between lookback_hours and table-pruning window.
dbt/bcsd_data/models/silver/ga4/schema.yml (1)

25-28: ⚡ Quick win

Add event_key tests for silver_ga4__pageviews and silver_ga4__conversions.

Both models declare unique_key="event_key" but, unlike the other silver models, have no not_null/unique tests. Adding them validates the incremental grain and catches duplicate keys early.

♻️ Add column tests
   - name: silver_ga4__pageviews
     description: "Page view events with page parameters."
+    columns:
+      - name: event_key
+        tests:
+          - not_null
+          - unique
   - name: silver_ga4__conversions
     description: "Conversion events configured by `conversion_events` var."
+    columns:
+      - name: event_key
+        tests:
+          - not_null
+          - unique
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dbt/bcsd_data/models/silver/ga4/schema.yml` around lines 25 - 28, The
schema.yml is missing column tests for event_key on models silver_ga4__pageviews
and silver_ga4__conversions even though both use unique_key="event_key"; add dbt
column tests by adding not_null and unique tests for the column event_key under
each model entry (silver_ga4__pageviews and silver_ga4__conversions) so
duplicates and nulls are caught during build and validate the incremental grain.
dbt/bcsd_data/models/silver/ga4/silver_ga4__pageviews.sql (1)

9-37: ⚡ Quick win

No incremental filter → full rebuild of pageviews each hourly run.

This model carries an incremental config (unique_key, partition_by) but, unlike silver_ga4__events, has no is_incremental() lookback guard. On every hourly run it scans the entire silver_ga4__events table and pivots the full silver_ga4__event_params table, which scans all params for all event types (not just page_view). On BigQuery this is a recurring full-table cost that grows with history. Since this is a 1:1 row mapping from events, add a lookback filter mirroring the upstream model.

♻️ Add incremental lookback and constrain the param pivot
 with events as (
   select * from {{ ref('silver_ga4__events') }}
+  {% if is_incremental() %}
+  where event_timestamp >= timestamp_sub(current_timestamp(), interval {{ var('lookback_hours', 6) }} hour)
+  {% endif %}
 ),

 params as (
   select * from {{ ref('silver_ga4__event_params') }}
+  {% if is_incremental() %}
+  where event_timestamp >= timestamp_sub(current_timestamp(), interval {{ var('lookback_hours', 6) }} hour)
+  {% endif %}
 ),
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dbt/bcsd_data/models/silver/ga4/silver_ga4__pageviews.sql` around lines 9 -
37, The model lacks an is_incremental() lookback so hourly runs rebuild/pivot
all events and params; update the SQL to apply the same incremental filter used
in silver_ga4__events: in the events CTE add a guard like where (not
is_incremental() or event_date_dt >= date_sub(current_date(), interval 1 day))
(or mirror the exact partition lookback used upstream), and restrict the pivoted
CTE to only params for those events (e.g., build pivoted from params join events
using event_key or add where param_event_key in (select event_key from events)
and also filter param_key to only the three page_* keys) so only recent
page_view events and their params are scanned during incremental runs.
dbt/bcsd_data/models/marts/schema.yml (1)

3-9: ⚡ Quick win

Add basic column tests for mart outputs.

Descriptions are helpful, but marts feeding BI should include at least not_null tests on date/metric columns to catch silent data regressions early.

Proposed schema test scaffold
 version: 2

 models:
   - name: mart_ga4__daily_events
     description: "Daily event counts for Superset."
+    columns:
+      - name: event_date_dt
+        tests: [not_null]
+      - name: event_count
+        tests: [not_null]
   - name: mart_ga4__daily_pageviews
     description: "Daily pageview counts for Superset."
+    columns:
+      - name: event_date_dt
+        tests: [not_null]
+      - name: pageview_count
+        tests: [not_null]
+      - name: user_count
+        tests: [not_null]
   - name: mart_ga4__daily_conversions
     description: "Daily conversion counts for Superset."
+    columns:
+      - name: event_date_dt
+        tests: [not_null]
+      - name: conversion_count
+        tests: [not_null]
+      - name: user_count
+        tests: [not_null]

Based on learnings: "Superset should read only marts or gold tables from dbt".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dbt/bcsd_data/models/marts/schema.yml` around lines 3 - 9, Add basic dbt
column tests (not_null) to each mart model to prevent silent regressions: for
mart_ga4__daily_events, mart_ga4__daily_pageviews, and
mart_ga4__daily_conversions, declare a columns block in schema.yml and add
not_null tests for the date column and for the key metric columns (e.g., event
count, pageviews, conversions — replace with the actual metric column names used
in each model). Ensure each model entry uses the exact model names
mart_ga4__daily_events, mart_ga4__daily_pageviews, mart_ga4__daily_conversions
and add tests under columns: - name: <date_column> tests: - not_null and - name:
<metric_column> tests: - not_null so CI/dbt will fail when these critical fields
are null.
docker-compose.yml (1)

11-13: ⚡ Quick win

Celery result backend is set but no Celery executor/broker is configured.

AIRFLOW__CELERY__RESULT_BACKEND is defined, yet there is no AIRFLOW__CORE__EXECUTOR, broker, or worker service. Airflow defaults to SequentialExecutor, so this setting is unused and misleading. For a Postgres-backed single-host MVP, set LocalExecutor explicitly (and drop the unused Celery backend) to get parallel task execution and a clear config intent.

♻️ Suggested change
   environment:
+    AIRFLOW__CORE__EXECUTOR: LocalExecutor
     AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://${BCSD_DATA_POSTGRES_USER:-bcsd}:${BCSD_DATA_POSTGRES_PASSWORD:-change_me}`@postgres`:5432/${BCSD_DATA_AIRFLOW_DB:-airflow_metadata}
-    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://${BCSD_DATA_POSTGRES_USER:-bcsd}:${BCSD_DATA_POSTGRES_PASSWORD:-change_me}`@postgres`:5432/${BCSD_DATA_AIRFLOW_DB:-airflow_metadata}
     DBT_PROFILES_DIR: /opt/airflow/dbt/bcsd_data
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docker-compose.yml` around lines 11 - 13, The compose environment sets
AIRFLOW__CELERY__RESULT_BACKEND but no Celery broker/executor/worker exists;
remove the unused AIRFLOW__CELERY__RESULT_BACKEND entry and instead explicitly
set AIRFLOW__CORE__EXECUTOR=LocalExecutor next to
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN so Airflow runs with LocalExecutor for
parallel tasks; update the service environment to drop the Celery backend
variable and add the AIRFLOW__CORE__EXECUTOR key (referencing the
AIRFLOW__CELERY__RESULT_BACKEND and AIRFLOW__CORE__EXECUTOR variables to find
where to change).
airflow/dags/ga4_hourly_dbt.py (1)

18-29: ⚡ Quick win

Add retries and an execution timeout for the dbt tasks.

Both BashOperators invoke external systems (dbt deps fetches packages over the network; dbt build hits BigQuery). Without retries/retry_delay a transient network or BQ hiccup fails the hourly run, and without execution_timeout a hung dbt invocation can block the single allowed active run indefinitely (max_active_runs=1). Add default_args to the DAG.

♻️ Suggested change
+from datetime import datetime, timedelta
-from datetime import datetime
 ...
 with DAG(
     dag_id="ga4_hourly_dbt",
     description="Run dbt models for recent GA4 BigQuery export data.",
     start_date=datetime(2026, 1, 1),
     schedule="`@hourly`",
     catchup=False,
     max_active_runs=1,
+    default_args={
+        "retries": 2,
+        "retry_delay": timedelta(minutes=5),
+        "execution_timeout": timedelta(minutes=30),
+    },
     tags=["ga4", "dbt", "bigquery"],
 ) as dag:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@airflow/dags/ga4_hourly_dbt.py` around lines 18 - 29, Add durable retry and
timeout settings by defining default_args on the DAG and applying them to the
dbt BashOperators: create a default_args dict (e.g., include retries,
retry_delay using datetime.timedelta, and execution_timeout using
datetime.timedelta) and pass it into the DAG constructor (or add the same keys
to the dbt_deps and dbt_build operator instantiations). Specifically ensure
dbt_deps and dbt_build get retries (e.g., 1–3), a sensible retry_delay (e.g., 5
minutes) and an execution_timeout (e.g., 15–30 minutes) so transient
network/BigQuery failures and hung dbt calls don’t kill or block hourly runs
(respecting max_active_runs=1).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.env.example:
- Around line 14-31: Update the .env.example to use the mandated BCSD_DATA_
prefix for all configuration keys instead of raw AIRFLOW_* and SUPERSET_* names
(e.g., replace AIRFLOW__CORE__EXECUTOR,
AIRFLOW_ADMIN_USERNAME/PASSWORD/FIRSTNAME/LASTNAME/EMAIL and
SUPERSET_SECRET_KEY, SUPERSET_ADMIN_* with BCSD_DATA_ prefixed equivalents), and
ensure the Compose/startup layer maps those BCSD_DATA_* env vars back to the
native names when launching services (so look for places that read AIRFLOW__* or
SUPERSET_* and change them to use BCSD_DATA_* mapping at service startup).

In `@dbt/bcsd_data/models/bronze/ga4/bronze_ga4__events.sql`:
- Around line 15-21: The current event_key hash (constructed from
event_timestamp, user_pseudo_id, event_name) is not sufficiently unique; update
the event_key generation in bronze_ga4__events.sql (the to_hex(md5(concat(...)))
expression) to include additional disambiguating identifiers such as
event_bundle_sequence_id and/or ga_session_id, or incorporate a deterministic
row discriminator (e.g., a row_number() over the natural key partition) into the
concatenated string so the resulting event_key is truly unique for each physical
event; ensure this change aligns with the unique_key used by the incremental
merge and the unique test in schema.yml.
- Around line 1-7: The config block incorrectly claims
materialized='incremental' is needed; remove that misleading claim and rely on
the inherited +materialized: incremental so is_incremental() continues to work;
also add a short inline comment near the config or the event_key definition
(event_key = md5(event_timestamp || user_pseudo_id || event_name)) calling out
the collision/data-loss risk when using incremental_strategy: merge with a
unique test on bronze_ga4__events.event_key so reviewers know duplicates at that
grain may be collapsed silently.

In `@dbt/bcsd_data/models/silver/ga4/silver_ga4__sessions.sql`:
- Around line 25-37: The current session_key uses coalesce(sp.ga_session_id, '')
which collapses all null-session events for a user into one synthetic session;
change the logic in the joined CTE to either exclude rows where sp.ga_session_id
IS NULL from the session aggregation (i.e., add WHERE sp.ga_session_id IS NOT
NULL before grouping) or make null sessions unique per event/window by replacing
coalesce(sp.ga_session_id, '') with a distinct fallback (for example using
sp.ga_session_id || '|' || e.event_key or including a truncated
event_timestamp/event_date) so that session_key (the to_hex(md5(...))
expression) no longer conflates unrelated events for the same user_pseudo_id.

In `@dbt/bcsd_data/models/staging/ga4/schema.yml`:
- Around line 16-18: The current schema test enforces not_null on
stg_ga4__events.user_pseudo_id which fails when GA4 Advanced Consent Mode emits
cookieless pings; update the test in schema.yml for the field user_pseudo_id
(under the stg_ga4__events model) to relax the constraint—either remove the
not_null test or change it to a warning (e.g., set not_null: { severity: warn })
so null user_pseudo_id values no longer fail the build.

In `@docker-compose.yml`:
- Around line 58-73: The webserver and scheduler services currently only depend
on the database health and can start before the one-time init job finishes;
update both airflow-webserver and airflow-scheduler service blocks to add an
explicit depends_on entry for airflow-init with the
service_completed_successfully condition (e.g., depends_on: airflow-init:
condition: service_completed_successfully) so they wait for the init job
(airflow db migrate + admin creation) to finish before starting; keep the
existing postgres service_healthy dependency as well.

In `@docker/superset/superset_config.py`:
- Line 3: SECRET_KEY currently defaults to an insecure literal; change behavior
in superset_config.py so that SECRET_KEY is obtained from
os.environ["SUPERSET_SECRET_KEY"] (or explicitly validated) and the process
fails fast with a clear error if the variable is missing/empty instead of
falling back to "change_me"; update the SECRET_KEY assignment and add a check
that raises a RuntimeError (or exits) with a descriptive message referencing
SUPERSET_SECRET_KEY so deployments cannot accidentally run with a public key.

In `@scripts/backup_postgres.sh`:
- Line 4: The BACKUP_DIR env var doesn't follow the required BCSD_DATA_ prefix;
rename the variable to BCSD_DATA_BACKUP_DIR and update its default assignment
(replace BACKUP_DIR="${BACKUP_DIR:-backups/postgres}" with
BCSD_DATA_BACKUP_DIR="${BCSD_DATA_BACKUP_DIR:-backups/postgres}"), then replace
all uses of BACKUP_DIR in the script with BCSD_DATA_BACKUP_DIR and update any
related docs/env examples to the new variable name.

In `@scripts/run_dbt.sh`:
- Line 7: The docker compose invocation in scripts/run_dbt.sh hardcodes
/opt/airflow/dbt/bcsd_data; change it to use a path relative to the
repository/script location instead: compute the script directory (use the
directory of scripts/run_dbt.sh as the base) and build a relative path to
dbt/bcsd_data, then use that variable in the docker compose run ... bash -lc "cd
<relative-path> && dbt ${COMMAND} $*"; update the docker compose command to
reference this relative-path variable (and keep using the existing COMMAND and
argument expansion).
- Line 7: The current line embeds unquoted ${COMMAND} and $* into a
double-quoted bash -lc string, enabling argument injection and word-splitting;
change the invocation to pass arguments as argv and quote them: stop
interpolating ${COMMAND} and $* inside the command string, use "$COMMAND" and
"$@" (not $*) as positional parameters, call bash -lc with a single-quoted
command that uses exec dbt "$@" (or cd ... && exec dbt "$@"), and pass the
command and remaining args after a -- so the container receives them as argv
rather than being re-parsed by the outer shell.

---

Nitpick comments:
In `@airflow/dags/ga4_hourly_dbt.py`:
- Around line 18-29: Add durable retry and timeout settings by defining
default_args on the DAG and applying them to the dbt BashOperators: create a
default_args dict (e.g., include retries, retry_delay using datetime.timedelta,
and execution_timeout using datetime.timedelta) and pass it into the DAG
constructor (or add the same keys to the dbt_deps and dbt_build operator
instantiations). Specifically ensure dbt_deps and dbt_build get retries (e.g.,
1–3), a sensible retry_delay (e.g., 5 minutes) and an execution_timeout (e.g.,
15–30 minutes) so transient network/BigQuery failures and hung dbt calls don’t
kill or block hourly runs (respecting max_active_runs=1).

In `@dbt/bcsd_data/macros/ga4_events_source.sql`:
- Around line 4-10: The lookback_days calculation (variables lookback_hours and
lookback_days) intentionally floors lookback_hours to days via ((lookback_hours
| int) // 24) + 3 which causes any <24h value to become 3 days as a safety
margin; add a short clarifying comment above these lines explaining that this
behavior is deliberate (to ensure partition-pruning safety) and that finer
hourly filtering is applied later in the bronze pipeline, so maintainers
understand the apparent mismatch between lookback_hours and table-pruning
window.

In `@dbt/bcsd_data/models/marts/schema.yml`:
- Around line 3-9: Add basic dbt column tests (not_null) to each mart model to
prevent silent regressions: for mart_ga4__daily_events,
mart_ga4__daily_pageviews, and mart_ga4__daily_conversions, declare a columns
block in schema.yml and add not_null tests for the date column and for the key
metric columns (e.g., event count, pageviews, conversions — replace with the
actual metric column names used in each model). Ensure each model entry uses the
exact model names mart_ga4__daily_events, mart_ga4__daily_pageviews,
mart_ga4__daily_conversions and add tests under columns: - name: <date_column>
tests: - not_null and - name: <metric_column> tests: - not_null so CI/dbt will
fail when these critical fields are null.

In `@dbt/bcsd_data/models/silver/ga4/schema.yml`:
- Around line 25-28: The schema.yml is missing column tests for event_key on
models silver_ga4__pageviews and silver_ga4__conversions even though both use
unique_key="event_key"; add dbt column tests by adding not_null and unique tests
for the column event_key under each model entry (silver_ga4__pageviews and
silver_ga4__conversions) so duplicates and nulls are caught during build and
validate the incremental grain.

In `@dbt/bcsd_data/models/silver/ga4/silver_ga4__pageviews.sql`:
- Around line 9-37: The model lacks an is_incremental() lookback so hourly runs
rebuild/pivot all events and params; update the SQL to apply the same
incremental filter used in silver_ga4__events: in the events CTE add a guard
like where (not is_incremental() or event_date_dt >= date_sub(current_date(),
interval 1 day)) (or mirror the exact partition lookback used upstream), and
restrict the pivoted CTE to only params for those events (e.g., build pivoted
from params join events using event_key or add where param_event_key in (select
event_key from events) and also filter param_key to only the three page_* keys)
so only recent page_view events and their params are scanned during incremental
runs.

In `@dbt/bcsd_data/models/sources/ga4_sources.yml`:
- Around line 8-14: The ga4 source is declared but ga4_events_source_sql()
currently queries `{{ project }}.{{ dataset }}.events_*` directly, bypassing dbt
lineage and freshness; update ga4_events_source_sql() to reference the declared
source by using {{ source('ga4', 'events') }} (and similarly handle intraday if
applicable) and construct the wildcard scan from that source reference instead
of hardcoding project/dataset/table patterns so dbt tracks the dependency and
enforces source checks.

In `@docker-compose.yml`:
- Around line 11-13: The compose environment sets
AIRFLOW__CELERY__RESULT_BACKEND but no Celery broker/executor/worker exists;
remove the unused AIRFLOW__CELERY__RESULT_BACKEND entry and instead explicitly
set AIRFLOW__CORE__EXECUTOR=LocalExecutor next to
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN so Airflow runs with LocalExecutor for
parallel tasks; update the service environment to drop the Celery backend
variable and add the AIRFLOW__CORE__EXECUTOR key (referencing the
AIRFLOW__CELERY__RESULT_BACKEND and AIRFLOW__CORE__EXECUTOR variables to find
where to change).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3709be80-f7a1-4a19-b866-33971abab53d

📥 Commits

Reviewing files that changed from the base of the PR and between b960e60 and 9a5dd0a.

📒 Files selected for processing (46)
  • .env.example
  • .gitignore
  • AGENTS.md
  • README.md
  • airflow/dags/ga4_hourly_dbt.py
  • airflow/plugins/.gitkeep
  • dbt/bcsd_data/analyses/.gitkeep
  • dbt/bcsd_data/dbt_project.yml
  • dbt/bcsd_data/macros/.gitkeep
  • dbt/bcsd_data/macros/ga4_events_source.sql
  • dbt/bcsd_data/macros/generate_schema_name.sql
  • dbt/bcsd_data/models/bronze/ga4/bronze_ga4__events.sql
  • dbt/bcsd_data/models/bronze/ga4/schema.yml
  • dbt/bcsd_data/models/marts/ecommerce/mart_ga4__daily_conversions.sql
  • dbt/bcsd_data/models/marts/marketing/mart_ga4__daily_events.sql
  • dbt/bcsd_data/models/marts/product/mart_ga4__daily_pageviews.sql
  • dbt/bcsd_data/models/marts/schema.yml
  • dbt/bcsd_data/models/silver/ga4/schema.yml
  • dbt/bcsd_data/models/silver/ga4/silver_ga4__conversions.sql
  • dbt/bcsd_data/models/silver/ga4/silver_ga4__event_params.sql
  • dbt/bcsd_data/models/silver/ga4/silver_ga4__events.sql
  • dbt/bcsd_data/models/silver/ga4/silver_ga4__pageviews.sql
  • dbt/bcsd_data/models/silver/ga4/silver_ga4__sessions.sql
  • dbt/bcsd_data/models/silver/ga4/silver_ga4__users.sql
  • dbt/bcsd_data/models/sources/ga4_sources.yml
  • dbt/bcsd_data/models/staging/ga4/schema.yml
  • dbt/bcsd_data/models/staging/ga4/stg_ga4__events.sql
  • dbt/bcsd_data/packages.yml
  • dbt/bcsd_data/profiles.yml.example
  • dbt/bcsd_data/seeds/.gitkeep
  • dbt/bcsd_data/snapshots/.gitkeep
  • dbt/bcsd_data/tests/.gitkeep
  • docker-compose.yml
  • docker/airflow/Dockerfile
  • docker/airflow/requirements.txt
  • docker/postgres/init-databases.sh
  • docker/superset/Dockerfile
  • docker/superset/requirements.txt
  • docker/superset/superset_config.py
  • docs/.gitkeep
  • docs/architecture.md
  • docs/operations.md
  • scripts/.gitkeep
  • scripts/backup_postgres.sh
  • scripts/init_airflow.sh
  • scripts/run_dbt.sh

Comment thread .env.example
Comment on lines +14 to +31
AIRFLOW_UID=50000
AIRFLOW__CORE__EXECUTOR=LocalExecutor
AIRFLOW__CORE__LOAD_EXAMPLES=false
AIRFLOW__WEBSERVER__EXPOSE_CONFIG=false
AIRFLOW__WEBSERVER__SECRET_KEY=change_me
AIRFLOW_ADMIN_USERNAME=admin
AIRFLOW_ADMIN_PASSWORD=admin
AIRFLOW_ADMIN_FIRSTNAME=BCSD
AIRFLOW_ADMIN_LASTNAME=Admin
AIRFLOW_ADMIN_EMAIL=admin@example.com

# Superset
SUPERSET_SECRET_KEY=change_me
SUPERSET_ADMIN_USERNAME=admin
SUPERSET_ADMIN_PASSWORD=admin
SUPERSET_ADMIN_FIRSTNAME=BCSD
SUPERSET_ADMIN_LASTNAME=Admin
SUPERSET_ADMIN_EMAIL=admin@example.com

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Unprefixed configuration variables break the repo env naming rule.

AIRFLOW_* / SUPERSET_* keys in this template violate the required BCSD_DATA_ prefix convention for configuration. Please expose prefixed config keys here and map them to tool-native env names where they are consumed (e.g., Compose/service startup layer).

As per coding guidelines: "Use BCSD_DATA_ as the environment variable prefix for all configuration".

🧰 Tools
🪛 dotenv-linter (4.0.0)

[warning] 19-19: [UnorderedKey] The AIRFLOW_ADMIN_USERNAME key should go before the AIRFLOW_UID key

(UnorderedKey)


[warning] 20-20: [UnorderedKey] The AIRFLOW_ADMIN_PASSWORD key should go before the AIRFLOW_ADMIN_USERNAME key

(UnorderedKey)


[warning] 21-21: [UnorderedKey] The AIRFLOW_ADMIN_FIRSTNAME key should go before the AIRFLOW_ADMIN_PASSWORD key

(UnorderedKey)


[warning] 22-22: [UnorderedKey] The AIRFLOW_ADMIN_LASTNAME key should go before the AIRFLOW_ADMIN_PASSWORD key

(UnorderedKey)


[warning] 23-23: [UnorderedKey] The AIRFLOW_ADMIN_EMAIL key should go before the AIRFLOW_ADMIN_FIRSTNAME key

(UnorderedKey)


[warning] 27-27: [UnorderedKey] The SUPERSET_ADMIN_USERNAME key should go before the SUPERSET_SECRET_KEY key

(UnorderedKey)


[warning] 28-28: [UnorderedKey] The SUPERSET_ADMIN_PASSWORD key should go before the SUPERSET_ADMIN_USERNAME key

(UnorderedKey)


[warning] 29-29: [UnorderedKey] The SUPERSET_ADMIN_FIRSTNAME key should go before the SUPERSET_ADMIN_PASSWORD key

(UnorderedKey)


[warning] 30-30: [UnorderedKey] The SUPERSET_ADMIN_LASTNAME key should go before the SUPERSET_ADMIN_PASSWORD key

(UnorderedKey)


[warning] 31-31: [UnorderedKey] The SUPERSET_ADMIN_EMAIL key should go before the SUPERSET_ADMIN_FIRSTNAME key

(UnorderedKey)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.env.example around lines 14 - 31, Update the .env.example to use the
mandated BCSD_DATA_ prefix for all configuration keys instead of raw AIRFLOW_*
and SUPERSET_* names (e.g., replace AIRFLOW__CORE__EXECUTOR,
AIRFLOW_ADMIN_USERNAME/PASSWORD/FIRSTNAME/LASTNAME/EMAIL and
SUPERSET_SECRET_KEY, SUPERSET_ADMIN_* with BCSD_DATA_ prefixed equivalents), and
ensure the Compose/startup layer maps those BCSD_DATA_* env vars back to the
native names when launching services (so look for places that read AIRFLOW__* or
SUPERSET_* and change them to use BCSD_DATA_* mapping at service startup).

Comment on lines +1 to +7
{{
config(
unique_key="event_key",
partition_by={"field": "event_date_dt", "data_type": "date"},
cluster_by=["event_name", "user_pseudo_id"]
)
}}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
fd -t f 'dbt_project.yml' --exec cat {}

Repository: BCSDLab/KOIN_DATA

Length of output: 880


🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n "event_key" dbt/bcsd_data -S

Repository: BCSDLab/KOIN_DATA

Length of output: 1425


🏁 Script executed:

#!/bin/bash
set -euo pipefail
# Find schema.yml related to this bronze_ga4 events model
fd -t f "schema.yml" dbt/bcsd_data -S | sed 's|^|SCHEMA: |'

Repository: BCSDLab/KOIN_DATA

Length of output: 171


🏁 Script executed:

#!/bin/bash
set -euo pipefail
# Show schema blocks mentioning this model or event_key
fd -t f "schema.yml" dbt/bcsd_data -S -x sh -c 'rg -n --no-heading "bronze_ga4__events|ga4__events|event_key" "$1" || true' sh

Repository: BCSDLab/KOIN_DATA

Length of output: 197


🏁 Script executed:

#!/bin/bash
set -euo pipefail
# Inspect the full model for how event_key is computed and how is_incremental/is_incremental filter is applied
fd -t f "bronze_ga4__events.sql" dbt/bcsd_data -S | while read -r f; do
  echo "FILE: $f"
  # show first 120 lines with numbers
  nl -ba "$f" | sed -n '1,160p'
done

Repository: BCSDLab/KOIN_DATA

Length of output: 171


🏁 Script executed:

#!/bin/bash
set -euo pipefail
# Locate where is_incremental() is used inside this model (around line 48 as mentioned)
fd -t f "bronze_ga4__events.sql" dbt/bcsd_data -S | while read -r f; do
  echo "IS_INCREMENTAL CONTEXT: $f"
  nl -ba "$f" | sed -n '35,80p'
done

Repository: BCSDLab/KOIN_DATA

Length of output: 171


🏁 Script executed:

#!/bin/bash
set -euo pipefail
# Locate any tests enforcing uniqueness on event_key for GA4 events
fd -t f "tests" dbt/bcsd_data -S || true
rg -n "unique.*event_key|event_key.*unique" dbt/bcsd_data -S || true

Repository: BCSDLab/KOIN_DATA

Length of output: 537


🏁 Script executed:

#!/bin/bash
set -euo pipefail
# Inspect bronze GA4 schema.yml for tests/unique constraints on event_key
fd -t f "schema.yml" dbt/bcsd_data/models/bronze/ga4 --exec sh -c 'echo "--- {} ---"; cat "{}"' || true

Repository: BCSDLab/KOIN_DATA

Length of output: 446


🏁 Script executed:

#!/bin/bash
set -euo pipefail
# Inspect bronze_ga4__events.sql around is_incremental and event_key computation
nl -ba dbt/bcsd_data/models/bronze/ga4/bronze_ga4__events.sql | sed -n '1,140p'

Repository: BCSDLab/KOIN_DATA

Length of output: 104


🏁 Script executed:

#!/bin/bash
set -euo pipefail
# Look specifically for uniqueness tests on event_key anywhere under bronze/ga4
rg -n --no-heading "event_key" dbt/bcsd_data/models/bronze/ga4/schema.yml dbt/bcsd_data/models/bronze/ga4 -S

Repository: BCSDLab/KOIN_DATA

Length of output: 364


🏁 Script executed:

#!/bin/bash
set -euo pipefail
# show bronze_ga4__events.sql with line numbers using cat -n (since nl missing)
cat -n dbt/bcsd_data/models/bronze/ga4/bronze_ga4__events.sql | sed -n '1,160p'

Repository: BCSDLab/KOIN_DATA

Length of output: 1446


🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n "is_incremental\\(|event_key\\s*=|event_bundle_sequence_id|event_previous_timestamp|event_id" dbt/bcsd_data/models/bronze/ga4/bronze_ga4__events.sql

Repository: BCSDLab/KOIN_DATA

Length of output: 90


🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n "event_id|event_bundle|bundle_sequence|sequence|previous_timestamp|user_id" dbt/bcsd_data/models/bronze/ga4/bronze_ga4__events.sql

Repository: BCSDLab/KOIN_DATA

Length of output: 77


🏁 Script executed:

#!/bin/bash
set -euo pipefail
# list selected column aliases around canonical select to see what identifiers are actually present
rg -n "select\\s*$|from staged|loaded_at|event_\\w+" dbt/bcsd_data/models/bronze/ga4/bronze_ga4__events.sql

Repository: BCSDLab/KOIN_DATA

Length of output: 624


Fix the incorrect materialized='incremental' claim; keep the event_key collision risk as a design concern.
This model inherits +materialized: incremental for models: bcsd_data: bronze in dbt/bcsd_data/dbt_project.yml, so is_incremental() should work even without materialized='incremental' in this block.
However, event_key is only md5(event_timestamp|user_pseudo_id|event_name), and bronze_ga4__events.event_key is unique tested; with incremental_strategy: merge, any true source duplicates at that grain will be collapsed to one row (potentially silently losing rows).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dbt/bcsd_data/models/bronze/ga4/bronze_ga4__events.sql` around lines 1 - 7,
The config block incorrectly claims materialized='incremental' is needed; remove
that misleading claim and rely on the inherited +materialized: incremental so
is_incremental() continues to work; also add a short inline comment near the
config or the event_key definition (event_key = md5(event_timestamp ||
user_pseudo_id || event_name)) calling out the collision/data-loss risk when
using incremental_strategy: merge with a unique test on
bronze_ga4__events.event_key so reviewers know duplicates at that grain may be
collapsed silently.

Comment on lines +15 to +21
to_hex(md5(concat(
coalesce(cast(event_timestamp as string), ''),
'|',
coalesce(user_pseudo_id, ''),
'|',
coalesce(event_name, '')
))) as event_key,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

event_key is not unique enough; unique test will likely fail and the merge can drop valid rows.

(event_timestamp, user_pseudo_id, event_name) is not a natural key in GA4 — a single user can fire multiple events with the same name at the same microsecond timestamp (batched/duplicated hits). With this key:

  • the unique test on event_key in schema.yml will fail on full-refresh loads, and
  • the incremental unique_key merge will silently overwrite/drop those distinct rows.

Consider incorporating additional disambiguating fields (e.g., event_bundle_sequence_id, ga_session_id, and/or a row_number() over the natural key) into the hash.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dbt/bcsd_data/models/bronze/ga4/bronze_ga4__events.sql` around lines 15 - 21,
The current event_key hash (constructed from event_timestamp, user_pseudo_id,
event_name) is not sufficiently unique; update the event_key generation in
bronze_ga4__events.sql (the to_hex(md5(concat(...))) expression) to include
additional disambiguating identifiers such as event_bundle_sequence_id and/or
ga_session_id, or incorporate a deterministic row discriminator (e.g., a
row_number() over the natural key partition) into the concatenated string so the
resulting event_key is truly unique for each physical event; ensure this change
aligns with the unique_key used by the incremental merge and the unique test in
schema.yml.

Comment on lines +25 to +37
joined as (
select
to_hex(md5(concat(e.user_pseudo_id, '|', coalesce(sp.ga_session_id, '')))) as session_key,
e.user_pseudo_id,
sp.ga_session_id,
min(e.event_date_dt) as session_date,
min(e.event_timestamp) as session_start_at,
max(e.event_timestamp) as session_end_at,
count(*) as event_count
from events e
left join session_params sp using (event_key)
group by 1, 2, 3
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Session-less events collapse into one synthetic session per user.

coalesce(sp.ga_session_id, '') means every event without a ga_session_id for a given user_pseudo_id is hashed into the same session_key. Across days/hours this merges unrelated activity into a single "session" with inflated event_count and a session_start_at/session_end_at spanning the user's entire lifetime. Consider excluding rows with a null ga_session_id from the session build, or keying them distinctly so they aren't conflated.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dbt/bcsd_data/models/silver/ga4/silver_ga4__sessions.sql` around lines 25 -
37, The current session_key uses coalesce(sp.ga_session_id, '') which collapses
all null-session events for a user into one synthetic session; change the logic
in the joined CTE to either exclude rows where sp.ga_session_id IS NULL from the
session aggregation (i.e., add WHERE sp.ga_session_id IS NOT NULL before
grouping) or make null sessions unique per event/window by replacing
coalesce(sp.ga_session_id, '') with a distinct fallback (for example using
sp.ga_session_id || '|' || e.event_key or including a truncated
event_timestamp/event_date) so that session_key (the to_hex(md5(...))
expression) no longer conflates unrelated events for the same user_pseudo_id.

Comment thread dbt/koin/models/staging/ga4/schema.yml Outdated
Comment on lines +16 to +18
- name: user_pseudo_id
tests:
- not_null

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

Can GA4 BigQuery export events have a null user_pseudo_id under consent mode?

💡 Result:

Yes, GA4 BigQuery export events can have a null user_pseudo_id when using Google’s Advanced Consent Mode [1][2]. When a user denies consent for analytics_storage (and/or ads_storage) in Advanced Consent Mode, GA4 still collects events in the form of "cookieless pings" [1][3]. In these cases, to comply with the user's privacy choices and avoid storing pseudonymous identifiers, GA4 strips identifiable information from the event [3][4]. Specifically: - user_pseudo_id is set to null [1][5]. - ga_session_id (located in event_params) is typically null or absent [1][2]. - The privacy_info.analytics_storage field will reflect a "No" status for these events [2][4]. This behavior is unique to Advanced Consent Mode [1][3]. In Basic Consent Mode, GA4 tags do not fire at all if consent is denied, meaning no data (and therefore no null-identifier rows) is sent to BigQuery [1][3]. These "orphaned" rows with null identifiers are useful for understanding total traffic volume and consent rates, but they cannot be used for traditional user or session counting because they lack the necessary identifiers to group activity [1][2][4]. Practitioners often filter these rows out at the model layer when calculating sessions or unique users to prevent data corruption [1][5].

Citations:


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate and show the relevant schema.yml snippet with context
SCHEMA="dbt/bcsd_data/models/staging/ga4/schema.yml"
echo "=== File: $SCHEMA ==="
if [ -f "$SCHEMA" ]; then
  nl -ba "$SCHEMA" | sed -n '1,120p'
else
  echo "Missing: $SCHEMA"
fi

echo
echo "=== scripts/run_dbt.sh ==="
if [ -f "./scripts/run_dbt.sh" ]; then
  nl -ba "./scripts/run_dbt.sh" | sed -n '1,200p'
else
  echo "Missing: ./scripts/run_dbt.sh"
fi

echo
echo "=== rg: run_dbt build invocation ==="
rg -n "run_dbt\.sh build|dbt build|dbt test" -S ./scripts ./dbt 2>/dev/null || true

Repository: BCSDLab/KOIN_DATA

Length of output: 180


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate and show the relevant schema.yml snippet with context
SCHEMA="dbt/bcsd_data/models/staging/ga4/schema.yml"
echo "=== File: $SCHEMA ==="
if [ -f "$SCHEMA" ]; then
  nl -ba "$SCHEMA" | sed -n '1,120p'
else
  echo "Missing: $SCHEMA"
fi

echo
echo "=== scripts/run_dbt.sh ==="
if [ -f "./scripts/run_dbt.sh" ]; then
  nl -ba "./scripts/run_dbt.sh" | sed -n '1,200p'
else
  echo "Missing: ./scripts/run_dbt.sh"
fi

echo
echo "=== rg: run_dbt build invocation ==="
rg -n "run_dbt\.sh build|dbt build|dbt test" -S ./scripts ./dbt 2>/dev/null || true

Repository: BCSDLab/KOIN_DATA

Length of output: 180


🏁 Script executed:

#!/bin/bash
set -euo pipefail

SCHEMA="dbt/bcsd_data/models/staging/ga4/schema.yml"
echo "=== File: $SCHEMA (first 200 lines) ==="
if [ -f "$SCHEMA" ]; then
  awk '{printf "%d\t%s\n", NR, $0}' "$SCHEMA" | sed -n '1,200p'
else
  echo "Missing: $SCHEMA"
fi

echo
echo "=== scripts/run_dbt.sh (first 200 lines) ==="
if [ -f "./scripts/run_dbt.sh" ]; then
  awk '{printf "%d\t%s\n", NR, $0}' "./scripts/run_dbt.sh" | sed -n '1,200p'
else
  echo "Missing: ./scripts/run_dbt.sh"
fi

echo
echo "=== rg: dbt build/test invocation ==="
rg -n "run_dbt\.sh build|dbt build|dbt test" -S ./scripts ./dbt 2>/dev/null || true

Repository: BCSDLab/KOIN_DATA

Length of output: 865


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find the model SQL for stg_ga4__events
echo "=== rg: stg_ga4__events definition ==="
rg -n "stg_ga4__events" dbt/bcsd_data/models -S || true

echo
echo "=== rg: user_pseudo_id usage in GA4 staging ==="
rg -n "user_pseudo_id" dbt/bcsd_data/models/staging/ga4 -S || true

# Show candidate SQL files around where user_pseudo_id is selected
echo
echo "=== Candidate SQL files containing user_pseudo_id ==="
files=$(rg -l "user_pseudo_id" dbt/bcsd_data/models/staging/ga4 -S || true)
for f in $files; do
  echo "--- $f (first 200 lines) ---"
  awk '{printf "%d\t%s\n", NR, $0}' "$f" | sed -n '1,200p'
  echo
done

Repository: BCSDLab/KOIN_DATA

Length of output: 1882


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== dbt_project.yml ==="
if [ -f "dbt/bcsd_data/dbt_project.yml" ]; then
  awk '{printf "%d\t%s\n", NR, $0}' dbt/bcsd_data/dbt_project.yml | sed -n '1,220p'
else
  echo "Missing: dbt/bcsd_data/dbt_project.yml"
fi

echo
echo "=== rg: 'severity' in dbt project ==="
rg -n "severity:" dbt/bcsd_data -S || true

echo
echo "=== rg: default test severity / config overrides ==="
rg -n "(tests:|test:|test\s*:|on-run-end|fail|warn)" dbt/bcsd_data -S || true

echo
echo "=== stg_ga4__events source SQL macro (ga4_events_source_sql) usage ==="
rg -n "ga4_events_source_sql" dbt/bcsd_data -S || true

Repository: BCSDLab/KOIN_DATA

Length of output: 2032


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== stg_ga4__events.sql (full file) ==="
awk '{print}' dbt/bcsd_data/models/staging/ga4/stg_ga4__events.sql | sed -n '1,200p'

echo
echo "=== Macro: dbt/bcsd_data/macros/ga4_events_source.sql (full file) ==="
awk '{print}' dbt/bcsd_data/macros/ga4_events_source.sql | sed -n '1,220p'

echo
echo "=== rg: null/filter/coalesce for user_pseudo_id in GA4 models ==="
rg -n "user_pseudo_id\s*(is\s+null|=+\s*null|!=\s*null)|coalesce\(\s*user_pseudo_id|ifnull\(\s*user_pseudo_id|nullif\(\s*user_pseudo_id" dbt/bcsd_data/models/staging/ga4 dbt/bcsd_data/models/bronze/ga4 dbt/bcsd_data/models/silver/ga4 -S || true

echo
echo "=== stg/bronze/silver schema.yml for user_pseudo_id tests ==="
for f in dbt/bcsd_data/models/*/ga4/schema.yml; do
  echo "--- $f ---"
  awk '{print}' "$f" | sed -n '1,120p'
done

Repository: BCSDLab/KOIN_DATA

Length of output: 3385


Relax not_null on stg_ga4__events.user_pseudo_id

      - name: user_pseudo_id
        tests:
          - not_null

GA4 Advanced Consent Mode can generate “cookieless pings” where user_pseudo_id is NULL, and stg_ga4__events.sql selects user_pseudo_id directly (no null-handling). This not_null test can therefore fail ./scripts/run_dbt.sh build.

Change the test to warn (e.g., not_null: { severity: warn }) or remove it from staging.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dbt/bcsd_data/models/staging/ga4/schema.yml` around lines 16 - 18, The
current schema test enforces not_null on stg_ga4__events.user_pseudo_id which
fails when GA4 Advanced Consent Mode emits cookieless pings; update the test in
schema.yml for the field user_pseudo_id (under the stg_ga4__events model) to
relax the constraint—either remove the not_null test or change it to a warning
(e.g., set not_null: { severity: warn }) so null user_pseudo_id values no longer
fail the build.

Comment thread docker-compose.yml
Comment on lines +58 to +73
airflow-webserver:
<<: *airflow-common
command: webserver
ports:
- "8080:8080"
healthcheck:
test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
interval: 30s
timeout: 10s
retries: 5
restart: unless-stopped

airflow-scheduler:
<<: *airflow-common
command: scheduler
restart: unless-stopped

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Webserver and scheduler can start before airflow-init finishes.

Both services inherit only depends_on: postgres (service_healthy) from x-airflow-common. They don't wait for airflow-init to complete airflow db migrate and admin creation, so the scheduler/webserver may boot against an unmigrated metadata DB and crash-loop on first startup. Add an explicit dependency on airflow-init completing successfully.

🔧 Proposed fix
   airflow-webserver:
     <<: *airflow-common
     command: webserver
+    depends_on:
+      postgres:
+        condition: service_healthy
+      airflow-init:
+        condition: service_completed_successfully
     ports:
       - "8080:8080"
   airflow-scheduler:
     <<: *airflow-common
     command: scheduler
+    depends_on:
+      postgres:
+        condition: service_healthy
+      airflow-init:
+        condition: service_completed_successfully
     restart: unless-stopped
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
airflow-webserver:
<<: *airflow-common
command: webserver
ports:
- "8080:8080"
healthcheck:
test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
interval: 30s
timeout: 10s
retries: 5
restart: unless-stopped
airflow-scheduler:
<<: *airflow-common
command: scheduler
restart: unless-stopped
airflow-webserver:
<<: *airflow-common
command: webserver
depends_on:
postgres:
condition: service_healthy
airflow-init:
condition: service_completed_successfully
ports:
- "8080:8080"
healthcheck:
test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
interval: 30s
timeout: 10s
retries: 5
restart: unless-stopped
airflow-scheduler:
<<: *airflow-common
command: scheduler
depends_on:
postgres:
condition: service_healthy
airflow-init:
condition: service_completed_successfully
restart: unless-stopped
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docker-compose.yml` around lines 58 - 73, The webserver and scheduler
services currently only depend on the database health and can start before the
one-time init job finishes; update both airflow-webserver and airflow-scheduler
service blocks to add an explicit depends_on entry for airflow-init with the
service_completed_successfully condition (e.g., depends_on: airflow-init:
condition: service_completed_successfully) so they wait for the init job
(airflow db migrate + admin creation) to finish before starting; keep the
existing postgres service_healthy dependency as well.

@@ -0,0 +1,9 @@
import os

SECRET_KEY = os.environ.get("SUPERSET_SECRET_KEY", "change_me")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid an insecure default SECRET_KEY.

SECRET_KEY signs sessions and CSRF tokens; falling back to a hardcoded "change_me" means any deployment that forgets to set SUPERSET_SECRET_KEY runs with a publicly known key, allowing session forgery. Fail fast instead of silently defaulting.

🔒 Proposed fix
-SECRET_KEY = os.environ.get("SUPERSET_SECRET_KEY", "change_me")
+SECRET_KEY = os.environ["SUPERSET_SECRET_KEY"]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
SECRET_KEY = os.environ.get("SUPERSET_SECRET_KEY", "change_me")
SECRET_KEY = os.environ["SUPERSET_SECRET_KEY"]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docker/superset/superset_config.py` at line 3, SECRET_KEY currently defaults
to an insecure literal; change behavior in superset_config.py so that SECRET_KEY
is obtained from os.environ["SUPERSET_SECRET_KEY"] (or explicitly validated) and
the process fails fast with a clear error if the variable is missing/empty
instead of falling back to "change_me"; update the SECRET_KEY assignment and add
a check that raises a RuntimeError (or exits) with a descriptive message
referencing SUPERSET_SECRET_KEY so deployments cannot accidentally run with a
public key.

#!/usr/bin/env bash
set -euo pipefail

BACKUP_DIR="${BACKUP_DIR:-backups/postgres}"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Rename backup dir env var to use the required BCSD_DATA_ prefix.

BACKUP_DIR does not follow the required configuration variable prefix convention.

Suggested fix
-BACKUP_DIR="${BACKUP_DIR:-backups/postgres}"
+BACKUP_DIR="${BCSD_DATA_BACKUP_DIR:-${BACKUP_DIR:-backups/postgres}}"

As per coding guidelines, "**/*.{sh,py,dockerfile,yaml,yml,.env.example}: Use BCSD_DATA_ as the environment variable prefix for all configuration".

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
BACKUP_DIR="${BACKUP_DIR:-backups/postgres}"
BACKUP_DIR="${BCSD_DATA_BACKUP_DIR:-${BACKUP_DIR:-backups/postgres}}"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/backup_postgres.sh` at line 4, The BACKUP_DIR env var doesn't follow
the required BCSD_DATA_ prefix; rename the variable to BCSD_DATA_BACKUP_DIR and
update its default assignment (replace
BACKUP_DIR="${BACKUP_DIR:-backups/postgres}" with
BCSD_DATA_BACKUP_DIR="${BCSD_DATA_BACKUP_DIR:-backups/postgres}"), then replace
all uses of BACKUP_DIR in the script with BCSD_DATA_BACKUP_DIR and update any
related docs/env examples to the new variable name.

Comment thread scripts/run_dbt.sh Outdated
COMMAND="${1:-build}"
shift || true

docker compose run --rm airflow-webserver bash -lc "cd /opt/airflow/dbt/bcsd_data && dbt ${COMMAND} $*"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Use a relative dbt project path in the script.

The hardcoded /opt/airflow/dbt/bcsd_data path violates the repository script rule to prefer relative paths in scripts/config.

Suggested fix
-docker compose run --rm airflow-webserver \
-  dbt "${COMMAND}" "$@" --project-dir /opt/airflow/dbt/bcsd_data
+DBT_PROJECT_DIR="${BCSD_DATA_DBT_PROJECT_DIR:-dbt/bcsd_data}"
+docker compose run --rm airflow-webserver \
+  dbt "${COMMAND}" "$@" --project-dir "${DBT_PROJECT_DIR}"

As per coding guidelines, "**/{scripts,airflow,dbt,config}/**/*.{sh,py,yaml,yml,conf,json}: Prefer relative paths inside scripts and config files".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/run_dbt.sh` at line 7, The docker compose invocation in
scripts/run_dbt.sh hardcodes /opt/airflow/dbt/bcsd_data; change it to use a path
relative to the repository/script location instead: compute the script directory
(use the directory of scripts/run_dbt.sh as the base) and build a relative path
to dbt/bcsd_data, then use that variable in the docker compose run ... bash -lc
"cd <relative-path> && dbt ${COMMAND} $*"; update the docker compose command to
reference this relative-path variable (and keep using the existing COMMAND and
argument expansion).

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify unsafe interpolation pattern exists in script (read-only).
rg -nP --type=sh '\bbash -lc ".*\$\{COMMAND\}.*\$\*' scripts/run_dbt.sh

Repository: BCSDLab/KOIN_DATA

Length of output: 167


Harden dbt invocation to prevent argument injection and word-splitting.

scripts/run_dbt.sh interpolates unquoted ${COMMAND} and $* into a bash -lc "..." command string, so user-provided whitespace/metacharacters can alter parsing/execution. Pass arguments as argv ("$@") and avoid nested shell interpolation.

Suggested fix
-COMMAND="${1:-build}"
-shift || true
-
-docker compose run --rm airflow-webserver bash -lc "cd /opt/airflow/dbt/bcsd_data &amp;&amp; dbt ${COMMAND} $*"
+COMMAND="${1:-build}"
+shift || true
+
+docker compose run --rm airflow-webserver \
+  dbt "${COMMAND}" "$@" --project-dir /opt/airflow/dbt/bcsd_data
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
docker compose run --rm airflow-webserver bash -lc "cd /opt/airflow/dbt/bcsd_data && dbt ${COMMAND} $*"
COMMAND="${1:-build}"
shift || true
docker compose run --rm airflow-webserver \
dbt "${COMMAND}" "$@" --project-dir /opt/airflow/dbt/bcsd_data
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/run_dbt.sh` at line 7, The current line embeds unquoted ${COMMAND}
and $* into a double-quoted bash -lc string, enabling argument injection and
word-splitting; change the invocation to pass arguments as argv and quote them:
stop interpolating ${COMMAND} and $* inside the command string, use "$COMMAND"
and "$@" (not $*) as positional parameters, call bash -lc with a single-quoted
command that uses exec dbt "$@" (or cd ... && exec dbt "$@"), and pass the
command and remaining args after a -- so the container receives them as argv
rather than being re-parsed by the outer shell.

@RimSM RimSM force-pushed the chore/data-platform-skeleton branch from 9a5dd0a to 14dfe1e Compare May 31, 2026 12:34
Comment thread airflow/dags/ga4_hourly_dbt.py Outdated
task_id="dbt_build_ga4",
bash_command=(
"cd /opt/airflow/dbt/koin && "
"dbt build --select tag:ga4 --vars '{lookback_hours: 6}'"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dbt에서 중복 알아서 처리 해주나? 이러면 중복 적재 있을거같아서 물어봄

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pk 기준으로 머지하면 문제 없음

@RimSM RimSM force-pushed the chore/data-platform-skeleton branch from 14dfe1e to 6df9e15 Compare May 31, 2026 12:50
@RimSM RimSM force-pushed the chore/data-platform-skeleton branch from 6df9e15 to 66be308 Compare May 31, 2026 12:58
@RimSM RimSM merged commit e81ac9e into main May 31, 2026
1 check passed
@RimSM RimSM deleted the chore/data-platform-skeleton branch May 31, 2026 13:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants