Add GA4 data platform skeleton#6
Conversation
|
Warning Review limit reached
More reviews will be available in 25 minutes and 14 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (30)
📝 WalkthroughWalkthroughThis PR establishes a complete GA4 data platform monorepo. It adds dbt project configuration, a multi-layer data transformation pipeline (staging→bronze→silver→marts), Docker Compose infrastructure for Airflow and Superset, and operational scripts. The stack enables hourly GA4 event ingestion from BigQuery through incremental dbt models into analytics-ready tables. ChangesGA4 Data Platform Setup
Estimated Code Review Effort🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 10
🧹 Nitpick comments (7)
dbt/bcsd_data/models/sources/ga4_sources.yml (1)
8-14: ⚖️ Poor tradeoffDeclared source is not actually referenced; lineage is bypassed.
The
ga4source is defined here, butga4_events_source_sql()queries`{{ project }}.{{ dataset }}.events_*`directly rather than via{{ source('ga4', 'events') }}. As a result dbt's lineage graph, source freshness checks, and theDownstream models should not query raw GA4 export tables directlyguideline are not enforced for this path. Consider routing the wildcard scan through the declared source so the dependency is tracked.As per coding guidelines: "Downstream models should not query raw GA4 export tables directly."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@dbt/bcsd_data/models/sources/ga4_sources.yml` around lines 8 - 14, The ga4 source is declared but ga4_events_source_sql() currently queries `{{ project }}.{{ dataset }}.events_*` directly, bypassing dbt lineage and freshness; update ga4_events_source_sql() to reference the declared source by using {{ source('ga4', 'events') }} (and similarly handle intraday if applicable) and construct the wildcard scan from that source reference instead of hardcoding project/dataset/table patterns so dbt tracks the dependency and enforces source checks.dbt/bcsd_data/macros/ga4_events_source.sql (1)
4-10: 💤 Low valueLookback conversion drops sub-day granularity; intended as a buffer?
(lookback_hours | int) // 24) + 3floors any value under 24h to 0, so the effective table-pruning window is always at least 3 days regardless oflookback_hours. This is fine as a partition-pruning safety margin (the precise hour filter happens downstream in bronze), but the relationship is non-obvious. A short comment would help future maintainers.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@dbt/bcsd_data/macros/ga4_events_source.sql` around lines 4 - 10, The lookback_days calculation (variables lookback_hours and lookback_days) intentionally floors lookback_hours to days via ((lookback_hours | int) // 24) + 3 which causes any <24h value to become 3 days as a safety margin; add a short clarifying comment above these lines explaining that this behavior is deliberate (to ensure partition-pruning safety) and that finer hourly filtering is applied later in the bronze pipeline, so maintainers understand the apparent mismatch between lookback_hours and table-pruning window.dbt/bcsd_data/models/silver/ga4/schema.yml (1)
25-28: ⚡ Quick winAdd
event_keytests forsilver_ga4__pageviewsandsilver_ga4__conversions.Both models declare
unique_key="event_key"but, unlike the other silver models, have nonot_null/uniquetests. Adding them validates the incremental grain and catches duplicate keys early.♻️ Add column tests
- name: silver_ga4__pageviews description: "Page view events with page parameters." + columns: + - name: event_key + tests: + - not_null + - unique - name: silver_ga4__conversions description: "Conversion events configured by `conversion_events` var." + columns: + - name: event_key + tests: + - not_null + - unique🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@dbt/bcsd_data/models/silver/ga4/schema.yml` around lines 25 - 28, The schema.yml is missing column tests for event_key on models silver_ga4__pageviews and silver_ga4__conversions even though both use unique_key="event_key"; add dbt column tests by adding not_null and unique tests for the column event_key under each model entry (silver_ga4__pageviews and silver_ga4__conversions) so duplicates and nulls are caught during build and validate the incremental grain.dbt/bcsd_data/models/silver/ga4/silver_ga4__pageviews.sql (1)
9-37: ⚡ Quick winNo incremental filter → full rebuild of pageviews each hourly run.
This model carries an incremental config (
unique_key,partition_by) but, unlikesilver_ga4__events, has nois_incremental()lookback guard. On every hourly run it scans the entiresilver_ga4__eventstable and pivots the fullsilver_ga4__event_paramstable, which scans all params for all event types (not justpage_view). On BigQuery this is a recurring full-table cost that grows with history. Since this is a 1:1 row mapping from events, add a lookback filter mirroring the upstream model.♻️ Add incremental lookback and constrain the param pivot
with events as ( select * from {{ ref('silver_ga4__events') }} + {% if is_incremental() %} + where event_timestamp >= timestamp_sub(current_timestamp(), interval {{ var('lookback_hours', 6) }} hour) + {% endif %} ), params as ( select * from {{ ref('silver_ga4__event_params') }} + {% if is_incremental() %} + where event_timestamp >= timestamp_sub(current_timestamp(), interval {{ var('lookback_hours', 6) }} hour) + {% endif %} ),🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@dbt/bcsd_data/models/silver/ga4/silver_ga4__pageviews.sql` around lines 9 - 37, The model lacks an is_incremental() lookback so hourly runs rebuild/pivot all events and params; update the SQL to apply the same incremental filter used in silver_ga4__events: in the events CTE add a guard like where (not is_incremental() or event_date_dt >= date_sub(current_date(), interval 1 day)) (or mirror the exact partition lookback used upstream), and restrict the pivoted CTE to only params for those events (e.g., build pivoted from params join events using event_key or add where param_event_key in (select event_key from events) and also filter param_key to only the three page_* keys) so only recent page_view events and their params are scanned during incremental runs.dbt/bcsd_data/models/marts/schema.yml (1)
3-9: ⚡ Quick winAdd basic column tests for mart outputs.
Descriptions are helpful, but marts feeding BI should include at least
not_nulltests on date/metric columns to catch silent data regressions early.Proposed schema test scaffold
version: 2 models: - name: mart_ga4__daily_events description: "Daily event counts for Superset." + columns: + - name: event_date_dt + tests: [not_null] + - name: event_count + tests: [not_null] - name: mart_ga4__daily_pageviews description: "Daily pageview counts for Superset." + columns: + - name: event_date_dt + tests: [not_null] + - name: pageview_count + tests: [not_null] + - name: user_count + tests: [not_null] - name: mart_ga4__daily_conversions description: "Daily conversion counts for Superset." + columns: + - name: event_date_dt + tests: [not_null] + - name: conversion_count + tests: [not_null] + - name: user_count + tests: [not_null]Based on learnings: "Superset should read only
martsor gold tables from dbt".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@dbt/bcsd_data/models/marts/schema.yml` around lines 3 - 9, Add basic dbt column tests (not_null) to each mart model to prevent silent regressions: for mart_ga4__daily_events, mart_ga4__daily_pageviews, and mart_ga4__daily_conversions, declare a columns block in schema.yml and add not_null tests for the date column and for the key metric columns (e.g., event count, pageviews, conversions — replace with the actual metric column names used in each model). Ensure each model entry uses the exact model names mart_ga4__daily_events, mart_ga4__daily_pageviews, mart_ga4__daily_conversions and add tests under columns: - name: <date_column> tests: - not_null and - name: <metric_column> tests: - not_null so CI/dbt will fail when these critical fields are null.docker-compose.yml (1)
11-13: ⚡ Quick winCelery result backend is set but no Celery executor/broker is configured.
AIRFLOW__CELERY__RESULT_BACKENDis defined, yet there is noAIRFLOW__CORE__EXECUTOR, broker, or worker service. Airflow defaults toSequentialExecutor, so this setting is unused and misleading. For a Postgres-backed single-host MVP, setLocalExecutorexplicitly (and drop the unused Celery backend) to get parallel task execution and a clear config intent.♻️ Suggested change
environment: + AIRFLOW__CORE__EXECUTOR: LocalExecutor AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://${BCSD_DATA_POSTGRES_USER:-bcsd}:${BCSD_DATA_POSTGRES_PASSWORD:-change_me}`@postgres`:5432/${BCSD_DATA_AIRFLOW_DB:-airflow_metadata} - AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://${BCSD_DATA_POSTGRES_USER:-bcsd}:${BCSD_DATA_POSTGRES_PASSWORD:-change_me}`@postgres`:5432/${BCSD_DATA_AIRFLOW_DB:-airflow_metadata} DBT_PROFILES_DIR: /opt/airflow/dbt/bcsd_data🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docker-compose.yml` around lines 11 - 13, The compose environment sets AIRFLOW__CELERY__RESULT_BACKEND but no Celery broker/executor/worker exists; remove the unused AIRFLOW__CELERY__RESULT_BACKEND entry and instead explicitly set AIRFLOW__CORE__EXECUTOR=LocalExecutor next to AIRFLOW__DATABASE__SQL_ALCHEMY_CONN so Airflow runs with LocalExecutor for parallel tasks; update the service environment to drop the Celery backend variable and add the AIRFLOW__CORE__EXECUTOR key (referencing the AIRFLOW__CELERY__RESULT_BACKEND and AIRFLOW__CORE__EXECUTOR variables to find where to change).airflow/dags/ga4_hourly_dbt.py (1)
18-29: ⚡ Quick winAdd retries and an execution timeout for the dbt tasks.
Both
BashOperators invoke external systems (dbt depsfetches packages over the network;dbt buildhits BigQuery). Withoutretries/retry_delaya transient network or BQ hiccup fails the hourly run, and withoutexecution_timeouta hung dbt invocation can block the single allowed active run indefinitely (max_active_runs=1). Adddefault_argsto the DAG.♻️ Suggested change
+from datetime import datetime, timedelta -from datetime import datetime ... with DAG( dag_id="ga4_hourly_dbt", description="Run dbt models for recent GA4 BigQuery export data.", start_date=datetime(2026, 1, 1), schedule="`@hourly`", catchup=False, max_active_runs=1, + default_args={ + "retries": 2, + "retry_delay": timedelta(minutes=5), + "execution_timeout": timedelta(minutes=30), + }, tags=["ga4", "dbt", "bigquery"], ) as dag:🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@airflow/dags/ga4_hourly_dbt.py` around lines 18 - 29, Add durable retry and timeout settings by defining default_args on the DAG and applying them to the dbt BashOperators: create a default_args dict (e.g., include retries, retry_delay using datetime.timedelta, and execution_timeout using datetime.timedelta) and pass it into the DAG constructor (or add the same keys to the dbt_deps and dbt_build operator instantiations). Specifically ensure dbt_deps and dbt_build get retries (e.g., 1–3), a sensible retry_delay (e.g., 5 minutes) and an execution_timeout (e.g., 15–30 minutes) so transient network/BigQuery failures and hung dbt calls don’t kill or block hourly runs (respecting max_active_runs=1).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.env.example:
- Around line 14-31: Update the .env.example to use the mandated BCSD_DATA_
prefix for all configuration keys instead of raw AIRFLOW_* and SUPERSET_* names
(e.g., replace AIRFLOW__CORE__EXECUTOR,
AIRFLOW_ADMIN_USERNAME/PASSWORD/FIRSTNAME/LASTNAME/EMAIL and
SUPERSET_SECRET_KEY, SUPERSET_ADMIN_* with BCSD_DATA_ prefixed equivalents), and
ensure the Compose/startup layer maps those BCSD_DATA_* env vars back to the
native names when launching services (so look for places that read AIRFLOW__* or
SUPERSET_* and change them to use BCSD_DATA_* mapping at service startup).
In `@dbt/bcsd_data/models/bronze/ga4/bronze_ga4__events.sql`:
- Around line 15-21: The current event_key hash (constructed from
event_timestamp, user_pseudo_id, event_name) is not sufficiently unique; update
the event_key generation in bronze_ga4__events.sql (the to_hex(md5(concat(...)))
expression) to include additional disambiguating identifiers such as
event_bundle_sequence_id and/or ga_session_id, or incorporate a deterministic
row discriminator (e.g., a row_number() over the natural key partition) into the
concatenated string so the resulting event_key is truly unique for each physical
event; ensure this change aligns with the unique_key used by the incremental
merge and the unique test in schema.yml.
- Around line 1-7: The config block incorrectly claims
materialized='incremental' is needed; remove that misleading claim and rely on
the inherited +materialized: incremental so is_incremental() continues to work;
also add a short inline comment near the config or the event_key definition
(event_key = md5(event_timestamp || user_pseudo_id || event_name)) calling out
the collision/data-loss risk when using incremental_strategy: merge with a
unique test on bronze_ga4__events.event_key so reviewers know duplicates at that
grain may be collapsed silently.
In `@dbt/bcsd_data/models/silver/ga4/silver_ga4__sessions.sql`:
- Around line 25-37: The current session_key uses coalesce(sp.ga_session_id, '')
which collapses all null-session events for a user into one synthetic session;
change the logic in the joined CTE to either exclude rows where sp.ga_session_id
IS NULL from the session aggregation (i.e., add WHERE sp.ga_session_id IS NOT
NULL before grouping) or make null sessions unique per event/window by replacing
coalesce(sp.ga_session_id, '') with a distinct fallback (for example using
sp.ga_session_id || '|' || e.event_key or including a truncated
event_timestamp/event_date) so that session_key (the to_hex(md5(...))
expression) no longer conflates unrelated events for the same user_pseudo_id.
In `@dbt/bcsd_data/models/staging/ga4/schema.yml`:
- Around line 16-18: The current schema test enforces not_null on
stg_ga4__events.user_pseudo_id which fails when GA4 Advanced Consent Mode emits
cookieless pings; update the test in schema.yml for the field user_pseudo_id
(under the stg_ga4__events model) to relax the constraint—either remove the
not_null test or change it to a warning (e.g., set not_null: { severity: warn })
so null user_pseudo_id values no longer fail the build.
In `@docker-compose.yml`:
- Around line 58-73: The webserver and scheduler services currently only depend
on the database health and can start before the one-time init job finishes;
update both airflow-webserver and airflow-scheduler service blocks to add an
explicit depends_on entry for airflow-init with the
service_completed_successfully condition (e.g., depends_on: airflow-init:
condition: service_completed_successfully) so they wait for the init job
(airflow db migrate + admin creation) to finish before starting; keep the
existing postgres service_healthy dependency as well.
In `@docker/superset/superset_config.py`:
- Line 3: SECRET_KEY currently defaults to an insecure literal; change behavior
in superset_config.py so that SECRET_KEY is obtained from
os.environ["SUPERSET_SECRET_KEY"] (or explicitly validated) and the process
fails fast with a clear error if the variable is missing/empty instead of
falling back to "change_me"; update the SECRET_KEY assignment and add a check
that raises a RuntimeError (or exits) with a descriptive message referencing
SUPERSET_SECRET_KEY so deployments cannot accidentally run with a public key.
In `@scripts/backup_postgres.sh`:
- Line 4: The BACKUP_DIR env var doesn't follow the required BCSD_DATA_ prefix;
rename the variable to BCSD_DATA_BACKUP_DIR and update its default assignment
(replace BACKUP_DIR="${BACKUP_DIR:-backups/postgres}" with
BCSD_DATA_BACKUP_DIR="${BCSD_DATA_BACKUP_DIR:-backups/postgres}"), then replace
all uses of BACKUP_DIR in the script with BCSD_DATA_BACKUP_DIR and update any
related docs/env examples to the new variable name.
In `@scripts/run_dbt.sh`:
- Line 7: The docker compose invocation in scripts/run_dbt.sh hardcodes
/opt/airflow/dbt/bcsd_data; change it to use a path relative to the
repository/script location instead: compute the script directory (use the
directory of scripts/run_dbt.sh as the base) and build a relative path to
dbt/bcsd_data, then use that variable in the docker compose run ... bash -lc "cd
<relative-path> && dbt ${COMMAND} $*"; update the docker compose command to
reference this relative-path variable (and keep using the existing COMMAND and
argument expansion).
- Line 7: The current line embeds unquoted ${COMMAND} and $* into a
double-quoted bash -lc string, enabling argument injection and word-splitting;
change the invocation to pass arguments as argv and quote them: stop
interpolating ${COMMAND} and $* inside the command string, use "$COMMAND" and
"$@" (not $*) as positional parameters, call bash -lc with a single-quoted
command that uses exec dbt "$@" (or cd ... && exec dbt "$@"), and pass the
command and remaining args after a -- so the container receives them as argv
rather than being re-parsed by the outer shell.
---
Nitpick comments:
In `@airflow/dags/ga4_hourly_dbt.py`:
- Around line 18-29: Add durable retry and timeout settings by defining
default_args on the DAG and applying them to the dbt BashOperators: create a
default_args dict (e.g., include retries, retry_delay using datetime.timedelta,
and execution_timeout using datetime.timedelta) and pass it into the DAG
constructor (or add the same keys to the dbt_deps and dbt_build operator
instantiations). Specifically ensure dbt_deps and dbt_build get retries (e.g.,
1–3), a sensible retry_delay (e.g., 5 minutes) and an execution_timeout (e.g.,
15–30 minutes) so transient network/BigQuery failures and hung dbt calls don’t
kill or block hourly runs (respecting max_active_runs=1).
In `@dbt/bcsd_data/macros/ga4_events_source.sql`:
- Around line 4-10: The lookback_days calculation (variables lookback_hours and
lookback_days) intentionally floors lookback_hours to days via ((lookback_hours
| int) // 24) + 3 which causes any <24h value to become 3 days as a safety
margin; add a short clarifying comment above these lines explaining that this
behavior is deliberate (to ensure partition-pruning safety) and that finer
hourly filtering is applied later in the bronze pipeline, so maintainers
understand the apparent mismatch between lookback_hours and table-pruning
window.
In `@dbt/bcsd_data/models/marts/schema.yml`:
- Around line 3-9: Add basic dbt column tests (not_null) to each mart model to
prevent silent regressions: for mart_ga4__daily_events,
mart_ga4__daily_pageviews, and mart_ga4__daily_conversions, declare a columns
block in schema.yml and add not_null tests for the date column and for the key
metric columns (e.g., event count, pageviews, conversions — replace with the
actual metric column names used in each model). Ensure each model entry uses the
exact model names mart_ga4__daily_events, mart_ga4__daily_pageviews,
mart_ga4__daily_conversions and add tests under columns: - name: <date_column>
tests: - not_null and - name: <metric_column> tests: - not_null so CI/dbt will
fail when these critical fields are null.
In `@dbt/bcsd_data/models/silver/ga4/schema.yml`:
- Around line 25-28: The schema.yml is missing column tests for event_key on
models silver_ga4__pageviews and silver_ga4__conversions even though both use
unique_key="event_key"; add dbt column tests by adding not_null and unique tests
for the column event_key under each model entry (silver_ga4__pageviews and
silver_ga4__conversions) so duplicates and nulls are caught during build and
validate the incremental grain.
In `@dbt/bcsd_data/models/silver/ga4/silver_ga4__pageviews.sql`:
- Around line 9-37: The model lacks an is_incremental() lookback so hourly runs
rebuild/pivot all events and params; update the SQL to apply the same
incremental filter used in silver_ga4__events: in the events CTE add a guard
like where (not is_incremental() or event_date_dt >= date_sub(current_date(),
interval 1 day)) (or mirror the exact partition lookback used upstream), and
restrict the pivoted CTE to only params for those events (e.g., build pivoted
from params join events using event_key or add where param_event_key in (select
event_key from events) and also filter param_key to only the three page_* keys)
so only recent page_view events and their params are scanned during incremental
runs.
In `@dbt/bcsd_data/models/sources/ga4_sources.yml`:
- Around line 8-14: The ga4 source is declared but ga4_events_source_sql()
currently queries `{{ project }}.{{ dataset }}.events_*` directly, bypassing dbt
lineage and freshness; update ga4_events_source_sql() to reference the declared
source by using {{ source('ga4', 'events') }} (and similarly handle intraday if
applicable) and construct the wildcard scan from that source reference instead
of hardcoding project/dataset/table patterns so dbt tracks the dependency and
enforces source checks.
In `@docker-compose.yml`:
- Around line 11-13: The compose environment sets
AIRFLOW__CELERY__RESULT_BACKEND but no Celery broker/executor/worker exists;
remove the unused AIRFLOW__CELERY__RESULT_BACKEND entry and instead explicitly
set AIRFLOW__CORE__EXECUTOR=LocalExecutor next to
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN so Airflow runs with LocalExecutor for
parallel tasks; update the service environment to drop the Celery backend
variable and add the AIRFLOW__CORE__EXECUTOR key (referencing the
AIRFLOW__CELERY__RESULT_BACKEND and AIRFLOW__CORE__EXECUTOR variables to find
where to change).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 3709be80-f7a1-4a19-b866-33971abab53d
📒 Files selected for processing (46)
.env.example.gitignoreAGENTS.mdREADME.mdairflow/dags/ga4_hourly_dbt.pyairflow/plugins/.gitkeepdbt/bcsd_data/analyses/.gitkeepdbt/bcsd_data/dbt_project.ymldbt/bcsd_data/macros/.gitkeepdbt/bcsd_data/macros/ga4_events_source.sqldbt/bcsd_data/macros/generate_schema_name.sqldbt/bcsd_data/models/bronze/ga4/bronze_ga4__events.sqldbt/bcsd_data/models/bronze/ga4/schema.ymldbt/bcsd_data/models/marts/ecommerce/mart_ga4__daily_conversions.sqldbt/bcsd_data/models/marts/marketing/mart_ga4__daily_events.sqldbt/bcsd_data/models/marts/product/mart_ga4__daily_pageviews.sqldbt/bcsd_data/models/marts/schema.ymldbt/bcsd_data/models/silver/ga4/schema.ymldbt/bcsd_data/models/silver/ga4/silver_ga4__conversions.sqldbt/bcsd_data/models/silver/ga4/silver_ga4__event_params.sqldbt/bcsd_data/models/silver/ga4/silver_ga4__events.sqldbt/bcsd_data/models/silver/ga4/silver_ga4__pageviews.sqldbt/bcsd_data/models/silver/ga4/silver_ga4__sessions.sqldbt/bcsd_data/models/silver/ga4/silver_ga4__users.sqldbt/bcsd_data/models/sources/ga4_sources.ymldbt/bcsd_data/models/staging/ga4/schema.ymldbt/bcsd_data/models/staging/ga4/stg_ga4__events.sqldbt/bcsd_data/packages.ymldbt/bcsd_data/profiles.yml.exampledbt/bcsd_data/seeds/.gitkeepdbt/bcsd_data/snapshots/.gitkeepdbt/bcsd_data/tests/.gitkeepdocker-compose.ymldocker/airflow/Dockerfiledocker/airflow/requirements.txtdocker/postgres/init-databases.shdocker/superset/Dockerfiledocker/superset/requirements.txtdocker/superset/superset_config.pydocs/.gitkeepdocs/architecture.mddocs/operations.mdscripts/.gitkeepscripts/backup_postgres.shscripts/init_airflow.shscripts/run_dbt.sh
| AIRFLOW_UID=50000 | ||
| AIRFLOW__CORE__EXECUTOR=LocalExecutor | ||
| AIRFLOW__CORE__LOAD_EXAMPLES=false | ||
| AIRFLOW__WEBSERVER__EXPOSE_CONFIG=false | ||
| AIRFLOW__WEBSERVER__SECRET_KEY=change_me | ||
| AIRFLOW_ADMIN_USERNAME=admin | ||
| AIRFLOW_ADMIN_PASSWORD=admin | ||
| AIRFLOW_ADMIN_FIRSTNAME=BCSD | ||
| AIRFLOW_ADMIN_LASTNAME=Admin | ||
| AIRFLOW_ADMIN_EMAIL=admin@example.com | ||
|
|
||
| # Superset | ||
| SUPERSET_SECRET_KEY=change_me | ||
| SUPERSET_ADMIN_USERNAME=admin | ||
| SUPERSET_ADMIN_PASSWORD=admin | ||
| SUPERSET_ADMIN_FIRSTNAME=BCSD | ||
| SUPERSET_ADMIN_LASTNAME=Admin | ||
| SUPERSET_ADMIN_EMAIL=admin@example.com |
There was a problem hiding this comment.
Unprefixed configuration variables break the repo env naming rule.
AIRFLOW_* / SUPERSET_* keys in this template violate the required BCSD_DATA_ prefix convention for configuration. Please expose prefixed config keys here and map them to tool-native env names where they are consumed (e.g., Compose/service startup layer).
As per coding guidelines: "Use BCSD_DATA_ as the environment variable prefix for all configuration".
🧰 Tools
🪛 dotenv-linter (4.0.0)
[warning] 19-19: [UnorderedKey] The AIRFLOW_ADMIN_USERNAME key should go before the AIRFLOW_UID key
(UnorderedKey)
[warning] 20-20: [UnorderedKey] The AIRFLOW_ADMIN_PASSWORD key should go before the AIRFLOW_ADMIN_USERNAME key
(UnorderedKey)
[warning] 21-21: [UnorderedKey] The AIRFLOW_ADMIN_FIRSTNAME key should go before the AIRFLOW_ADMIN_PASSWORD key
(UnorderedKey)
[warning] 22-22: [UnorderedKey] The AIRFLOW_ADMIN_LASTNAME key should go before the AIRFLOW_ADMIN_PASSWORD key
(UnorderedKey)
[warning] 23-23: [UnorderedKey] The AIRFLOW_ADMIN_EMAIL key should go before the AIRFLOW_ADMIN_FIRSTNAME key
(UnorderedKey)
[warning] 27-27: [UnorderedKey] The SUPERSET_ADMIN_USERNAME key should go before the SUPERSET_SECRET_KEY key
(UnorderedKey)
[warning] 28-28: [UnorderedKey] The SUPERSET_ADMIN_PASSWORD key should go before the SUPERSET_ADMIN_USERNAME key
(UnorderedKey)
[warning] 29-29: [UnorderedKey] The SUPERSET_ADMIN_FIRSTNAME key should go before the SUPERSET_ADMIN_PASSWORD key
(UnorderedKey)
[warning] 30-30: [UnorderedKey] The SUPERSET_ADMIN_LASTNAME key should go before the SUPERSET_ADMIN_PASSWORD key
(UnorderedKey)
[warning] 31-31: [UnorderedKey] The SUPERSET_ADMIN_EMAIL key should go before the SUPERSET_ADMIN_FIRSTNAME key
(UnorderedKey)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.env.example around lines 14 - 31, Update the .env.example to use the
mandated BCSD_DATA_ prefix for all configuration keys instead of raw AIRFLOW_*
and SUPERSET_* names (e.g., replace AIRFLOW__CORE__EXECUTOR,
AIRFLOW_ADMIN_USERNAME/PASSWORD/FIRSTNAME/LASTNAME/EMAIL and
SUPERSET_SECRET_KEY, SUPERSET_ADMIN_* with BCSD_DATA_ prefixed equivalents), and
ensure the Compose/startup layer maps those BCSD_DATA_* env vars back to the
native names when launching services (so look for places that read AIRFLOW__* or
SUPERSET_* and change them to use BCSD_DATA_* mapping at service startup).
| {{ | ||
| config( | ||
| unique_key="event_key", | ||
| partition_by={"field": "event_date_dt", "data_type": "date"}, | ||
| cluster_by=["event_name", "user_pseudo_id"] | ||
| ) | ||
| }} |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
fd -t f 'dbt_project.yml' --exec cat {}Repository: BCSDLab/KOIN_DATA
Length of output: 880
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "event_key" dbt/bcsd_data -SRepository: BCSDLab/KOIN_DATA
Length of output: 1425
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find schema.yml related to this bronze_ga4 events model
fd -t f "schema.yml" dbt/bcsd_data -S | sed 's|^|SCHEMA: |'Repository: BCSDLab/KOIN_DATA
Length of output: 171
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Show schema blocks mentioning this model or event_key
fd -t f "schema.yml" dbt/bcsd_data -S -x sh -c 'rg -n --no-heading "bronze_ga4__events|ga4__events|event_key" "$1" || true' shRepository: BCSDLab/KOIN_DATA
Length of output: 197
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect the full model for how event_key is computed and how is_incremental/is_incremental filter is applied
fd -t f "bronze_ga4__events.sql" dbt/bcsd_data -S | while read -r f; do
echo "FILE: $f"
# show first 120 lines with numbers
nl -ba "$f" | sed -n '1,160p'
doneRepository: BCSDLab/KOIN_DATA
Length of output: 171
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Locate where is_incremental() is used inside this model (around line 48 as mentioned)
fd -t f "bronze_ga4__events.sql" dbt/bcsd_data -S | while read -r f; do
echo "IS_INCREMENTAL CONTEXT: $f"
nl -ba "$f" | sed -n '35,80p'
doneRepository: BCSDLab/KOIN_DATA
Length of output: 171
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Locate any tests enforcing uniqueness on event_key for GA4 events
fd -t f "tests" dbt/bcsd_data -S || true
rg -n "unique.*event_key|event_key.*unique" dbt/bcsd_data -S || trueRepository: BCSDLab/KOIN_DATA
Length of output: 537
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect bronze GA4 schema.yml for tests/unique constraints on event_key
fd -t f "schema.yml" dbt/bcsd_data/models/bronze/ga4 --exec sh -c 'echo "--- {} ---"; cat "{}"' || trueRepository: BCSDLab/KOIN_DATA
Length of output: 446
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect bronze_ga4__events.sql around is_incremental and event_key computation
nl -ba dbt/bcsd_data/models/bronze/ga4/bronze_ga4__events.sql | sed -n '1,140p'Repository: BCSDLab/KOIN_DATA
Length of output: 104
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Look specifically for uniqueness tests on event_key anywhere under bronze/ga4
rg -n --no-heading "event_key" dbt/bcsd_data/models/bronze/ga4/schema.yml dbt/bcsd_data/models/bronze/ga4 -SRepository: BCSDLab/KOIN_DATA
Length of output: 364
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# show bronze_ga4__events.sql with line numbers using cat -n (since nl missing)
cat -n dbt/bcsd_data/models/bronze/ga4/bronze_ga4__events.sql | sed -n '1,160p'Repository: BCSDLab/KOIN_DATA
Length of output: 1446
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "is_incremental\\(|event_key\\s*=|event_bundle_sequence_id|event_previous_timestamp|event_id" dbt/bcsd_data/models/bronze/ga4/bronze_ga4__events.sqlRepository: BCSDLab/KOIN_DATA
Length of output: 90
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "event_id|event_bundle|bundle_sequence|sequence|previous_timestamp|user_id" dbt/bcsd_data/models/bronze/ga4/bronze_ga4__events.sqlRepository: BCSDLab/KOIN_DATA
Length of output: 77
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# list selected column aliases around canonical select to see what identifiers are actually present
rg -n "select\\s*$|from staged|loaded_at|event_\\w+" dbt/bcsd_data/models/bronze/ga4/bronze_ga4__events.sqlRepository: BCSDLab/KOIN_DATA
Length of output: 624
Fix the incorrect materialized='incremental' claim; keep the event_key collision risk as a design concern.
This model inherits +materialized: incremental for models: bcsd_data: bronze in dbt/bcsd_data/dbt_project.yml, so is_incremental() should work even without materialized='incremental' in this block.
However, event_key is only md5(event_timestamp|user_pseudo_id|event_name), and bronze_ga4__events.event_key is unique tested; with incremental_strategy: merge, any true source duplicates at that grain will be collapsed to one row (potentially silently losing rows).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@dbt/bcsd_data/models/bronze/ga4/bronze_ga4__events.sql` around lines 1 - 7,
The config block incorrectly claims materialized='incremental' is needed; remove
that misleading claim and rely on the inherited +materialized: incremental so
is_incremental() continues to work; also add a short inline comment near the
config or the event_key definition (event_key = md5(event_timestamp ||
user_pseudo_id || event_name)) calling out the collision/data-loss risk when
using incremental_strategy: merge with a unique test on
bronze_ga4__events.event_key so reviewers know duplicates at that grain may be
collapsed silently.
| to_hex(md5(concat( | ||
| coalesce(cast(event_timestamp as string), ''), | ||
| '|', | ||
| coalesce(user_pseudo_id, ''), | ||
| '|', | ||
| coalesce(event_name, '') | ||
| ))) as event_key, |
There was a problem hiding this comment.
event_key is not unique enough; unique test will likely fail and the merge can drop valid rows.
(event_timestamp, user_pseudo_id, event_name) is not a natural key in GA4 — a single user can fire multiple events with the same name at the same microsecond timestamp (batched/duplicated hits). With this key:
- the
uniquetest onevent_keyinschema.ymlwill fail on full-refresh loads, and - the incremental
unique_keymerge will silently overwrite/drop those distinct rows.
Consider incorporating additional disambiguating fields (e.g., event_bundle_sequence_id, ga_session_id, and/or a row_number() over the natural key) into the hash.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@dbt/bcsd_data/models/bronze/ga4/bronze_ga4__events.sql` around lines 15 - 21,
The current event_key hash (constructed from event_timestamp, user_pseudo_id,
event_name) is not sufficiently unique; update the event_key generation in
bronze_ga4__events.sql (the to_hex(md5(concat(...))) expression) to include
additional disambiguating identifiers such as event_bundle_sequence_id and/or
ga_session_id, or incorporate a deterministic row discriminator (e.g., a
row_number() over the natural key partition) into the concatenated string so the
resulting event_key is truly unique for each physical event; ensure this change
aligns with the unique_key used by the incremental merge and the unique test in
schema.yml.
| joined as ( | ||
| select | ||
| to_hex(md5(concat(e.user_pseudo_id, '|', coalesce(sp.ga_session_id, '')))) as session_key, | ||
| e.user_pseudo_id, | ||
| sp.ga_session_id, | ||
| min(e.event_date_dt) as session_date, | ||
| min(e.event_timestamp) as session_start_at, | ||
| max(e.event_timestamp) as session_end_at, | ||
| count(*) as event_count | ||
| from events e | ||
| left join session_params sp using (event_key) | ||
| group by 1, 2, 3 | ||
| ) |
There was a problem hiding this comment.
Session-less events collapse into one synthetic session per user.
coalesce(sp.ga_session_id, '') means every event without a ga_session_id for a given user_pseudo_id is hashed into the same session_key. Across days/hours this merges unrelated activity into a single "session" with inflated event_count and a session_start_at/session_end_at spanning the user's entire lifetime. Consider excluding rows with a null ga_session_id from the session build, or keying them distinctly so they aren't conflated.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@dbt/bcsd_data/models/silver/ga4/silver_ga4__sessions.sql` around lines 25 -
37, The current session_key uses coalesce(sp.ga_session_id, '') which collapses
all null-session events for a user into one synthetic session; change the logic
in the joined CTE to either exclude rows where sp.ga_session_id IS NULL from the
session aggregation (i.e., add WHERE sp.ga_session_id IS NOT NULL before
grouping) or make null sessions unique per event/window by replacing
coalesce(sp.ga_session_id, '') with a distinct fallback (for example using
sp.ga_session_id || '|' || e.event_key or including a truncated
event_timestamp/event_date) so that session_key (the to_hex(md5(...))
expression) no longer conflates unrelated events for the same user_pseudo_id.
| - name: user_pseudo_id | ||
| tests: | ||
| - not_null |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
Can GA4 BigQuery export events have a null user_pseudo_id under consent mode?
💡 Result:
Yes, GA4 BigQuery export events can have a null user_pseudo_id when using Google’s Advanced Consent Mode [1][2]. When a user denies consent for analytics_storage (and/or ads_storage) in Advanced Consent Mode, GA4 still collects events in the form of "cookieless pings" [1][3]. In these cases, to comply with the user's privacy choices and avoid storing pseudonymous identifiers, GA4 strips identifiable information from the event [3][4]. Specifically: - user_pseudo_id is set to null [1][5]. - ga_session_id (located in event_params) is typically null or absent [1][2]. - The privacy_info.analytics_storage field will reflect a "No" status for these events [2][4]. This behavior is unique to Advanced Consent Mode [1][3]. In Basic Consent Mode, GA4 tags do not fire at all if consent is denied, meaning no data (and therefore no null-identifier rows) is sent to BigQuery [1][3]. These "orphaned" rows with null identifiers are useful for understanding total traffic volume and consent rates, but they cannot be used for traditional user or session counting because they lack the necessary identifiers to group activity [1][2][4]. Practitioners often filter these rows out at the model layer when calculating sessions or unique users to prevent data corruption [1][5].
Citations:
- 1: https://adriennevermorel.com/notes/ga4-consent-mode-orphaned-events/
- 2: https://www.ga4bigquery.com/how-to-deal-with-consent-mode-in-the-ga4-bigquery-export/
- 3: https://www.insighten.com.au/insights/navigating-consent-mode-in-ga4-bigquery-export
- 4: https://analyticscanvas.com/consent-mode-v2-in-ga4-how-basic-and-advanced-modes-impact-reporting-in-the-api-and-bigquery/
- 5: https://medium.com/@guillaumewagner/understand-the-ga4-bigquery-raw-data-with-consent-mode-activated-99fdf844fa18
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Locate and show the relevant schema.yml snippet with context
SCHEMA="dbt/bcsd_data/models/staging/ga4/schema.yml"
echo "=== File: $SCHEMA ==="
if [ -f "$SCHEMA" ]; then
nl -ba "$SCHEMA" | sed -n '1,120p'
else
echo "Missing: $SCHEMA"
fi
echo
echo "=== scripts/run_dbt.sh ==="
if [ -f "./scripts/run_dbt.sh" ]; then
nl -ba "./scripts/run_dbt.sh" | sed -n '1,200p'
else
echo "Missing: ./scripts/run_dbt.sh"
fi
echo
echo "=== rg: run_dbt build invocation ==="
rg -n "run_dbt\.sh build|dbt build|dbt test" -S ./scripts ./dbt 2>/dev/null || trueRepository: BCSDLab/KOIN_DATA
Length of output: 180
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Locate and show the relevant schema.yml snippet with context
SCHEMA="dbt/bcsd_data/models/staging/ga4/schema.yml"
echo "=== File: $SCHEMA ==="
if [ -f "$SCHEMA" ]; then
nl -ba "$SCHEMA" | sed -n '1,120p'
else
echo "Missing: $SCHEMA"
fi
echo
echo "=== scripts/run_dbt.sh ==="
if [ -f "./scripts/run_dbt.sh" ]; then
nl -ba "./scripts/run_dbt.sh" | sed -n '1,200p'
else
echo "Missing: ./scripts/run_dbt.sh"
fi
echo
echo "=== rg: run_dbt build invocation ==="
rg -n "run_dbt\.sh build|dbt build|dbt test" -S ./scripts ./dbt 2>/dev/null || trueRepository: BCSDLab/KOIN_DATA
Length of output: 180
🏁 Script executed:
#!/bin/bash
set -euo pipefail
SCHEMA="dbt/bcsd_data/models/staging/ga4/schema.yml"
echo "=== File: $SCHEMA (first 200 lines) ==="
if [ -f "$SCHEMA" ]; then
awk '{printf "%d\t%s\n", NR, $0}' "$SCHEMA" | sed -n '1,200p'
else
echo "Missing: $SCHEMA"
fi
echo
echo "=== scripts/run_dbt.sh (first 200 lines) ==="
if [ -f "./scripts/run_dbt.sh" ]; then
awk '{printf "%d\t%s\n", NR, $0}' "./scripts/run_dbt.sh" | sed -n '1,200p'
else
echo "Missing: ./scripts/run_dbt.sh"
fi
echo
echo "=== rg: dbt build/test invocation ==="
rg -n "run_dbt\.sh build|dbt build|dbt test" -S ./scripts ./dbt 2>/dev/null || trueRepository: BCSDLab/KOIN_DATA
Length of output: 865
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find the model SQL for stg_ga4__events
echo "=== rg: stg_ga4__events definition ==="
rg -n "stg_ga4__events" dbt/bcsd_data/models -S || true
echo
echo "=== rg: user_pseudo_id usage in GA4 staging ==="
rg -n "user_pseudo_id" dbt/bcsd_data/models/staging/ga4 -S || true
# Show candidate SQL files around where user_pseudo_id is selected
echo
echo "=== Candidate SQL files containing user_pseudo_id ==="
files=$(rg -l "user_pseudo_id" dbt/bcsd_data/models/staging/ga4 -S || true)
for f in $files; do
echo "--- $f (first 200 lines) ---"
awk '{printf "%d\t%s\n", NR, $0}' "$f" | sed -n '1,200p'
echo
doneRepository: BCSDLab/KOIN_DATA
Length of output: 1882
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== dbt_project.yml ==="
if [ -f "dbt/bcsd_data/dbt_project.yml" ]; then
awk '{printf "%d\t%s\n", NR, $0}' dbt/bcsd_data/dbt_project.yml | sed -n '1,220p'
else
echo "Missing: dbt/bcsd_data/dbt_project.yml"
fi
echo
echo "=== rg: 'severity' in dbt project ==="
rg -n "severity:" dbt/bcsd_data -S || true
echo
echo "=== rg: default test severity / config overrides ==="
rg -n "(tests:|test:|test\s*:|on-run-end|fail|warn)" dbt/bcsd_data -S || true
echo
echo "=== stg_ga4__events source SQL macro (ga4_events_source_sql) usage ==="
rg -n "ga4_events_source_sql" dbt/bcsd_data -S || trueRepository: BCSDLab/KOIN_DATA
Length of output: 2032
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== stg_ga4__events.sql (full file) ==="
awk '{print}' dbt/bcsd_data/models/staging/ga4/stg_ga4__events.sql | sed -n '1,200p'
echo
echo "=== Macro: dbt/bcsd_data/macros/ga4_events_source.sql (full file) ==="
awk '{print}' dbt/bcsd_data/macros/ga4_events_source.sql | sed -n '1,220p'
echo
echo "=== rg: null/filter/coalesce for user_pseudo_id in GA4 models ==="
rg -n "user_pseudo_id\s*(is\s+null|=+\s*null|!=\s*null)|coalesce\(\s*user_pseudo_id|ifnull\(\s*user_pseudo_id|nullif\(\s*user_pseudo_id" dbt/bcsd_data/models/staging/ga4 dbt/bcsd_data/models/bronze/ga4 dbt/bcsd_data/models/silver/ga4 -S || true
echo
echo "=== stg/bronze/silver schema.yml for user_pseudo_id tests ==="
for f in dbt/bcsd_data/models/*/ga4/schema.yml; do
echo "--- $f ---"
awk '{print}' "$f" | sed -n '1,120p'
doneRepository: BCSDLab/KOIN_DATA
Length of output: 3385
Relax not_null on stg_ga4__events.user_pseudo_id
- name: user_pseudo_id
tests:
- not_nullGA4 Advanced Consent Mode can generate “cookieless pings” where user_pseudo_id is NULL, and stg_ga4__events.sql selects user_pseudo_id directly (no null-handling). This not_null test can therefore fail ./scripts/run_dbt.sh build.
Change the test to warn (e.g., not_null: { severity: warn }) or remove it from staging.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@dbt/bcsd_data/models/staging/ga4/schema.yml` around lines 16 - 18, The
current schema test enforces not_null on stg_ga4__events.user_pseudo_id which
fails when GA4 Advanced Consent Mode emits cookieless pings; update the test in
schema.yml for the field user_pseudo_id (under the stg_ga4__events model) to
relax the constraint—either remove the not_null test or change it to a warning
(e.g., set not_null: { severity: warn }) so null user_pseudo_id values no longer
fail the build.
| airflow-webserver: | ||
| <<: *airflow-common | ||
| command: webserver | ||
| ports: | ||
| - "8080:8080" | ||
| healthcheck: | ||
| test: ["CMD", "curl", "--fail", "http://localhost:8080/health"] | ||
| interval: 30s | ||
| timeout: 10s | ||
| retries: 5 | ||
| restart: unless-stopped | ||
|
|
||
| airflow-scheduler: | ||
| <<: *airflow-common | ||
| command: scheduler | ||
| restart: unless-stopped |
There was a problem hiding this comment.
Webserver and scheduler can start before airflow-init finishes.
Both services inherit only depends_on: postgres (service_healthy) from x-airflow-common. They don't wait for airflow-init to complete airflow db migrate and admin creation, so the scheduler/webserver may boot against an unmigrated metadata DB and crash-loop on first startup. Add an explicit dependency on airflow-init completing successfully.
🔧 Proposed fix
airflow-webserver:
<<: *airflow-common
command: webserver
+ depends_on:
+ postgres:
+ condition: service_healthy
+ airflow-init:
+ condition: service_completed_successfully
ports:
- "8080:8080" airflow-scheduler:
<<: *airflow-common
command: scheduler
+ depends_on:
+ postgres:
+ condition: service_healthy
+ airflow-init:
+ condition: service_completed_successfully
restart: unless-stopped📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| airflow-webserver: | |
| <<: *airflow-common | |
| command: webserver | |
| ports: | |
| - "8080:8080" | |
| healthcheck: | |
| test: ["CMD", "curl", "--fail", "http://localhost:8080/health"] | |
| interval: 30s | |
| timeout: 10s | |
| retries: 5 | |
| restart: unless-stopped | |
| airflow-scheduler: | |
| <<: *airflow-common | |
| command: scheduler | |
| restart: unless-stopped | |
| airflow-webserver: | |
| <<: *airflow-common | |
| command: webserver | |
| depends_on: | |
| postgres: | |
| condition: service_healthy | |
| airflow-init: | |
| condition: service_completed_successfully | |
| ports: | |
| - "8080:8080" | |
| healthcheck: | |
| test: ["CMD", "curl", "--fail", "http://localhost:8080/health"] | |
| interval: 30s | |
| timeout: 10s | |
| retries: 5 | |
| restart: unless-stopped | |
| airflow-scheduler: | |
| <<: *airflow-common | |
| command: scheduler | |
| depends_on: | |
| postgres: | |
| condition: service_healthy | |
| airflow-init: | |
| condition: service_completed_successfully | |
| restart: unless-stopped |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docker-compose.yml` around lines 58 - 73, The webserver and scheduler
services currently only depend on the database health and can start before the
one-time init job finishes; update both airflow-webserver and airflow-scheduler
service blocks to add an explicit depends_on entry for airflow-init with the
service_completed_successfully condition (e.g., depends_on: airflow-init:
condition: service_completed_successfully) so they wait for the init job
(airflow db migrate + admin creation) to finish before starting; keep the
existing postgres service_healthy dependency as well.
| @@ -0,0 +1,9 @@ | |||
| import os | |||
|
|
|||
| SECRET_KEY = os.environ.get("SUPERSET_SECRET_KEY", "change_me") | |||
There was a problem hiding this comment.
Avoid an insecure default SECRET_KEY.
SECRET_KEY signs sessions and CSRF tokens; falling back to a hardcoded "change_me" means any deployment that forgets to set SUPERSET_SECRET_KEY runs with a publicly known key, allowing session forgery. Fail fast instead of silently defaulting.
🔒 Proposed fix
-SECRET_KEY = os.environ.get("SUPERSET_SECRET_KEY", "change_me")
+SECRET_KEY = os.environ["SUPERSET_SECRET_KEY"]📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| SECRET_KEY = os.environ.get("SUPERSET_SECRET_KEY", "change_me") | |
| SECRET_KEY = os.environ["SUPERSET_SECRET_KEY"] |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docker/superset/superset_config.py` at line 3, SECRET_KEY currently defaults
to an insecure literal; change behavior in superset_config.py so that SECRET_KEY
is obtained from os.environ["SUPERSET_SECRET_KEY"] (or explicitly validated) and
the process fails fast with a clear error if the variable is missing/empty
instead of falling back to "change_me"; update the SECRET_KEY assignment and add
a check that raises a RuntimeError (or exits) with a descriptive message
referencing SUPERSET_SECRET_KEY so deployments cannot accidentally run with a
public key.
| #!/usr/bin/env bash | ||
| set -euo pipefail | ||
|
|
||
| BACKUP_DIR="${BACKUP_DIR:-backups/postgres}" |
There was a problem hiding this comment.
Rename backup dir env var to use the required BCSD_DATA_ prefix.
BACKUP_DIR does not follow the required configuration variable prefix convention.
Suggested fix
-BACKUP_DIR="${BACKUP_DIR:-backups/postgres}"
+BACKUP_DIR="${BCSD_DATA_BACKUP_DIR:-${BACKUP_DIR:-backups/postgres}}"As per coding guidelines, "**/*.{sh,py,dockerfile,yaml,yml,.env.example}: Use BCSD_DATA_ as the environment variable prefix for all configuration".
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| BACKUP_DIR="${BACKUP_DIR:-backups/postgres}" | |
| BACKUP_DIR="${BCSD_DATA_BACKUP_DIR:-${BACKUP_DIR:-backups/postgres}}" |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/backup_postgres.sh` at line 4, The BACKUP_DIR env var doesn't follow
the required BCSD_DATA_ prefix; rename the variable to BCSD_DATA_BACKUP_DIR and
update its default assignment (replace
BACKUP_DIR="${BACKUP_DIR:-backups/postgres}" with
BCSD_DATA_BACKUP_DIR="${BCSD_DATA_BACKUP_DIR:-backups/postgres}"), then replace
all uses of BACKUP_DIR in the script with BCSD_DATA_BACKUP_DIR and update any
related docs/env examples to the new variable name.
| COMMAND="${1:-build}" | ||
| shift || true | ||
|
|
||
| docker compose run --rm airflow-webserver bash -lc "cd /opt/airflow/dbt/bcsd_data && dbt ${COMMAND} $*" |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Use a relative dbt project path in the script.
The hardcoded /opt/airflow/dbt/bcsd_data path violates the repository script rule to prefer relative paths in scripts/config.
Suggested fix
-docker compose run --rm airflow-webserver \
- dbt "${COMMAND}" "$@" --project-dir /opt/airflow/dbt/bcsd_data
+DBT_PROJECT_DIR="${BCSD_DATA_DBT_PROJECT_DIR:-dbt/bcsd_data}"
+docker compose run --rm airflow-webserver \
+ dbt "${COMMAND}" "$@" --project-dir "${DBT_PROJECT_DIR}"As per coding guidelines, "**/{scripts,airflow,dbt,config}/**/*.{sh,py,yaml,yml,conf,json}: Prefer relative paths inside scripts and config files".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/run_dbt.sh` at line 7, The docker compose invocation in
scripts/run_dbt.sh hardcodes /opt/airflow/dbt/bcsd_data; change it to use a path
relative to the repository/script location instead: compute the script directory
(use the directory of scripts/run_dbt.sh as the base) and build a relative path
to dbt/bcsd_data, then use that variable in the docker compose run ... bash -lc
"cd <relative-path> && dbt ${COMMAND} $*"; update the docker compose command to
reference this relative-path variable (and keep using the existing COMMAND and
argument expansion).
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify unsafe interpolation pattern exists in script (read-only).
rg -nP --type=sh '\bbash -lc ".*\$\{COMMAND\}.*\$\*' scripts/run_dbt.shRepository: BCSDLab/KOIN_DATA
Length of output: 167
Harden dbt invocation to prevent argument injection and word-splitting.
scripts/run_dbt.sh interpolates unquoted ${COMMAND} and $* into a bash -lc "..." command string, so user-provided whitespace/metacharacters can alter parsing/execution. Pass arguments as argv ("$@") and avoid nested shell interpolation.
Suggested fix
-COMMAND="${1:-build}"
-shift || true
-
-docker compose run --rm airflow-webserver bash -lc "cd /opt/airflow/dbt/bcsd_data && dbt ${COMMAND} $*"
+COMMAND="${1:-build}"
+shift || true
+
+docker compose run --rm airflow-webserver \
+ dbt "${COMMAND}" "$@" --project-dir /opt/airflow/dbt/bcsd_data📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| docker compose run --rm airflow-webserver bash -lc "cd /opt/airflow/dbt/bcsd_data && dbt ${COMMAND} $*" | |
| COMMAND="${1:-build}" | |
| shift || true | |
| docker compose run --rm airflow-webserver \ | |
| dbt "${COMMAND}" "$@" --project-dir /opt/airflow/dbt/bcsd_data |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/run_dbt.sh` at line 7, The current line embeds unquoted ${COMMAND}
and $* into a double-quoted bash -lc string, enabling argument injection and
word-splitting; change the invocation to pass arguments as argv and quote them:
stop interpolating ${COMMAND} and $* inside the command string, use "$COMMAND"
and "$@" (not $*) as positional parameters, call bash -lc with a single-quoted
command that uses exec dbt "$@" (or cd ... && exec dbt "$@"), and pass the
command and remaining args after a -- so the container receives them as argv
rather than being re-parsed by the outer shell.
9a5dd0a to
14dfe1e
Compare
| task_id="dbt_build_ga4", | ||
| bash_command=( | ||
| "cd /opt/airflow/dbt/koin && " | ||
| "dbt build --select tag:ga4 --vars '{lookback_hours: 6}'" |
There was a problem hiding this comment.
dbt에서 중복 알아서 처리 해주나? 이러면 중복 적재 있을거같아서 물어봄
14dfe1e to
6df9e15
Compare
6df9e15 to
66be308
Compare
Summary
Validation
Summary by CodeRabbit
New Features
Documentation
Chores