# RFC 0004: Multi-Tenant Analytics Platform for MFB Screening Data

Status: Draft
Author: Katie Brey
Created: 2025-09-17
PR: TODO

## Summary

We propose an MFB Analytics Platform that replaces hard-to-maintain materialized views with an engineer-friendly, dbt-based data transformation pipeline. The new system provides reliable, testable analytics for global MFB insights and secure multi-tenant dashboards for white-label partners like North Carolina. Built with dbt, Grafana, and Terraform, it uses PostgreSQL row-level security to enforce data isolation automatically, without application-level filtering.
11+
12+
## Background
13+
14+
MyFriendBen currently uses a set of [materialized views](https://github.com/MyFriendBen/data-queries) to provide analytics data. These views aggregate screening data from the Django application's PostgreSQL database for our Looker Studio dashboards. However, our current approach has significant maintainability and scalability challenges.
15+
16+
The materialized views are difficult to modify because they have complex interdependencies - changing one view often requires dropping and recreating multiple dependent views in the correct order. There's no systematic way to test data transformations, making it risky to implement changes. Version control of the view definitions is manual and error-prone. When views need to be recreated, engineers must manually determine the dependency order and execute SQL scripts in sequence.
17+
18+
These limitations are particularly problematic as analytics requirements evolve. Global analytics need to show analytics across white labels, while white-label partners like North Carolina require secure, isolated access to only their data. The current materialized view approach lacks automated data isolation mechanisms, and makes it difficult to make updates to our analytics logic as we grow.

## Proposal

Replace the materialized views with a dbt-based data transformation pipeline that provides dependency management, automated testing, and version control. Implement PostgreSQL row-level security (RLS) to enable both global analytics (the priority) and secure multi-tenant access for white-label partners.

dbt defines data transformations as code, with automatic dependency resolution, built-in testing capabilities, and git-based version control. For multi-tenancy, PostgreSQL RLS policies automatically filter data based on database user credentials, eliminating the need for application-level filtering and reducing the risk of data leakage.

Grafana provides a flexible, open-source dashboarding solution that integrates well with PostgreSQL and supports multi-tenant access through separate database connections. Terraform automates the provisioning of Grafana datasources and dashboards, making it easy to manage configurations for multiple tenants.
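
To make the isolation model concrete, here is a sketch of how a tenant database session behaves (the table name and tenant id are illustrative; the `rls.white_label_id` setting and its fail-closed fallback are described under Technical Details):

```sql
-- Connected as a tenant user whose role was created with:
--   ALTER ROLE nc SET rls.white_label_id = '1';
SELECT current_setting('rls.white_label_id');   -- '1'

-- RLS silently restricts every query to that tenant's rows:
SELECT count(*) FROM mart_screen_eligibility;   -- counts white_label_id = 1 only

-- A session with no white_label_id set matches no rows at all,
-- so a misconfigured connection fails closed instead of leaking data.
```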

### User Experience

**MFB Administrators**:

- Access global dashboards showing cross-tenant analytics
- Manage tenant access through Terraform configuration updates

**State Partners (Tenant Users)**:

- Access Grafana dashboards using tenant-specific credentials
- View analytics filtered automatically to their state's data

**Engineers**:

- Create analytics tables using dbt, with RLS applied automatically via post-hooks
- Extend dashboards by modifying JSON templates in the `grafana/dashboards/` directory
- Add new tenants by updating `terraform.tfvars` and running `terraform apply`
- Connect to a local Grafana instance via Docker for easy dashboard development and testing without needing production credentials
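
For local development, Grafana can be run in Docker with a one-liner (a sketch; the anonymous-auth flags shown are one way to skip credentials locally):

```bash
# Run a throwaway local Grafana for dashboard development
docker run -d --name grafana -p 3000:3000 \
  -e GF_AUTH_ANONYMOUS_ENABLED=true \
  -e GF_AUTH_ANONYMOUS_ORG_ROLE=Admin \
  grafana/grafana
```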

### Implementation

The platform consists of three main components working together to provide secure, multi-tenant analytics:

**1. dbt Data Transformation Layer**

- Transforms raw Django model data into clean, analytics-ready tables
- Uses staging, intermediate, and mart models to structure transformations
- Built-in dependency management to ensure correct build order
- Includes automated tests to validate data quality and transformation logic
- Implements row-level security policies automatically via post-hooks
- Provides macros for creating tenant-specific database users with appropriate RLS settings
- Maintains data lineage and documentation for all transformations

**2. Grafana Visualization Layer**

- Provides interactive dashboards for screening analytics
- Uses tenant-specific database connections to enforce data isolation
- Configurable through JSON templates with variable substitution

**3. Terraform Infrastructure Layer**

- Automates provisioning of Grafana datasources and dashboards
- Manages tenant configurations and database credentials securely
- Enables easy addition of new tenants through configuration updates
- Maintains infrastructure state for reproducible deployments
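
One possible repository layout for these three components (illustrative; apart from `grafana/dashboards/` and `terraform.tfvars`, the directory names are assumptions):

```
analytics/
├── dbt/                      # dbt project: models, macros, tests
│   ├── models/
│   │   ├── staging/
│   │   ├── intermediate/
│   │   └── marts/
│   └── macros/               # setup_white_label_rls, create_rls_user
├── grafana/
│   └── dashboards/           # JSON dashboard templates
└── terraform/                # datasources, dashboards, terraform.tfvars
```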
### Technical Details

**Model Structure**: dbt uses a layered approach, with staging models to clean raw data, intermediate models for complex transformations, and mart models for final analytics tables. Each layer builds on the previous one, ensuring modularity and maintainability.

**Row-Level Security Implementation**: Each analytics table includes a post-hook that calls the `setup_white_label_rls()` macro. This macro creates PostgreSQL policies that automatically filter rows based on the current user's `rls.white_label_id` setting. Admin users are created with the `BYPASSRLS` privilege so they can access all data.

**Tenant User Management**: The `create_rls_user()` macro automates creation of database users with appropriate permissions and RLS settings. Regular users get their `white_label_id` set as a user-level configuration, while admin users bypass RLS entirely.
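
This RFC does not spell out the macro body; a minimal sketch consistent with the behavior described above might look like (the argument names mirror the `dbt run-operation` example later in this document, but the implementation details are assumptions):

```sql
{% macro create_rls_user(username, password, white_label_access=none, is_admin=false) %}

-- Create the login role (sketch: a production version would
-- guard against the role already existing).
CREATE ROLE {{ username }} WITH LOGIN PASSWORD '{{ password }}';

{% if is_admin %}
-- Admin users bypass RLS entirely and can read all tenants' data.
ALTER ROLE {{ username }} BYPASSRLS;
{% else %}
-- Regular users get their tenant id pinned as a user-level setting,
-- which the RLS policies read via current_setting('rls.white_label_id').
ALTER ROLE {{ username }} SET rls.white_label_id = '{{ white_label_access }}';
{% endif %}

{% endmacro %}
```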

**Materialization Strategy**: dbt offers several [materialization strategies](https://docs.getdbt.com/docs/build/materializations); by default, models are built as regular views. Some advice from dbt:

- Generally start with views for your models, and only change to another materialization when you notice performance problems.
- Use the table materialization for any models being queried by BI tools, to give your end user a faster experience.
- Consider materialized views for use cases where incremental models are sufficient, but you would like the data platform to manage the incremental logic and refresh.

Based on this advice, I suggest we use views for staging and intermediate models, and tables for the final marts that Grafana queries.
### Code Examples

#### Models

A dbt staging model to provide unique screens, `stg_screens.sql`:

```sql
-- dbt staging model for unique screens
{{
  config(
    materialized='view'
  )
}}

-- Remove duplicates: some records can share the same `uuid` due to pulling validations.
-- We keep a single row per `uuid`, choosing the most recent `submission_date`
-- (tie-broken by highest `id`). This guarantees unique uuids for downstream
-- models and satisfies the uniqueness tests.
WITH filtered AS (
    SELECT
        id,
        uuid,
        completed,
        submission_date,
        start_date,
        white_label_id,
        household_size,
        household_assets,
        housing_situation,
        zipcode,
        county,
        is_test,
        is_test_data
    FROM {{ source('django_apps', 'screener_screen') }}
    WHERE
        -- Only include completed screeners
        completed = true
        -- Filter out test data (check both is_test and is_test_data)
        AND (is_test = false OR is_test IS NULL)
        AND (is_test_data = false OR is_test_data IS NULL)
        -- Only include records with submission dates
        AND submission_date IS NOT NULL
), deduped AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY uuid
            ORDER BY submission_date DESC, id DESC
        ) AS row_num
    FROM filtered
)

SELECT
    id,
    uuid,
    completed,
    submission_date,
    start_date,
    white_label_id,
    household_size,
    household_assets,
    housing_situation,
    zipcode,
    county,
    is_test,
    is_test_data,
    -- Add date parts for easier aggregation
    DATE(submission_date) AS submission_date_only,
    EXTRACT(YEAR FROM submission_date) AS submission_year,
    EXTRACT(MONTH FROM submission_date) AS submission_month,
    EXTRACT(DAY FROM submission_date) AS submission_day,
    EXTRACT(DOW FROM submission_date) AS submission_day_of_week
FROM deduped
WHERE row_num = 1
```

Downstream models can then reference it with `SELECT * FROM {{ ref('stg_screens') }}`.

A `schema.yml` file defines tests for the model:

```yaml
models:
  - name: stg_screens
    description: "Staging model for screener data, cleaned and filtered"
    columns:
      - name: id
        description: "Primary key from screener_screen"
        tests:
          - not_null
          - unique
```
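
With models and tests defined, building and testing follows the standard dbt workflow (commands assume a configured dbt profile pointing at the analytics database):

```bash
dbt run --select stg_screens    # build the staging view
dbt test --select stg_screens   # run the not_null / unique tests above
dbt build                       # build and test the full DAG in dependency order
```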

#### RLS for white-label data isolation

Macro to set up RLS policies, `setup_white_label_rls.sql`:

```sql
{% macro setup_white_label_rls(table_name, white_label_column='white_label_id', schema_name=none) %}

{% set full_table_name %}
    {% if schema_name %}{{ schema_name }}.{{ table_name }}{% else %}{{ target.schema }}.{{ table_name }}{% endif %}
{% endset %}

{% set policy_name %}rls_white_label_{{ table_name }}{% endset %}

-- Enable RLS on the table
ALTER TABLE {{ full_table_name }} ENABLE ROW LEVEL SECURITY;

-- Drop existing policy if it exists
DROP POLICY IF EXISTS {{ policy_name }} ON {{ full_table_name }};

-- Create RLS policy that filters by user's white label setting
CREATE POLICY {{ policy_name }}
ON {{ full_table_name }}
FOR ALL
TO PUBLIC
USING (
    {{ white_label_column }} = COALESCE(
        NULLIF(current_setting('rls.white_label_id', true), '')::integer,
        -999999 -- Deny access if no white_label_id is set
    )
);

-- Grant necessary permissions
GRANT SELECT ON {{ full_table_name }} TO PUBLIC;

{% endmacro %}
```

Configure RLS on a mart model, `mart_screen_eligibility.sql`:

```sql
-- dbt mart table model with RLS enabled
{{
  config(
    materialized='table',
    description='Mart model summarizing benefit eligibility for each completed screen',
    post_hook="{{ setup_white_label_rls(this.name) }}"
  )
}}

-- Model body (the eligibility aggregation itself) elided for brevity
SELECT ...
```

```bash
# Create tenant user with RLS
dbt run-operation create_rls_user --vars '{"username": "nc", "password": "secure_password", "white_label_access": 1}'
```

#### Materialization configuration

dbt materialization configuration in `dbt_project.yml`:

```yaml
# Configure dbt materialization strategies: views by default, marts as tables.
models:
  benefits_dbt:
    +materialized: view
    marts:
      +materialized: table
```
250+
251+
#### Terraform dashboards
252+
253+
```
254+
# Global dashboard
255+
resource "grafana_dashboard" "global" {
256+
depends_on = [grafana_data_source.global_postgres]
257+
258+
config_json = templatefile("${path.module}/../grafana/dashboards/global.json.tpl", {
259+
datasource_uid = grafana_data_source.global_postgres.uid
260+
})
261+
overwrite = true
262+
}
263+
264+
# Tenant-specific dashboards
265+
resource "grafana_dashboard" "tenant_dashboards" {
266+
for_each = var.tenants
267+
depends_on = [grafana_data_source.tenant_postgres]
268+
269+
config_json = templatefile("${path.module}/../grafana/dashboards/tenant.json.tpl", {
270+
tenant_name = each.value.name
271+
tenant_display_name = each.value.display_name
272+
datasource_uid = grafana_data_source.tenant_postgres[each.key].uid
273+
})
274+
overwrite = true
275+
}
276+
```
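
Onboarding a new tenant then reduces to a `terraform.tfvars` edit plus `terraform apply` (a sketch; the exact shape of the `tenants` variable is an assumption inferred from the `each.value` references above):

```hcl
tenants = {
  nc = {
    name         = "nc"
    display_name = "North Carolina"
  }
}
```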
