Steppingblocks receives nightly JSON drops from two third‑party talent‑data vendors. Your mission: build a thin, production‑style slice that

1. Ingests the raw JSON into Snowflake staging tables.
2. Transforms and unifies the data with dbt.
3. Orchestrates the process in an Airflow DAG.
4. Supports incremental, deduplicated loads: if rows are unchanged, they are not re‑ingested.
We provide the skeleton; you fill in the business logic. The whole task is deliberately scoped to fit in ~1 hour.
Inside the repo’s data/ folder you will find four sub‑directories:
```
data/
├── source_a/
│   ├── initial/*.json      # 2,000 rows
│   └── incremental/*.json  # 200 changed/new rows
└── source_b/
    ├── initial/*.json      # 1,500 rows
    └── incremental/*.json  # 120 changed/new rows
```
Each JSON record shares this schema:
| Field | Type | Notes |
|---|---|---|
| person_id | string | Natural key; duplicates possible across vendors. |
| updated_at | ISO timestamp | Use as the watermark for incremental logic. |
| vendor | string | Literal "A" or "B" |
| attributes | object | Nested blob of column‑like attributes (flatten this) |
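For orientation, Snowflake's colon-path syntax can flatten the nested blob in a single select. This is illustrative only: it assumes raw records land in one `VARIANT` column named `payload`, and `job_title` is a made-up attribute name, not part of the spec.

```sql
-- Sketch: flattening one record's attributes (names are illustrative)
select
    payload:person_id::string            as person_id,
    payload:updated_at::timestamp_ntz    as updated_at,
    payload:vendor::string               as vendor,
    payload:attributes:job_title::string as job_title
from raw_a;
```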
Feel free to ask any questions as you go through the exercise.
Create a DAG named `ingest_and_transform` with three tasks:
| Task ID | Purpose |
|---|---|
| stage_raw | Load all JSON files from data/ to Snowflake tables RAW_A and RAW_B. Use COPY INTO … or the Python connector—your call. Ensure repeated runs don’t double‑load existing rows. |
| dbt_run | Execute dbt run for the project in /opt/airflow/dbt. |
| dbt_test | Execute dbt test for the same project. |
Make the DAG idempotent (safe to re‑run) and unscheduled (`schedule_interval=None`).
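One way to keep `stage_raw` idempotent is to lean on Snowflake's `COPY INTO` load history, which skips files the table has already loaded (tracked for 64 days). A sketch, with stage and table names as assumptions rather than repo conventions:

```sql
-- Illustrative stage_raw SQL; @raw_stage is an assumed internal stage.
CREATE TABLE IF NOT EXISTS RAW_A (payload VARIANT);

COPY INTO RAW_A (payload)
FROM @raw_stage/source_a/
FILE_FORMAT = (TYPE = 'JSON')
ON_ERROR = 'ABORT_STATEMENT';  -- re-runs skip files already in load history
```

The Python connector works just as well; the point is that repeated task runs must not insert the same file twice.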
Inside `dbt/`:

- Staging models (`stg_source_a.sql`, `stg_source_b.sql`)
  - Parse/flatten the JSON blob.
- Core incremental model (`person_unified.sql`)
  - Union both sources; deduplicate on `person_id`, keeping the row with the most recent `updated_at`.
  - Materialise as incremental with `unique_key='person_id'` and `incremental_strategy='merge'`.
- Tests
  - `not_null` and `unique` on `person_id`.
  - One additional assertion of your choice.
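A possible shape for `person_unified.sql` — a sketch, not the expected answer. It assumes the staging models expose `person_id`, `vendor`, and `updated_at`, and elides the flattened attribute columns:

```sql
{{ config(
    materialized='incremental',
    unique_key='person_id',
    incremental_strategy='merge'
) }}

with unioned as (
    select * from {{ ref('stg_source_a') }}
    union all
    select * from {{ ref('stg_source_b') }}
),

ranked as (
    -- keep only the newest record per person_id across both vendors
    select *,
           row_number() over (
               partition by person_id
               order by updated_at desc
           ) as rn
    from unioned
)

select person_id, vendor, updated_at  -- plus the flattened attribute columns
from ranked
where rn = 1
{% if is_incremental() %}
  -- only process rows newer than the current high-water mark
  and updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```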
Verify the pipeline in two runs:

- **Run #1** with only the `initial` folders present.
- **Run #2** after copying the `incremental` files into place.
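After each run, a quick sanity check against the spec's row counts. This assumes the unified model materialises as `PERSON_UNIFIED` in the target schema:

```sql
-- After run #1, RAW_A should hold 2,000 rows and RAW_B 1,500; after run #2
-- each incremental file should have been loaded exactly once (no double-loads).
select count(*) from RAW_A;
select count(*) from RAW_B;

-- PERSON_UNIFIED should never contain duplicate person_ids.
select count(*) - count(distinct person_id) as dup_rows
from PERSON_UNIFIED;  -- expect 0
```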
| Prerequisite | Purpose | Quick link |
|---|---|---|
| Docker Desktop | Runs the Airflow + dbt container. | https://www.docker.com/products/docker-desktop |
| git | Clone this repo. | https://git-scm.com/downloads |
| Snowflake 30‑day Trial | Target warehouse & DB (no card needed). | https://signup.snowflake.com |
Within Snowflake, run the following:

```sql
CREATE WAREHOUSE IF NOT EXISTS DEMO_WH AUTO_SUSPEND = 60;
CREATE DATABASE IF NOT EXISTS DEMO_DB;
CREATE SCHEMA IF NOT EXISTS DEMO_DB.PUBLIC;
```
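If you take the `COPY INTO` route, you will also need a JSON file format and a stage; the object names here are illustrative, not prescribed by the repo:

```sql
CREATE FILE FORMAT IF NOT EXISTS DEMO_DB.PUBLIC.JSON_FMT TYPE = 'JSON';
CREATE STAGE IF NOT EXISTS DEMO_DB.PUBLIC.RAW_STAGE
    FILE_FORMAT = DEMO_DB.PUBLIC.JSON_FMT;
```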
```bash
# 1. Clone the repo
git clone git@github.com:steppingblocks/data-science-coding-challenge.git
cd data-eng-mini-exercise

# 2. Create & fill your creds
cp .env.example .env   # edit with your Snowflake trial values

# 3. Start the stack (first run pulls images, ~3-5 min)
docker compose pull
docker compose up --build

# 4. Open Airflow at http://localhost:8080; the login is generated inside
#    the container, so retrieve it with:
docker compose exec airflow cat /opt/airflow/simple_auth_manager_passwords.json.generated
```
Snowflake connection
In the Airflow UI → Admin → Connections → “+”:

- Conn Id: `snowflake_default`
- Conn Type: Snowflake
- Fill in account, user, password, role, database, schema, warehouse → Save.
That’s it—trigger the DAG and watch it run!
```
.
├── dags/                  # Airflow DAG(s)
├── dbt/                   # dbt project
│   ├── models/
│   └── profiles.yml.example
├── data/                  # JSON drops (already included)
├── scripts/               # Helper folder that can be ignored
│   └── generate_synthetic_data.json
├── Dockerfile             # Extends Airflow image with dbt + Snowflake libs
├── docker-compose.yml
├── .env.example
└── README.md              # ← this file
```
The exercise is completed live on a call, with ~1 hour allocated. Polishing the code is nice, but focus on functionality.
We evaluate the codebase against many criteria; here are a few important ones we look for:
| Dimension | Key questions |
|---|---|
| Correctness | Do row counts & increments match spec? Is the DAG rerunnable? |
| Incremental logic | Does the dbt model use MERGE or equivalent correctly? |
| Orchestration design | Clear task boundaries, retries, logging? |
| Tests & documentation | Are basic dbt tests present and passing? |
| Code hygiene | Clean commits, no hard‑coded creds, readable SQL/Python? |