Steppingblocks receives nightly JSON drops from two third‑party talent‑data vendors. Your mission: build a thin, production‑style slice that

1. Ingests the raw JSON into Snowflake staging tables.
2. Transforms and unifies the data with dbt.
3. Orchestrates the process in an Airflow DAG.
4. Supports incremental, deduplicated loads: if rows are unchanged, they are not re‑ingested.
We provide the skeleton; you fill in the business logic. The whole task is deliberately scoped to fit in ~1 hour.
Inside the repo’s data/ folder you will find four sub‑directories:
```
data/
├── source_a/
│   ├── initial/*.json      # 2,000 rows
│   └── incremental/*.json  # 200 changed/new rows
└── source_b/
    ├── initial/*.json      # 1,500 rows
    └── incremental/*.json  # 120 changed/new rows
```
Each JSON record shares this schema:
| Field | Type | Notes |
|---|---|---|
| person_id | string | Natural key; duplicates possible across vendors. |
| updated_at | ISO timestamp | Use as the watermark for incremental logic. |
| vendor | string | Literal "A" or "B" |
| attributes | object | Nested blob of column‑like attributes (flatten this) |
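For orientation, Snowflake's colon-path syntax can flatten the nested blob in a single select. This is illustrative only: it assumes raw records land in one `VARIANT` column named `payload`, and `job_title` is a made-up attribute name, not part of the spec.

```sql
-- Sketch: flattening one record's attributes (names are illustrative)
select
    payload:person_id::string            as person_id,
    payload:updated_at::timestamp_ntz    as updated_at,
    payload:vendor::string               as vendor,
    payload:attributes:job_title::string as job_title
from raw_a;
```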
Feel free to ask any questions as you go through the exercise.
Create a DAG named `ingest_and_transform` with three tasks:
| Task ID | Purpose |
|---|---|
| stage_raw | Load all JSON files from data/ to Snowflake tables RAW_A and RAW_B. Use COPY INTO … or the Python connector—your call. Ensure repeated runs don’t double‑load existing rows. |
| dbt_run | Execute dbt run for the project in /opt/airflow/dbt. |
| dbt_test | Execute dbt test for the same project. |
Make the DAG idempotent (safe to re‑run) and unscheduled (`schedule_interval=None`).
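One way to keep `stage_raw` idempotent is to lean on Snowflake's `COPY INTO` load history, which skips files the table has already loaded (tracked for 64 days). A sketch, with stage and table names as assumptions rather than repo conventions:

```sql
-- Illustrative stage_raw SQL; @raw_stage is an assumed internal stage.
CREATE TABLE IF NOT EXISTS RAW_A (payload VARIANT);

COPY INTO RAW_A (payload)
FROM @raw_stage/source_a/
FILE_FORMAT = (TYPE = 'JSON')
ON_ERROR = 'ABORT_STATEMENT';  -- re-runs skip files already in load history
```

The Python connector works just as well; the point is that repeated task runs must not insert the same file twice.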
Inside `dbt/`:

- Staging models (`stg_source_a.sql`, `stg_source_b.sql`)
  - Parse/flatten the JSON blob.
- Core incremental model (`person_unified.sql`)
  - Union both sources; deduplicate on `person_id`, keeping the row with the most recent `updated_at`.
  - Materialise as incremental with `unique_key='person_id'` and `incremental_strategy='merge'`.
- Tests
  - `not_null` and `unique` on `person_id`.
  - One additional assertion of your choice.
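A possible shape for `person_unified.sql` — a sketch, not the expected answer. It assumes the staging models expose `person_id`, `vendor`, and `updated_at`, and elides the flattened attribute columns:

```sql
{{ config(
    materialized='incremental',
    unique_key='person_id',
    incremental_strategy='merge'
) }}

with unioned as (
    select * from {{ ref('stg_source_a') }}
    union all
    select * from {{ ref('stg_source_b') }}
),

ranked as (
    -- keep only the newest record per person_id across both vendors
    select *,
           row_number() over (
               partition by person_id
               order by updated_at desc
           ) as rn
    from unioned
)

select person_id, vendor, updated_at  -- plus the flattened attribute columns
from ranked
where rn = 1
{% if is_incremental() %}
  -- only process rows newer than the current high-water mark
  and updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```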
Verify the pipeline in two runs:

- **Run #1** with only the `initial` folders present.
- **Run #2** after copying the `incremental` files into place.
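After each run, a quick sanity check against the spec's row counts. This assumes the unified model materialises as `PERSON_UNIFIED` in the target schema:

```sql
-- After run #1, RAW_A should hold 2,000 rows and RAW_B 1,500; after run #2
-- each incremental file should have been loaded exactly once (no double-loads).
select count(*) from RAW_A;
select count(*) from RAW_B;

-- PERSON_UNIFIED should never contain duplicate person_ids.
select count(*) - count(distinct person_id) as dup_rows
from PERSON_UNIFIED;  -- expect 0
```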
| Prerequisite | Purpose | Quick link |
|---|---|---|
| Docker Desktop | Runs the Airflow + dbt container. | https://www.docker.com/products/docker-desktop |
| git | Clone this repo. | https://git-scm.com/downloads |
| Snowflake 30‑day Trial | Target warehouse & DB (no card needed). | https://signup.snowflake.com |
Within Snowflake, run the following:

```sql
CREATE WAREHOUSE IF NOT EXISTS DEMO_WH AUTO_SUSPEND = 60;
CREATE DATABASE IF NOT EXISTS DEMO_DB;
CREATE SCHEMA IF NOT EXISTS DEMO_DB.PUBLIC;
```
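If you take the `COPY INTO` route, you will also need a JSON file format and a stage; the object names here are illustrative, not prescribed by the repo:

```sql
CREATE FILE FORMAT IF NOT EXISTS DEMO_DB.PUBLIC.JSON_FMT TYPE = 'JSON';
CREATE STAGE IF NOT EXISTS DEMO_DB.PUBLIC.RAW_STAGE
    FILE_FORMAT = DEMO_DB.PUBLIC.JSON_FMT;
```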
```bash
# 1. Clone the repo
git clone git@github.com:steppingblocks/data-science-coding-challenge.git
cd data-eng-mini-exercise

# 2. Create & fill your creds
cp .env.example .env   # edit with your Snowflake trial values

# 3. Start the stack (first run pulls images, ~3-5 min)
docker compose pull
docker compose up --build

# 4. Open Airflow at http://localhost:8080; the login is generated inside
#    the container, so retrieve it with:
docker compose exec airflow cat /opt/airflow/simple_auth_manager_passwords.json.generated
```
Snowflake connection
In the Airflow UI → Admin → Connections → “+”:

- Conn Id: `snowflake_default`
- Conn Type: Snowflake
- Fill in account, user, password, role, database, schema, warehouse → Save.
That’s it—trigger the DAG and watch it run!
```
.
├── dags/                  # Airflow DAG(s)
├── dbt/                   # dbt project
│   ├── models/
│   └── profiles.yml.example
├── data/                  # JSON drops (already included)
├── scripts/               # Helper folder that can be ignored
│   └── generate_synthetic_data.json
├── Dockerfile             # Extends Airflow image with dbt + Snowflake libs
├── docker-compose.yml
├── .env.example
└── README.md              # ← this file
```
The exercise is completed live on a call, with ~1 hour allocated. Polishing the code is nice, but focus on functionality.
We evaluate the codebase against many criteria; here are a few important ones we look for:
| Dimension | Key questions |
|---|---|
| Correctness | Do row counts & increments match spec? Is the DAG rerunnable? |
| Incremental logic | Does the dbt model use MERGE or equivalent correctly? |
| Orchestration design | Clear task boundaries, retries, logging? |
| Tests & documentation | Are basic dbt tests present and passing? |
| Code hygiene | Clean commits, no hard‑coded creds, readable SQL/Python? |