steppingblocks/data_engineering_manager_interview
SteppingBlocks Data role coding exercise

Scenario

SteppingBlocks receives nightly JSON drops from two third-party talent-data vendors. Your mission: build a thin, production-style slice that:

1. Ingests the raw JSON into Snowflake staging tables.
2. Transforms and unifies the data with dbt.
3. Orchestrates the process in an Airflow DAG.
4. Supports incremental, deduplicated loads: if rows are unchanged, they are not re-ingested.

We provide the skeleton; you fill in the business logic. The whole task is deliberately scoped to fit in ~1 hour.

Data you will use

Inside the repo’s data/ folder you will find four sub‑directories:

```text
data/
├── source_a/
│   ├── initial/*.json        # 2,000 rows
│   └── incremental/*.json    # 200 changed/new rows
└── source_b/
    ├── initial/*.json        # 1,500 rows
    └── incremental/*.json    # 120 changed/new rows
```

Each JSON record shares this schema:

| Field | Type | Notes |
| --- | --- | --- |
| person_id | string | Natural key; duplicates possible across vendors. |
| updated_at | ISO timestamp | Use as the watermark for incremental logic. |
| vendor | string | Literal "A" or "B". |
| attributes | object | Nested blob of column-like attributes (flatten this). |
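As an illustration of what "flatten this" means, here is a minimal Python sketch; only person_id, updated_at, vendor, and attributes come from the schema above, and the keys inside attributes ("title", "state") are invented examples:

```python
import json

def flatten_record(record: dict) -> dict:
    """Promote the nested `attributes` blob to top-level columns.

    The attr_ prefix is an illustrative convention to avoid column
    collisions; the keys inside `attributes` are hypothetical.
    """
    flat = {k: v for k, v in record.items() if k != "attributes"}
    for key, value in (record.get("attributes") or {}).items():
        flat[f"attr_{key}"] = value
    return flat

raw = json.loads(
    '{"person_id": "p1", "updated_at": "2024-01-01T00:00:00Z",'
    ' "vendor": "A", "attributes": {"title": "Engineer", "state": "GA"}}'
)
print(flatten_record(raw))
```

In the actual exercise this flattening happens in the dbt staging models against Snowflake's VARIANT column, but the shape of the transformation is the same.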

What to build

Feel free to ask any questions as you go through the exercise.

Airflow DAG

Create ingest_and_transform with three tasks:

| Task ID | Purpose |
| --- | --- |
| stage_raw | Load all JSON files from data/ into the Snowflake tables RAW_A and RAW_B. Use COPY INTO … or the Python connector; your call. Ensure repeated runs don't double-load existing rows. |
| dbt_run | Execute dbt run for the project in /opt/airflow/dbt. |
| dbt_test | Execute dbt test for the same project. |

Make the DAG idempotent (re‑runs safe) and unscheduled (schedule_interval=None).
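One way to make stage_raw re-run safe (a sketch, not the required approach; the staging table name and the attributes column are assumptions) is to land files into a transient staging table and then MERGE on the natural key plus watermark, so an identical re-run matches every row and inserts nothing:

```python
def build_stage_merge_sql(target: str, staging: str) -> str:
    """Build a Snowflake MERGE that upserts staged rows into a raw table.

    Matching on person_id + updated_at means re-running the task with the
    same files finds every row already present and inserts nothing.
    Column names beyond the spec's four fields are illustrative.
    """
    return f"""
        MERGE INTO {target} AS t
        USING {staging} AS s
          ON  t.person_id  = s.person_id
          AND t.updated_at = s.updated_at
        WHEN NOT MATCHED THEN
          INSERT (person_id, updated_at, vendor, attributes)
          VALUES (s.person_id, s.updated_at, s.vendor, s.attributes)
    """

print(build_stage_merge_sql("RAW_A", "RAW_A_STAGING"))
```

Executing this via the Snowflake Python connector (or achieving the same effect with COPY INTO's load-history deduplication) keeps repeated DAG runs from double-loading.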

dbt project

Inside dbt/:

  • Staging models (stg_source_a.sql, stg_source_b.sql)
    • Parse/flatten the JSON blob.
  • Core incremental model (person_unified.sql)
    • Union both sources, deduplicate on person_id, keeping the most recent updated_at.
    • Materialise as incremental with unique_key='person_id' and incremental_strategy='merge'.
  • Tests
    • not_null and unique on person_id.
    • One additional assertion of your choice.
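The dedup rule person_unified must implement (latest updated_at wins per person_id across both vendors) can be sketched in plain Python before translating it to SQL; the sample rows below are invented:

```python
def unify(rows_a: list, rows_b: list) -> list:
    """Union two vendor feeds, keeping the most recent row per person_id."""
    latest = {}
    for row in rows_a + rows_b:
        key = row["person_id"]
        # Uniform ISO-8601 timestamps compare correctly as strings
        if key not in latest or row["updated_at"] > latest[key]["updated_at"]:
            latest[key] = row
    return list(latest.values())

a = [{"person_id": "p1", "updated_at": "2024-01-01T00:00:00Z", "vendor": "A"}]
b = [{"person_id": "p1", "updated_at": "2024-02-01T00:00:00Z", "vendor": "B"},
     {"person_id": "p2", "updated_at": "2024-01-15T00:00:00Z", "vendor": "B"}]
unified = unify(a, b)
print(unified)  # p1 comes from vendor B; two rows total
```

In the dbt model, the same logic is typically expressed with `ROW_NUMBER() OVER (PARTITION BY person_id ORDER BY updated_at DESC)` and a filter to rank 1, feeding the merge on unique_key='person_id'.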

Demonstration run

  • Run #1 with only the initial folders present.
  • Run #2 after copying the incremental files in place.

Getting started

| Prerequisite | Purpose | Quick link |
| --- | --- | --- |
| Docker Desktop | Runs the Airflow + dbt container. | https://www.docker.com/products/docker-desktop |
| git | Clone this repo. | https://git-scm.com/downloads |
| Snowflake 30-day Trial | Target warehouse & DB (no card needed). | https://signup.snowflake.com |

Within Snowflake, make sure to run the following:

```sql
CREATE WAREHOUSE IF NOT EXISTS DEMO_WH AUTO_SUSPEND = 60;
CREATE DATABASE  IF NOT EXISTS DEMO_DB;
CREATE SCHEMA    IF NOT EXISTS DEMO_DB.PUBLIC;
```

Step-by-step guide

```bash
# 1. Clone the repo
git clone git@github.com:steppingblocks/data-science-coding-challenge.git
cd data-science-coding-challenge

# 2. Create & fill your creds
cp .env.example .env          # edit with your Snowflake trial values

# 3. Start the stack (first run pulls images, ~3-5 min)
docker compose pull
docker compose up --build

# 4. Open Airflow: http://localhost:8080
# The admin login is auto-generated; retrieve it from the container:
docker compose exec airflow cat /opt/airflow/simple_auth_manager_passwords.json.generated
```


Snowflake connection
In the Airflow UI → Admin → Connections → "+":

  • Conn Id: snowflake_default
  • Conn Type: Snowflake
  • Fill account, user, password, role, database, schema, warehouse → Save.
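If you prefer to keep credentials out of the UI entirely, the connector kwargs can be assembled from environment variables populated by your .env file. This is a hedged sketch: the SNOWFLAKE_* variable names are assumptions, so match them to whatever .env.example actually defines:

```python
import os

def snowflake_conn_kwargs() -> dict:
    """Collect Snowflake connector kwargs from the environment.

    The SNOWFLAKE_* names are assumptions; align them with .env.example.
    Raising on a missing value makes misconfiguration fail at startup
    rather than mid-DAG.
    """
    required = ["ACCOUNT", "USER", "PASSWORD", "WAREHOUSE", "DATABASE", "SCHEMA"]
    kwargs = {}
    for name in required:
        value = os.environ.get(f"SNOWFLAKE_{name}")
        if value is None:
            raise KeyError(f"SNOWFLAKE_{name} is not set")
        kwargs[name.lower()] = value
    return kwargs
```

Either route works for the exercise; the evaluation only asks that no credentials are hard-coded.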

That’s it—trigger the DAG and watch it run!

Repository layout

```text
.
├── dags/                     # Airflow DAG(s)
├── dbt/                      # dbt project
│   ├── models/
│   └── profiles.yml.example
├── data/                     # JSON drops (already included)
├── scripts/                  # Helper folder that can be ignored
│   └── generate_synthetic_data.json
├── Dockerfile                # Extends Airflow image with dbt + Snowflake libs
├── docker-compose.yml
├── .env.example
└── README.md                 # ← this file
```

Deliverables & submission

To be completed live on the call. The time allocated is ~1 hour. Polishing the code is nice, but prioritize working functionality.

Evaluation snapshot (what we look for)

Many criteria go into evaluating the codebase; here are a few important ones we look for:

| Dimension | Key questions |
| --- | --- |
| Correctness | Do row counts & increments match the spec? Is the DAG rerunnable? |
| Incremental logic | Does the dbt model use MERGE (or equivalent) correctly? |
| Orchestration design | Clear task boundaries, retries, logging? |
| Tests & documentation | Are the basic dbt tests present and passing? |
| Code hygiene | Clean commits, no hard-coded creds, readable SQL/Python? |
