A full-stack ML pipeline to identify heart attack risk using CDC’s 2023 BRFSS dataset. Built with a production-first mindset, this project integrates clean data engineering, interpretable ML, and real-time deployment using AWS services.
Project Outcome: A transparent, explainable, and high-performance heart-risk scoring engine — powered by XGBoost, deployed on SageMaker, and monitored via CloudWatch — ready for insurer use cases like underwriting and population health management.
Heart attacks are preventable yet costly. Traditional risk scoring misses early behavioral and comorbidity signals. This solution leverages public health data to deliver accurate, explainable risk predictions.
- Business Need: Insurance underwriting & preventative care
- Goal: Predict risk of heart attack from lifestyle + clinical factors
- End Product: Interpretable ML model with <1s API latency
- Source: CDC BRFSS 2023
- ~430,000 responses, 350+ fields
- 33 curated features selected post codebook mapping
- Target variable:
cvdinfra4(ever diagnosed with a heart attack)
- Extracted
.xptsurvey files, filtered relevant columns - Cleaned missing/ambiguous codes (
7=DK,9=Refused) - Standardized binary indicators, renamed columns, created modeling-ready dataset
- Dropped columns with excessive nulls
- Applied log + z-score transformation on skewed features (e.g.,
bmi_log_z) - Mapped categorical codes to readable labels using BRFSS codebook
- Identified high-risk groups using positive rate profiling (e.g., widowed, low-income)
- Used Cohen’s d, t-tests, chi-squared tests to rank feature importance
- Bivariate risk profiling for both categorical and numerical features
- Visual checks, outlier detection, and initial pruning
- Selected statistically significant features for modeling
- Pipeline:
log_znumerics + binned features + binary flags - Baseline AUROC: 0.9885, Recall: 1.00
- Optuna tuning: 150+ trials, minor lift (best
scale_pos_weight ≈ 1.2) - Compared with LGBM & Logistic Regression — XGBoost selected for performance + ecosystem
- Used SHAP for global and individual explanations
- Created engineered features:
chronic_burden,preventive_neglect,bmi_risk_code - Validated model robustness after removing dominant feature (
prev_chd_or_mi)
- Model saved as
xgb_top25_shap.jobliband deployed using built-in XGBoost 1.5-1 container - REST API exposed via
ml.m5.largeendpoint - Smoke tested with sample input/output
- Latency consistently <150ms (p95)
- Configured
InvocationLatency > 1salarm via Boto3 - Programmatically triggered and reset on deployment
- No threshold breaches observed in load testing
- Auto-logged: metrics, parameters, Conda env, artifacts
- Compared baseline vs tuned runs
- Enabled one-click rollback & reproducibility
- Stored trained models + test splits in artifact store
- CSV upload → real-time inference via deployed API
- SHAP visualizations (global + per-row with GPT insights)
- XGBoost vs Logistic Regression comparison toggle
- Parameterized DAG: data prep → training → deploy → monitor
- Papermill-executed notebooks, IAM-secured tasks
- Airflow DAG updates model + CloudWatch alarm dynamically
| Metric | Value |
|---|---|
| AUROC | 0.9884 |
| Recall | 1.00 |
| Precision | ~0.66 |
| Inference Time | < 150 ms |
| SHAP Validated | ✅ |
Data & Modeling: Python, Pandas, Scikit-learn, XGBoost, Optuna
Explainability: SHAP, OpenAI GPT API
Deployment: AWS SageMaker, Boto3
Monitoring: AWS CloudWatch, SNS
Tracking: MLflow
Orchestration: Airflow + Papermill
Web UI: Streamlit
