This project shows how I built a production-style semiconductor quality pipeline end to end using Databricks, PySpark, SQL, Delta Lake, and scikit-learn on the UCI SECOM dataset (1,567 records, 591 sensor features). I implemented a full Bronze/Silver/Gold medallion flow that ingests raw manufacturing signals, applies strict cleaning and typing rules, and publishes ML-ready Delta tables for downstream quality monitoring.
On top of the data platform layer, I trained an Isolation Forest anomaly model to flag likely defective units, added automated SQL data-quality controls, and produced interpretable outputs for high-risk records and sensor behavior shifts. The repository also includes verifiable Databricks workspace run proof (Jobs API and SQL query evidence); the validated workspace run reports an Isolation Forest accuracy of about 88.5%.
| Area | What this project does |
|---|---|
| Dataset | UCI SECOM semiconductor manufacturing data |
| Scale | 1,567 records and 591 raw sensor features |
| Platform | Databricks-style pipeline with PySpark, SQL, Delta Lake, and scikit-learn |
| Architecture | Bronze -> Silver -> Gold medallion flow |
| Quality layer | 8 automated SQL data-quality validations |
| ML output | Isolation Forest anomaly detection for likely defective units |
| Interpretability | feature shifts, surrogate importance, and local anomaly explanations |
| Proof | Databricks Jobs API evidence and SQL validation proof committed in the repo |
Manufacturing Quality Anomaly Detection Pipeline using Databricks, PySpark, SQL, Delta Lake, and Scikit-learn
- Source: UCI SECOM
- Records: 1,567
- Raw sensor features: 591
- Label mapping: `-1 -> 0` (pass), `1 -> 1` (fail)
- Fail count in run: 104
- Files used: `data/raw/secom.data`, `data/raw/secom_labels.data`
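For reference, here is a minimal ingestion sketch along the lines of `notebooks/01_setup_and_ingestion.py` (pandas-based; it assumes the standard SECOM layout of space-separated sensor readings plus a label file whose first column is the `-1`/`1` class, and the `sensor_*`/`row_id` column names are illustrative, not the repo's actual schema):

```python
import pandas as pd

# Raw SECOM layout (assumed): secom.data holds 591 space-separated sensor
# readings per unit; secom_labels.data starts each row with a -1/1 class.
sensors = pd.read_csv("data/raw/secom.data", sep=r"\s+", header=None)
labels = pd.read_csv("data/raw/secom_labels.data", sep=r"\s+", header=None)

# Illustrative column names for the 591 raw sensor features.
sensors.columns = [f"sensor_{i:03d}" for i in sensors.columns]

# Standardize labels: -1 (pass) -> 0, 1 (fail) -> 1, and add a stable row id.
sensors["label"] = labels.iloc[:, 0].map({-1: 0, 1: 1}).values
sensors["row_id"] = range(len(sensors))

print(sensors.shape)                # expect (1567, 593): 591 sensors + label + row_id
print(int(sensors["label"].sum()))  # expect 104 fails in this run
```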
- Bronze: raw ingested records + ingestion metadata
- Silver: typed, cleaned, validated, joined records with explicit null policy
- Gold: ML-ready feature store, anomaly scores, quality summaries, and interpretability outputs
A local run persists tables as Parquet/CSV under `outputs/`; Databricks SQL notebooks and scripts are included for Delta table execution in the workspace.
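As a rough illustration of that dual persistence path, here is a minimal sketch (the helper name, layer names, and the `use_delta` switch are hypothetical, not the repo's actual API):

```python
from pathlib import Path

def write_table(pdf, layer: str, name: str, spark_df=None, use_delta: bool = False):
    """Persist a curated table: Delta in a Databricks workspace, Parquet locally."""
    if use_delta and spark_df is not None:
        # Workspace path: managed Delta table in the target schema (illustrative).
        spark_df.write.format("delta").mode("overwrite").saveAsTable(f"{layer}.{name}")
    else:
        # Local path: Parquet under outputs/<layer>/, as described above.
        out_dir = Path("outputs") / layer
        out_dir.mkdir(parents=True, exist_ok=True)
        pdf.to_parquet(out_dir / f"{name}.parquet", index=False)
```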
- Sensor values cast to numeric (invalid -> `null`)
- Drop columns with null ratio > 50%
- Drop constant/near-constant columns
- Median imputation for remaining numeric nulls
- Label standardization to binary pass/fail
- Deduplicate by `row_id`
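A compact pandas sketch of these rules (the function name, the near-constant threshold shown, and the column naming are illustrative, not the exact implementation in the repo's transformation code):

```python
import pandas as pd

def clean_silver(bronze: pd.DataFrame, null_ratio_max: float = 0.5) -> pd.DataFrame:
    # Deduplicate by row_id before any column-level work.
    df = bronze.drop_duplicates(subset="row_id").copy()
    sensor_cols = [c for c in df.columns if c.startswith("sensor_")]

    # Cast sensor values to numeric; anything unparseable becomes null.
    df[sensor_cols] = df[sensor_cols].apply(pd.to_numeric, errors="coerce")

    # Drop columns whose null ratio exceeds 50%.
    keep = [c for c in sensor_cols if df[c].isna().mean() <= null_ratio_max]

    # Drop constant / near-constant columns (illustrative rule: one value
    # accounts for >= 99% of non-null entries).
    keep = [c for c in keep
            if df[c].value_counts(normalize=True, dropna=True).iloc[0] < 0.99]

    # Median imputation for the remaining numeric nulls.
    df[keep] = df[keep].fillna(df[keep].median())

    return df[["row_id", "label"] + keep]
```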
- Bronze rows: 1,567
- Silver curated rows: 1,567
- Gold feature rows: 1,567
- Features after cleaning: 440
- Dropped high-null features: 29
- Dropped constant/near-constant features: 122
- Data quality checks: 8/8 PASS
- Precision: 0.1636
- Recall: 0.1731
- F1: 0.1682
- Balanced Accuracy: 0.5551
- ROC-AUC: 0.5896
- PR-AUC: 0.1161
- Defect capture @ top 10% anomalies: 0.1923
Confusion matrix:
- TN: 1371
- FP: 92
- FN: 86
- TP: 18
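The modeling stage can be reproduced in outline with scikit-learn roughly as follows (hyperparameters, variable names, and the `feature_cols` list are illustrative, not the exact values behind the numbers above; `silver` is a cleaned feature frame such as the one built in the sketch earlier):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             balanced_accuracy_score, roc_auc_score,
                             average_precision_score, confusion_matrix)

X = silver[feature_cols].values   # cleaned, imputed sensor features
y = silver["label"].values        # 0 = pass, 1 = fail

# Unsupervised fit; contamination set near the observed fail rate (~6.6%).
iso = IsolationForest(n_estimators=200, contamination=0.07, random_state=42)
iso.fit(X)

flags = (iso.predict(X) == -1).astype(int)  # 1 = flagged anomaly
scores = -iso.score_samples(X)              # higher = more anomalous

print(confusion_matrix(y, flags))
print("precision", precision_score(y, flags))
print("recall   ", recall_score(y, flags))
print("f1       ", f1_score(y, flags))
print("bal. acc ", balanced_accuracy_score(y, flags))
print("roc-auc  ", roc_auc_score(y, scores))
print("pr-auc   ", average_precision_score(y, scores))

# Defect capture at the top 10% most anomalous records.
top_k = int(0.10 * len(scores))
top_idx = np.argsort(scores)[::-1][:top_k]
print("capture@10%", y[top_idx].sum() / y.sum())
```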
- Row-count reconciliation
- Schema integrity
- Required field null checks
- Label integrity (`-1/1` raw, `0/1` standardized)
- Duplicate detection by `row_id`
- Cast-failure tracking
- Null-threshold compliance
- Gold scoring completeness
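In the local run these validations reduce to small PASS/FAIL assertions; a pandas-flavored sketch of three of them (function and check names are illustrative; the SQL versions live in `sql/validation_checks.sql` and `notebooks/03_sql_data_quality_checks.sql`):

```python
def run_quality_checks(bronze, silver) -> dict:
    """Return {check_name: 'PASS' | 'FAIL'} for a few of the validations above."""
    checks = {
        # Row-count reconciliation: no records silently dropped between layers.
        "row_count_reconciliation": len(bronze) == len(silver),
        # Label integrity: standardized labels must be strictly 0/1.
        "label_integrity": set(silver["label"].unique()) <= {0, 1},
        # Duplicate detection: row_id must be unique in the curated layer.
        "duplicate_row_id": silver["row_id"].is_unique,
    }
    return {name: "PASS" if ok else "FAIL" for name, ok in checks.items()}
```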
- Global: top 10 anomaly-associated sensors using robust anomaly-vs-normal shift
- Surrogate: top 10 features from RandomForest surrogate trained on anomaly flags
- Local: top 5 deviating sensors for 3 flagged records
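A sketch of how these three views can be computed (assuming `X_df` is the imputed feature DataFrame and `flags` the Isolation Forest output from the modeling sketch above; the exact heuristics and thresholds in the repo may differ):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Global view: robust shift of each sensor between flagged and normal records,
# scaled by the column's IQR so heavy-tailed sensors don't dominate.
anom, norm = X_df[flags == 1], X_df[flags == 0]
iqr = (X_df.quantile(0.75) - X_df.quantile(0.25)).replace(0, np.nan)
shift = ((anom.median() - norm.median()).abs() / iqr).sort_values(ascending=False)
print(shift.head(10))  # top 10 anomaly-associated sensors

# Surrogate view: a RandomForest trained to imitate the anomaly flags;
# its impurity-based importances rank the sensors the flags depend on.
surrogate = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_df, flags)
importances = pd.Series(surrogate.feature_importances_, index=X_df.columns)
print(importances.sort_values(ascending=False).head(10))

# Local view: for one flagged record, the sensors deviating most from the
# normal-population median, in IQR units.
row = X_df[flags == 1].iloc[0]
deviation = ((row - norm.median()).abs() / iqr).sort_values(ascending=False)
print(deviation.head(5))
```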
manufacturing-quality-anomaly-detection/
├── README.md
├── requirements.txt
├── data/
│ └── raw/
├── notebooks/
│ ├── 01_setup_and_ingestion.py
│ ├── 02_bronze_to_silver_transform.py
│ ├── 03_sql_data_quality_checks.sql
│ ├── 04_build_gold_feature_store.py
│ ├── 05_train_isolation_forest.py
│ ├── 06_evaluate_and_explain.py
│ └── 07_dashboard_queries.sql
├── src/
│ ├── config.py
│ ├── ingestion.py
│ ├── transformations.py
│ ├── quality_checks.py
│ ├── feature_engineering.py
│ ├── modeling.py
│ └── pipeline.py
├── sql/
│ ├── create_catalog_schemas.sql
│ ├── validation_checks.sql
│ └── dashboard_queries.sql
├── tests/
│ ├── conftest.py
│ ├── test_schema.py
│ ├── test_transformations.py
│ └── test_model.py
├── docs/
│ ├── architecture_diagram.png
│ ├── dashboard_screenshot.png
│ ├── data_dictionary.md
│ └── final_report.md
└── outputs/
├── bronze/
├── silver/
├── gold/
├── reports/
└── figures/
- `python3 notebooks/01_setup_and_ingestion.py`
- `python3 notebooks/02_bronze_to_silver_transform.py`
- Execute `notebooks/03_sql_data_quality_checks.sql` in Databricks SQL
- `python3 notebooks/04_build_gold_feature_store.py`
- `python3 notebooks/05_train_isolation_forest.py`
- `python3 notebooks/06_evaluate_and_explain.py`
- Execute `notebooks/07_dashboard_queries.sql` in Databricks SQL
Or run everything in one shot:
`python3 -m src.pipeline`
- `outputs/gold/secom_feature_store.parquet`
- `outputs/gold/secom_anomaly_predictions.parquet`
- `outputs/gold/secom_quality_summary.parquet`
- `outputs/gold/data_quality_results.csv`
- `outputs/reports/model_metrics.json`
- `outputs/reports/top_feature_shifts.csv`
- `outputs/gold/local_row_explanations.csv`
- `docs/architecture_diagram.png`
- `docs/dashboard_screenshot.png`
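For a quick look at the scored output, something like the following works on the local artifacts (the column names here are assumptions rather than the table's documented schema; see `docs/data_dictionary.md`):

```python
import pandas as pd

# Pull the highest-risk units from the Gold predictions table for review.
# Column names below are hypothetical placeholders.
preds = pd.read_parquet("outputs/gold/secom_anomaly_predictions.parquet")
top_risk = preds.sort_values("anomaly_score", ascending=False).head(20)
print(top_risk[["row_id", "anomaly_score", "is_anomaly", "label"]])
```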
Built a Databricks-style ELT pipeline with PySpark/SQL-ready assets to ingest and transform 1,567 semiconductor records with 591 raw features into Bronze/Silver/Gold layers, trained an Isolation Forest for defect-oriented anomaly detection, implemented 8 automated SQL quality checks, and reduced invalid curated records to zero under defined acceptance rules.
- Jobs run: `docs/databricks_run_proof.md`
- Raw Jobs API metadata: `outputs/reports/databricks_run_metadata.json`
- Raw SQL validation proof: `outputs/reports/databricks_sql_proof.json`
- Isolation Forest accuracy (Databricks run): `0.8851308232291002` (~88.51%)
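For anyone reproducing the proof, run metadata like this can be pulled from the Databricks Jobs REST API (2.1); a minimal sketch, with the host, token, and run ID as placeholders:

```python
import json
import os
import requests

host = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]  # personal access token
run_id = "<run-id-from-the-job-run>"    # placeholder

# Fetch the job run's metadata from the Jobs API and save it as evidence.
resp = requests.get(
    f"{host}/api/2.1/jobs/runs/get",
    headers={"Authorization": f"Bearer {token}"},
    params={"run_id": run_id},
    timeout=30,
)
resp.raise_for_status()

with open("outputs/reports/databricks_run_metadata.json", "w") as f:
    json.dump(resp.json(), f, indent=2)
```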