Machine Learning Project with DVC (Data Version Control)

This project uses the Iris dataset to demonstrate a basic ML pipeline with DVC for data and model versioning.

Requirements

DVC: To track data, model, and pipeline stages
Git: For version control
Scikit-learn: To train a model
Pandas and Joblib: To handle tabular data and to save and load the trained model

Set up Environment

Create and Activate a Virtual Environment

python3 -m venv venv
source venv/bin/activate

Install Dependencies

pip install -r requirements.txt

Initialize Git & DVC

git init

dvc init

Download the Dataset into the data Directory

mkdir data

wget -O data/iris.csv https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv

Add Iris dataset with DVC tracking

dvc add data/iris.csv
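
dvc add moves the real file out of Git's view and writes a small pointer file, data/iris.csv.dvc, that records a content hash of the data (MD5 by default). The snippet below is only a rough illustration of that idea, not DVC's implementation; the helper name file_md5 is made up for this example, and depending on the DVC version the value stored in the .dvc file may not equal a raw MD5 of the bytes for every file.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as fh:
        # Read fixed-size chunks so large datasets never sit fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    target = Path("data/iris.csv")
    if target.exists():
        # Any change to the file content produces a different digest.
        print(target, file_md5(target))
```

Because the hash depends only on content, re-adding an unchanged file leaves the pointer file untouched, while editing a single row changes it.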

Track the Changes with Git

git add data/iris.csv.dvc data/.gitignore

git commit -m "Add Iris dataset with DVC tracking"

Create DVC Pipeline

Data Preparation Step

dvc stage add -n prepare \
  -d src/prepare.py -d data/iris.csv \
  -o data/X_train.csv -o data/X_test.csv -o data/y_train.csv -o data/y_test.csv \
  python src/prepare.py
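
The prepare stage expects src/prepare.py to read data/iris.csv and write the four split files listed as outputs. The script itself is not shown here, so the following is a minimal sketch of what it might look like. The target column name species, the 80/20 split, and the fixed seed are assumptions, not taken from the project.

```python
import os

import pandas as pd
from sklearn.model_selection import train_test_split


def split_iris(df, target_col="species", test_size=0.2, seed=42):
    """Split a DataFrame into train/test features and labels."""
    X = df.drop(columns=[target_col])  # assumed label column name
    y = df[target_col]
    return train_test_split(X, y, test_size=test_size, random_state=seed)


def main(src="data/iris.csv", out_dir="data"):
    X_train, X_test, y_train, y_test = split_iris(pd.read_csv(src))
    os.makedirs(out_dir, exist_ok=True)
    # File names match the -o outputs declared for the prepare stage.
    X_train.to_csv(os.path.join(out_dir, "X_train.csv"), index=False)
    X_test.to_csv(os.path.join(out_dir, "X_test.csv"), index=False)
    y_train.to_csv(os.path.join(out_dir, "y_train.csv"), index=False)
    y_test.to_csv(os.path.join(out_dir, "y_test.csv"), index=False)


# Only run when the dataset is present, so the file can also be imported safely.
if __name__ == "__main__" and os.path.exists("data/iris.csv"):
    main()
```

A fixed seed keeps the split reproducible, which matters because DVC decides whether downstream stages are stale by comparing output hashes.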

Model Training Step

dvc stage add -n train \
  -d src/train.py -d data/X_train.csv -d data/y_train.csv \
  -o model/model.joblib \
  python src/train.py
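
The train stage depends on the two training CSVs and produces model/model.joblib. As a hedged sketch of what src/train.py could contain: the choice of LogisticRegression and max_iter=200 are assumptions made for illustration, and the actual project may use a different estimator.

```python
import os

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression


def train_model(X, y, max_iter=200):
    """Fit a simple classifier (assumed model type) on the training data."""
    model = LogisticRegression(max_iter=max_iter)
    model.fit(X, y)
    return model


def save_model(model, path="model/model.joblib"):
    """Persist the model with Joblib, creating the parent directory if needed."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    joblib.dump(model, path)
    return path


def main():
    X_train = pd.read_csv("data/X_train.csv")
    # y_train.csv holds a single column; take it as a Series.
    y_train = pd.read_csv("data/y_train.csv").iloc[:, 0]
    save_model(train_model(X_train, y_train))


# Only run when the prepare stage has produced its outputs.
if __name__ == "__main__" and os.path.exists("data/X_train.csv"):
    main()
```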

Evaluate Trained Model

dvc stage add -n evaluate \
  -d src/evaluate.py -d model/model.joblib -d data/X_test.csv -d data/y_test.csv \
  -M metrics.json \
  python src/evaluate.py
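
The evaluate stage declares metrics.json with -M, which tells DVC to treat it as a metrics file that stays in Git rather than in the DVC cache. A minimal sketch of a matching src/evaluate.py follows; the single accuracy metric is an assumption, since the real script's metrics are not listed here.

```python
import json
import os

import joblib
import pandas as pd
from sklearn.metrics import accuracy_score


def evaluate(model, X_test, y_test):
    """Return a flat dict of metrics for the given model and test set."""
    predictions = model.predict(X_test)
    return {"accuracy": float(accuracy_score(y_test, predictions))}


def write_metrics(metrics, path="metrics.json"):
    """Write metrics as JSON so `dvc metrics show` can read them."""
    with open(path, "w") as fh:
        json.dump(metrics, fh, indent=2)


def main():
    model = joblib.load("model/model.joblib")
    X_test = pd.read_csv("data/X_test.csv")
    y_test = pd.read_csv("data/y_test.csv").iloc[:, 0]
    write_metrics(evaluate(model, X_test, y_test))


# Only run when the train stage has produced a model.
if __name__ == "__main__" and os.path.exists("model/model.joblib"):
    main()
```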

Commit Changes

git add dvc.yaml

git commit -m "Add DVC pipeline stages for prepare, train, evaluate"

Run the Full Pipeline

dvc repro
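
dvc repro reruns a stage only when one of its dependencies has changed since the hashes in dvc.lock were recorded; on the first run there is no lock file, so every stage executes. The toy functions below illustrate that comparison only. They are not DVC code, and the names changed_deps and needs_rerun are invented for this example.

```python
def changed_deps(locked, current):
    """Return dependency paths whose current hash differs from the locked one.

    Both arguments map a dependency path to a content hash. A path missing
    from `locked` counts as changed, which mirrors a first run with no lock.
    """
    return sorted(path for path, digest in current.items()
                  if locked.get(path) != digest)


def needs_rerun(locked, current):
    """A stage is stale if any of its dependencies changed."""
    return bool(changed_deps(locked, current))


if __name__ == "__main__":
    locked = {"src/train.py": "aa11", "data/X_train.csv": "bb22"}
    current = {"src/train.py": "aa11", "data/X_train.csv": "cc33"}
    print(changed_deps(locked, current))  # the training data changed
```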

If the pipeline succeeds, stage the lock file:

git add dvc.lock

Visualize Metrics

dvc metrics show

Optionally, compare the current metrics with the last commit:

dvc metrics diff --targets metrics.json
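
dvc metrics diff reports, for each metric, the old value, the new value, and the change between them. The same comparison can be sketched in a few lines for two already-loaded metrics dictionaries; metrics_delta is a hypothetical helper for illustration, not part of DVC.

```python
def metrics_delta(old, new):
    """Return {metric: (old_value, new_value, change)} for shared metrics."""
    shared = sorted(old.keys() & new.keys())
    return {name: (old[name], new[name], new[name] - old[name])
            for name in shared}


if __name__ == "__main__":
    before = {"accuracy": 0.90}
    after = {"accuracy": 0.95}
    for name, (old, new, change) in metrics_delta(before, after).items():
        print(f"{name}: {old} -> {new} ({change:+.4f})")
```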

Use Makefile to Automate the entire DVC Pipeline

Install make

sudo apt install make -y

cd <directory with Makefile>

Run Any of the Tasks in the DVC Pipeline

make install     # installs dependencies
make run         # runs the DVC pipeline
make metrics     # shows metrics from metrics.json

Deploy the Trained Model w/ Gradio + Joblib

Install Gradio if it was not already installed from requirements.txt:

pip install gradio

python app/gradio_app.py

In your browser, open:

http://localhost:7860
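
For orientation, a Gradio app that serves the Joblib model could be structured roughly as below. This is a sketch under assumptions, not the contents of app/gradio_app.py: the feature names, labels, and title are illustrative, and the model path is taken from the train stage output.

```python
FEATURES = ["sepal_length", "sepal_width", "petal_length", "petal_width"]


def make_predict_fn(model):
    """Wrap a fitted model in a function that takes the four measurements."""
    def predict(sepal_length, sepal_width, petal_length, petal_width):
        row = [[sepal_length, sepal_width, petal_length, petal_width]]
        return str(model.predict(row)[0])
    return predict


def build_app(model_path="model/model.joblib"):
    # Heavy imports stay inside this function so make_predict_fn
    # remains usable without Gradio installed.
    import gradio as gr
    import joblib

    model = joblib.load(model_path)
    return gr.Interface(
        fn=make_predict_fn(model),
        inputs=[gr.Number(label=name) for name in FEATURES],
        outputs=gr.Textbox(label="Predicted species"),
        title="Iris classifier",
    )


# To serve the app on Gradio's default port (7860):
# build_app().launch()
```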

Simulate a Data Change

Take Backup & Modify the Original Dataset

cp data/iris.csv data/iris.csv.bak

Append two synthetic rows

echo "4.4,5.6,2.9,1.8,synthetic_class" >> data/iris.csv
echo "4.0,5.0,2.0,1.0,synthetic_class" >> data/iris.csv
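
The echo commands assume that iris.csv ends with a newline and has exactly five columns; if either assumption fails, the appended rows corrupt the file. A slightly more defensive way to append rows, sketched here with only the standard library (append_rows is a hypothetical helper, not part of the project):

```python
import csv
import os


def append_rows(path, rows):
    """Append rows to a CSV after checking they match the header width."""
    with open(path, newline="") as fh:
        header = next(csv.reader(fh))
    for row in rows:
        if len(row) != len(header):
            raise ValueError(f"expected {len(header)} fields, got {len(row)}")
    # Check whether the file already ends with a newline.
    with open(path, "rb") as fh:
        fh.seek(-1, os.SEEK_END)
        ends_with_newline = fh.read(1) == b"\n"
    with open(path, "a", newline="") as fh:
        if not ends_with_newline:
            fh.write("\n")
        csv.writer(fh, lineterminator="\n").writerows(rows)


if __name__ == "__main__" and os.path.exists("data/iris.csv"):
    append_rows("data/iris.csv", [
        ["4.4", "5.6", "2.9", "1.8", "synthetic_class"],
        ["4.0", "5.0", "2.0", "1.0", "synthetic_class"],
    ])
```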

Inspect the end of data/iris.csv to confirm the two new rows were appended

Track the Data Change with DVC

dvc add data/iris.csv

Commit changes

git add data/iris.csv.dvc
git commit -m "Modified iris dataset with synthetic sample"

Re-run Pipeline

dvc repro

Track Changes with Git

The model itself is a DVC output and is ignored by Git; commit the lock file and metrics instead.

git add dvc.lock metrics.json
git commit -m "Retrained model with updated data"

Run Experiments without committing

dvc exp run

List experiments

dvc exp show

Press q to exit the table view

Apply the Best Experiment & Commit It

dvc exp apply <exp_id>

git add .
git commit -m "Applied best model experiment"

Simulate a Rollback

Run a New Experiment & Apply the Best One

dvc exp run

dvc exp show

dvc exp apply <exp_id>
git add .
git commit -m "Applied better model from experiment"

Get Git Logs and Checkout Old Commit

git log --oneline

Copy the commit hash you want to roll back to:

git checkout <old_commit_hash>

Roll back all DVC-tracked files (data/model/metrics)

dvc checkout


