This project uses the Iris dataset to demonstrate a basic ML pipeline with DVC for data and model versioning.
- DVC: To track data, model, and pipeline stages
- Git: For version control
- Scikit-learn: To train a model
- Pandas and Joblib
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

git init
dvc init

mkdir data
wget -O data/iris.csv https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv

dvc add data/iris.csv
git add data/iris.csv.dvc
git commit -m "Add Iris dataset with DVC tracking"

dvc stage add -n prepare \
-d src/prepare.py -d data/iris.csv \
-o data/X_train.csv -o data/X_test.csv -o data/y_train.csv -o data/y_test.csv \
python src/prepare.py

dvc stage add -n train \
-d src/train.py -d data/X_train.csv -d data/y_train.csv \
-o model/model.joblib \
python src/train.py

dvc stage add -n evaluate \
-d src/evaluate.py -d model/model.joblib -d data/X_test.csv -d data/y_test.csv \
-M metrics.json \
python src/evaluate.py

git add dvc.yaml
git commit -m "Add DVC pipeline stages for prepare, train, evaluate"
dvc repro

If the pipeline succeeds, run:
git add dvc.lock
dvc metrics show

Optional:
dvc metrics diff --targets metrics.json

sudo apt install make -y
cd <directory with Makefile>
make install # installs dependencies
make run # runs the DVC pipeline
make metrics # shows metrics from metrics.json

pip install gradio # already installed earlier via requirements.txt
python app/gradio_app.py

In your browser, open:
http://localhost:7860

cp data/iris.csv data/iris.csv.bak

Append fake (synthetic) rows:
echo "4.4,5.6,2.9,1.8,synthetic_class" >> data/iris.csv
echo "4.0,5.0,2.0,1.0,synthetic_class" >> data/iris.csv

Check data/iris.csv to see the modification.
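
One way to check the modification is a short script that prints the last rows and counts rows per class. This is a minimal sketch using only the standard library. The helper names `last_rows` and `label_counts` are hypothetical, and the code assumes the first row of the CSV is a header and the last column holds the class label.

```python
import csv
from collections import Counter


def read_rows(path, has_header=True):
    """Read all non-empty rows from a CSV file, skipping the header if present."""
    with open(path, newline="") as f:
        rows = [row for row in csv.reader(f) if row]  # ignore blank lines
    return rows[1:] if has_header else rows


def last_rows(path, n=2, has_header=True):
    """Return the last n data rows, where appended samples should appear."""
    return read_rows(path, has_header)[-n:]


def label_counts(path, has_header=True):
    """Count rows per class, assuming the label is the last column."""
    return Counter(row[-1] for row in read_rows(path, has_header))
```

Calling `last_rows("data/iris.csv")` should return the two appended rows, and `label_counts("data/iris.csv")` should include a `synthetic_class` entry with a count of 2 if both `echo` commands ran once.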

dvc add data/iris.csv
git add data/iris.csv.dvc
git commit -m "Modified iris dataset with synthetic sample"

dvc repro
git add dvc.lock metrics.json # model/model.joblib is a DVC output, so it is tracked by DVC rather than Git
git commit -m "Retrained model with updated data"

dvc exp run
dvc exp show
Press q to exit.
dvc exp apply <exp_id>
git add .
git commit -m "Applied best model experiment"

dvc exp run
dvc exp show
dvc exp apply <exp_id>
git add .
git commit -m "Applied better model from experiment"

git log --oneline
git checkout <old_commit_hash>
dvc checkout
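
To confirm that `dvc checkout` restored the data version belonging to the old commit, the file's hash can be compared with the one recorded in `data/iris.csv.dvc`. The sketch below is only an illustration: it assumes the `.dvc` file stores the hash on an `md5:` line and that DVC hashed the raw file bytes (older DVC versions may normalise line endings in text files first, in which case the values can differ). The helper names are hypothetical, and `dvc status` remains the supported way to check this.

```python
import hashlib
import re


def recorded_md5(dvc_file):
    """Return the first 32-character hex value on an `md5:` line, or None."""
    with open(dvc_file) as f:
        match = re.search(r"^\s*-?\s*md5:\s*([0-9a-f]{32})\s*$", f.read(), re.MULTILINE)
    return match.group(1) if match else None


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 of a file's raw bytes, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_dvc_record(data_path, dvc_file):
    """True if the workspace file's hash equals the hash recorded in the .dvc file."""
    return file_md5(data_path) == recorded_md5(dvc_file)
```

For example, `matches_dvc_record("data/iris.csv", "data/iris.csv.dvc")` should be true once `dvc checkout` has finished, under the assumptions above.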