SpurBreast

This repository contains the SpurBreast dataset, a curated real-world breast MRI dataset specifically designed to study spurious correlations in medical imaging. SpurBreast extends the DUKE Breast Cancer dataset with carefully constructed training/validation splits that deliberately introduce or avoid spurious correlations. The dataset was proposed in the paper SpurBreast: A Curated Dataset for Investigating Spurious Correlations in Real-world Breast MRI Classification, published at the International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2025).

(Link to the paper)

Poster

Visual Abstract

General Information

Datasets can be downloaded using this link, and should be extracted into the /data folder.

In the /src folder, the file cls_dataset.py contains a bare-bones implementation of a PyTorch dataset that you can use directly.
This file provides the SpurBreastDataset class, which can be initialized as follows:

from cls_dataset import SpurBreastDataset

data_folder = '../data'              # folder the downloaded data was extracted into
spurious_feature = 'field_strength'  # one of the spurious features listed below
aug = ...  # PyTorch transformations

# Arguments: data folder, spurious feature, split name, transformations
tr_dataset = SpurBreastDataset(data_folder, spurious_feature, 'training', aug)
val_dataset = SpurBreastDataset(data_folder, spurious_feature, 'validation', aug)
ts_dataset = SpurBreastDataset(data_folder, spurious_feature, 'test', aug)

Using this dataset, you can train your own model by specifying one of the following spurious features:

features = ['field_strength', 'menopause', 'race_and_ethnicity', 'surgery_type', 'vertical_flip']
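To compare several spurious features, it can help to build the three splits for each feature in one place. The following is a minimal sketch, not part of the repository: `build_splits` and `StandInDataset` are hypothetical names, and the stand-in class only mirrors the four positional arguments shown above so that the example runs without the data. Pass `SpurBreastDataset` instead of the stand-in for real use.

```python
# Feature and split names as they appear in this README.
FEATURES = ['field_strength', 'menopause', 'race_and_ethnicity',
            'surgery_type', 'vertical_flip']
SPLITS = ['training', 'validation', 'test']


def build_splits(dataset_cls, data_folder, spurious_feature, aug=None):
    """Hypothetical helper: build one dataset object per split."""
    if spurious_feature not in FEATURES:
        raise ValueError(f"unknown spurious feature: {spurious_feature!r}")
    return {split: dataset_cls(data_folder, spurious_feature, split, aug)
            for split in SPLITS}


class StandInDataset:
    """Placeholder that only records its arguments (assumption: same
    positional order as SpurBreastDataset in the snippet above)."""

    def __init__(self, data_folder, spurious_feature, split, aug):
        self.args = (data_folder, spurious_feature, split, aug)


for feature in FEATURES:
    datasets = build_splits(StandInDataset, '../data', feature)
    print(feature, sorted(datasets))
```

Rejecting unknown feature names early avoids a typo surfacing later as a missing-file error.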

Of the three splits (training, validation, test), the training and validation splits contain spurious correlations, while the test split does not. This setup lets you evaluate your model both on data that shares the spurious correlations of the training set (validation) and on data free of those correlations (test).
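One simple way to use this setup is to report accuracy on both splits and look at the difference: a model that relies on the spurious feature would be expected to score higher on validation than on test. The sketch below is an illustration in plain Python, not the evaluation protocol of the paper. The helper names `accuracy` and `spurious_gap` are hypothetical, and the predictions and labels are toy values rather than results from the dataset.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty split")
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


def spurious_gap(val_preds, val_labels, test_preds, test_labels):
    """Return (validation accuracy, test accuracy, their difference)."""
    val_acc = accuracy(val_preds, val_labels)
    test_acc = accuracy(test_preds, test_labels)
    return val_acc, test_acc, val_acc - test_acc


# Toy values for illustration only.
val_acc, test_acc, gap = spurious_gap(
    val_preds=[1, 0, 1, 1], val_labels=[1, 0, 1, 1],
    test_preds=[1, 0, 0, 1], test_labels=[1, 1, 1, 1],
)
print(f"validation={val_acc:.2f} test={test_acc:.2f} gap={gap:.2f}")
```

A large positive gap is only a hint that the model may be exploiting the spurious feature; ordinary overfitting can also contribute to it.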

Our experiments show that field_strength and vertical_flip (vertical orientation) introduce the strongest spurious signals. Other features (menopause, race_and_ethnicity, surgery_type) have weaker or minimal effects.

We also provide several baseline splits, created with patient-based random sampling, that do not contain spurious correlations. To use them, call the same class with one of the following splits.

baseline_splits = ['baseline_high', 'baseline_mid', 'baseline_low']

License

Please be mindful that the DUKE dataset is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).

BY – Attribution: You must give credit to the original creator.
NC – NonCommercial: You can’t use it for commercial purposes.

As such, our extension, SpurBreast, is also licensed under the same terms.

Contact

This work is a research output from Ghent University, Belgium and Ghent University Global Campus, South Korea.

For inquiries, please create an issue or contact Utku Ozbulak: utku.ozbulak@ghent.ac.kr
