Migrate image datasets to Hugging Face / Kaggle to reduce repo size #15

@mrbeandev

Problem

The repository currently stores all reference images (gemini_black/, gemini_white/, gemini_random/) directly in git. As contributors add more images (e.g. gemini_black_nb_pro/, gemini_white_nb_pro/ with 150-200 images each), the repo size will grow significantly. Git isn't designed for large binary datasets, and every clone downloads the full history of these files.

Suggestion

Migrate image datasets to Hugging Face Datasets or Kaggle Datasets:

Option A: Hugging Face Hub (recommended)

  • Create a dataset repo at huggingface.co/datasets/aloshdenny/reverse-synthid-images
  • Organize by folder: gemini_black/, gemini_white/, gemini_black_nb_pro/, etc.
  • Contributors can upload via huggingface_hub CLI or the web UI
  • Easy to load in Python: from datasets import load_dataset
  • Free hosting, versioned, supports large files natively via LFS
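For reference, a contributor upload under Option A could look roughly like the sketch below. It assumes the dataset repo name proposed above (it may not exist yet), that the caller has write access, and that a Hugging Face token is available locally or via the HF_TOKEN environment variable. The helper names plan_uploads and upload_folders are made up for illustration.

```python
from pathlib import Path

# Proposed in this issue; the dataset repo may not exist yet.
REPO_ID = "aloshdenny/reverse-synthid-images"
IMAGE_FOLDERS = ["gemini_black", "gemini_white", "gemini_random"]


def plan_uploads(repo_root, folders=IMAGE_FOLDERS):
    """Return (local_path, path_in_repo) pairs for folders that exist locally."""
    root = Path(repo_root)
    return [(root / name, name) for name in folders if (root / name).is_dir()]


def upload_folders(repo_root, repo_id=REPO_ID):
    """Upload each image folder to the dataset repo, keeping folder names."""
    # Imported lazily so plan_uploads() works without huggingface_hub installed.
    from huggingface_hub import HfApi

    api = HfApi()  # picks up a saved token or the HF_TOKEN environment variable
    api.create_repo(repo_id, repo_type="dataset", exist_ok=True)
    for local_path, path_in_repo in plan_uploads(repo_root):
        api.upload_folder(
            folder_path=str(local_path),
            path_in_repo=path_in_repo,
            repo_id=repo_id,
            repo_type="dataset",
        )
```

Calling upload_folders(".") from the repo root would push each existing folder under the same name on the Hub. Contributors without write access would need to open a pull request on the dataset repo instead.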

Option B: Kaggle Datasets

  • Host at kaggle.com/datasets/aloshdenny/reverse-synthid-images
  • Contributors upload via Kaggle API
  • Good visibility in the ML community
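For Option B, the Kaggle CLI expects a dataset-metadata.json file next to the files being uploaded. A rough sketch of generating it is below. The title and the CC0-1.0 license are placeholders I picked for illustration, not decisions made in this issue, and write_kaggle_metadata is a hypothetical helper name.

```python
import json
from pathlib import Path


def write_kaggle_metadata(
    folder,
    owner="aloshdenny",              # from the URL proposed above
    slug="reverse-synthid-images",   # from the URL proposed above
    title="Reverse SynthID Images",  # placeholder title (assumption)
    license_name="CC0-1.0",          # placeholder license (assumption)
):
    """Write dataset-metadata.json into `folder` and return its path."""
    metadata = {
        "title": title,
        "id": f"{owner}/{slug}",
        "licenses": [{"name": license_name}],
    }
    path = Path(folder) / "dataset-metadata.json"
    path.write_text(json.dumps(metadata, indent=2))
    return path


# After writing the metadata, the upload itself goes through the Kaggle CLI:
#   kaggle datasets create -p <folder>                  (first upload)
#   kaggle datasets version -p <folder> -m "<message>"  (later updates)
```

Subfolders may need the CLI's --dir-mode option to be included, so it is worth checking the Kaggle API docs before relying on the folder layout.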

Repo changes needed

  1. Move existing images to the chosen platform
  2. Replace image folders with a download script (e.g. scripts/download_images.py)
  3. Add the dataset link to the README
  4. Update the contribution guide so contributors upload images there instead of opening PRs
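A minimal sketch of what scripts/download_images.py from step 2 could contain, assuming the Hugging Face option is chosen and the dataset keeps the current folder names. The repo id is the one proposed in this issue and may not exist yet. build_patterns and download_images are illustrative names.

```python
# Proposed in this issue; the dataset repo may not exist yet.
REPO_ID = "aloshdenny/reverse-synthid-images"
DEFAULT_FOLDERS = ["gemini_black", "gemini_white", "gemini_random"]


def build_patterns(folders):
    """Turn folder names into glob patterns for snapshot_download."""
    return [f"{name}/*" for name in folders]


def download_images(folders=DEFAULT_FOLDERS, dest="."):
    """Download only the requested folders into `dest`; return the local path."""
    # Imported lazily so the helper above works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",
        local_dir=dest,
        allow_patterns=build_patterns(folders),
    )
```

Running download_images() from the repo root would recreate the image folders in place, so existing code that reads gemini_black/ and friends should not need path changes. Passing a shorter folder list lets people fetch only the subset they need.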

Current repo size concern

The existing gemini_black/ (101 images), gemini_white/ (101 images), and gemini_random/ (88 images) already account for a significant share of the repo size. With the new nb_pro folders requesting 150-200 images each, plus future model variants, the repo could easily exceed 1-2 GB, making clones slow and CI expensive.
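To put actual numbers on this, something like the sketch below could be run from the repo root. It only counts image files in the working tree, so it understates the real cost: git history also keeps old versions of replaced files (git count-objects -vH reports the packed size). The suffix list and function names are assumptions for illustration.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to whatever the folders contain.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def folder_image_stats(folder):
    """Return (image_count, total_bytes) for image files under `folder`."""
    files = [
        p
        for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    ]
    return len(files), sum(p.stat().st_size for p in files)


def report(repo_root, folders):
    """Print one line per existing folder and return the total size in bytes."""
    total = 0
    for name in folders:
        folder = Path(repo_root) / name
        if not folder.is_dir():
            continue
        count, size = folder_image_stats(folder)
        total += size
        print(f"{name}/: {count} images, {size / 1e6:.1f} MB")
    return total
```

For example, report(".", ["gemini_black", "gemini_white", "gemini_random"]) would print a per-folder breakdown for the three folders currently in the repo.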

Benefits

  • Faster clones - code-only repo stays small
  • Better for contributors - uploading images to HF/Kaggle is simpler than large git PRs
  • Versioning - HF Hub tracks dataset versions properly
  • Discoverability - datasets on HF/Kaggle get more visibility from the ML research community
