
Harden COCO dataset bootstrap: HTTPS + SHA-256 verification #2572

Open
kiwaku wants to merge 2 commits into mlcommons:master from kiwaku:fix/https-dataset-downloads-and-checksums


@kiwaku

kiwaku commented Apr 9, 2026

Addresses the HTTPS + checksum-verification bullet of #2502.

Changes

  • text_to_image/tools/coco.py, coco_generate_calibration.py, coco_calibration.py: upgrade the hardcoded COCO 2014 annotations URL from http://images.cocodataset.org/... to https://s3.amazonaws.com/images.cocodataset.org/... and verify the downloaded archive against a pinned SHA-256 before it is unzipped.
  • coco.py and coco_calibration.py: per-image urllib.request.urlretrieve calls now rewrite http://images.cocodataset.org/ to the S3 origin so individual image downloads are also TLS-authenticated.
  • Added a small _verify_sha256 helper and a COCO_ANNOTATIONS_TRAINVAL2014_SHA256 module constant to each of the three files.

Why route through s3.amazonaws.com directly instead of fixing the CDN hostname?

images.cocodataset.org terminates HTTPS with an S3 certificate whose subject is s3.amazonaws.com, so a naive http:// to https:// swap fails certificate hostname validation:

curl: (60) SSL: no alternative certificate subject name matches target host name 'images.cocodataset.org'

The path-style S3 URL (https://s3.amazonaws.com/images.cocodataset.org/...) serves the same bytes under a valid certificate, so we use that as the replacement origin. This preserves TLS authentication without depending on cocodataset.org fixing their CDN.
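A minimal sketch of the prefix rewrite described above, for readers who do not want to open the diff. The function name `to_s3_origin` and the two constants are hypothetical and used only for illustration; the PR's actual code may be structured differently.

```python
# Hypothetical names for illustration; not copied from the PR diff.
HTTP_PREFIX = "http://images.cocodataset.org/"
S3_PREFIX = "https://s3.amazonaws.com/images.cocodataset.org/"


def to_s3_origin(url: str) -> str:
    """Rewrite a plain-HTTP COCO URL to the path-style S3 HTTPS URL."""
    if url.startswith(HTTP_PREFIX):
        # Keep the object path and swap only the origin.
        return S3_PREFIX + url[len(HTTP_PREFIX):]
    # Any other URL is returned unchanged.
    return url
```

Matching on the full prefix rather than doing a blanket string replace keeps the rewrite from touching unrelated URLs that might appear in the annotations JSON.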

Checksum pinning

COCO_ANNOTATIONS_TRAINVAL2014_SHA256 is the SHA-256 of the canonical 252,872,794-byte archive (computed from the S3 origin at submission time). Mismatch raises RuntimeError before any unzip/read, so a tampered or truncated download cannot poison downstream calibration. If the archive is re-published and the hash rotates, maintainers can re-pin by running curl -sL <URL> | shasum -a 256 and updating the three constants.
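A rough sketch of what a verification helper along these lines could look like. The signature, the chunk size, and the error message are assumptions; only the behavior (raise RuntimeError on mismatch, before the archive is unzipped) comes from the description above. The pinned digest itself is deliberately not reproduced here.

```python
import hashlib


def verify_sha256(path: str, expected_hex: str, chunk_size: int = 1 << 20) -> None:
    """Raise RuntimeError if the file's SHA-256 differs from expected_hex.

    Illustrative stand-in for the PR's _verify_sha256; the real signature may differ.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so a ~250 MB archive is never held in memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    actual = digest.hexdigest()
    if actual != expected_hex.lower():
        raise RuntimeError(
            f"SHA-256 mismatch for {path}: expected {expected_hex}, got {actual}"
        )
```

A caller would run this on the downloaded zip immediately after the download completes and only open the archive if no exception is raised.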

Out of scope

cc @tanvi-mlcommons (filed #2502), @arav-agarwal2 (driving the codebase refactor arc)

The text_to_image tooling downloaded COCO 2014 annotations over plain
HTTP with no integrity check, making benchmark calibration vulnerable
to a tampered or truncated archive. Upgrade the annotations URL to the
S3 origin (images.cocodataset.org's CDN has a broken HTTPS cert - the
S3 path-style URL serves the same bytes under a valid certificate) and
pin the expected SHA-256. Mismatch raises before unzip so a bad
download cannot poison downstream benchmark state.

Also rewrites the per-image URLs in the annotations JSON so
urllib.request.urlretrieve downloads are TLS-authenticated.

Addresses the HTTPS + checksum bullet of mlcommons#2502.

Signed-off-by: Kayra Arai Ozturk <kayraaraio@gmail.com>
@kiwaku kiwaku requested a review from a team as a code owner April 9, 2026 14:10
@github-actions
Contributor

github-actions Bot commented Apr 9, 2026

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@hanyunfan
Contributor

@manpreetssokhi Could you help review it?

@manpreetssokhi
Contributor

The links look okay to me in all 3 files. I don't think this change will cause any breakage, and it is okay to merge.

@kiwaku
Author

kiwaku commented Apr 15, 2026

recheck

@hanyunfan
Contributor

@kiwaku Thanks for the contribution. Just want to let you know MLPerf Inference is a members‑only working group. Have you signed the CLA? If you’re already a member and the CLA is in place, I can help coordinate with someone to get you the necessary support.

@kiwaku
Author

kiwaku commented Apr 15, 2026

@hanyunfan
Thank you, I’m not from a member organization.

I filled out the onboarding form, received the CLA by email, signed it, and sent it back ~6 days ago. The bot is still showing me as unsigned, so it's probably not linked to my GitHub account yet.

I’d appreciate any help.

@hanyunfan
Contributor

@pgmpablo157321 Could you help take a look at the CLA issue?
