Harden COCO dataset bootstrap: HTTPS + SHA-256 verification #2572
kiwaku wants to merge 2 commits
The text_to_image tooling downloaded COCO 2014 annotations over plain HTTP with no integrity check, making benchmark calibration vulnerable to a tampered or truncated archive.

Upgrade the annotations URL to the S3 origin (images.cocodataset.org's CDN has a broken HTTPS cert - the S3 path-style URL serves the same bytes under a valid certificate) and pin the expected SHA-256. Mismatch raises before unzip so a bad download cannot poison downstream benchmark state. Also rewrites the per-image URLs in the annotations JSON so urllib.request.urlretrieve downloads are TLS-authenticated.

Addresses the HTTPS + checksum bullet of mlcommons#2502.

Signed-off-by: Kayra Arai Ozturk <kayraaraio@gmail.com>
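
The per-image URL rewrite mentioned in the commit message could look roughly like the sketch below. It is not the code from this PR: the names `HTTP_PREFIX`, `S3_PREFIX`, and `to_s3_https` are hypothetical, and only the two URL prefixes come from the description.

```python
# Hypothetical constants; only the prefix strings are taken from the PR text.
HTTP_PREFIX = "http://images.cocodataset.org/"
S3_PREFIX = "https://s3.amazonaws.com/images.cocodataset.org/"


def to_s3_https(url: str) -> str:
    """Rewrite a plain-HTTP COCO URL to the path-style S3 HTTPS origin.

    URLs that do not start with the plain-HTTP COCO prefix are returned
    unchanged, so the rewrite is safe to apply to every URL in a file.
    """
    if url.startswith(HTTP_PREFIX):
        return S3_PREFIX + url[len(HTTP_PREFIX):]
    return url


if __name__ == "__main__":
    # Illustrative path only; the file name is made up for the demo.
    print(to_s3_https("http://images.cocodataset.org/val2014/example.jpg"))
```

Matching on the full prefix, rather than replacing `http://` everywhere, keeps the rewrite from touching URLs on other hosts.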

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

@manpreetssokhi Could you help review it?

The links look okay to me in all 3 files. I don't think this change will break anything, and it is okay to merge.

recheck

@kiwaku Thanks for the contribution. Just want to let you know MLPerf Inference is a members‑only working group. Have you signed the CLA? If you’re already a member and the CLA is in place, I can help coordinate with someone to get you the necessary support.

@hanyunfan I filled out the onboarding form, received the CLA by email, signed it, and sent it back ~6 days ago. The bot is still showing me as unsigned, so it's probably not linked to my GitHub account yet. I’d appreciate any help.

@pgmpablo157321 Could you help take a look at the CLA issue?

Addresses the HTTPS + checksum-verification bullet of #2502.

Changes

- `text_to_image/tools/coco.py`, `coco_generate_calibration.py`, `coco_calibration.py`: upgrade the hardcoded COCO 2014 annotations URL from `http://images.cocodataset.org/...` to `https://s3.amazonaws.com/images.cocodataset.org/...` and verify the downloaded archive against a pinned SHA-256 before it is unzipped.
- `coco.py` and `coco_calibration.py`: per-image `urllib.request.urlretrieve` calls now rewrite `http://images.cocodataset.org/` to the S3 origin so individual image downloads are also TLS-authenticated.
- Add a `_verify_sha256` helper and a `COCO_ANNOTATIONS_TRAINVAL2014_SHA256` module constant to each of the three files.

Why route through `s3.amazonaws.com` directly instead of fixing the CDN hostname?

`images.cocodataset.org` terminates HTTPS with an S3 certificate whose subject is `s3.amazonaws.com`, so a naive `http://` to `https://` swap fails certificate validation because the certificate does not match the requested hostname. The path-style S3 URL (`https://s3.amazonaws.com/images.cocodataset.org/...`) serves the same bytes under a valid certificate, so we use that as the replacement origin. This preserves TLS authentication without depending on cocodataset.org fixing their CDN.

Checksum pinning

`COCO_ANNOTATIONS_TRAINVAL2014_SHA256` is the SHA-256 of the canonical 252,872,794-byte archive (computed from the S3 origin at submission time). A mismatch raises `RuntimeError` before any unzip/read, so a tampered or truncated download cannot poison downstream calibration. If the archive is re-published and the hash changes, maintainers can re-pin by running `curl -sL <URL> | shasum -a 256` and updating the three constants.

Out of scope

- `os.system` calls in `coco_calibration.py`: those are tracked by the first bullet of #2502 ("Codebase refactor : Fix security and safety issues"), partially addressed by the merged #2489 ("Update coco.py to not utilize os.system").
- Anything outside `text_to_image/tools/`: did not scan the whole tree; happy to expand if maintainers want.

cc @tanvi-mlcommons (filed #2502), @arav-agarwal2 (driving the codebase refactor arc)
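
For readers who want a concrete picture of the checksum pinning described above, here is a minimal sketch of a verify-before-unzip helper. It is an illustration under assumptions, not the PR's actual `_verify_sha256`: the function name `verify_sha256`, its signature, and the chunk size are hypothetical, and the demo hashes a small temporary file rather than the real archive (whose pinned hash is not reproduced here).

```python
import hashlib
import os
import tempfile


def verify_sha256(path: str, expected_hex: str, chunk_size: int = 1 << 20) -> None:
    """Raise RuntimeError unless the file at `path` has the expected SHA-256.

    The file is hashed in chunks so a large archive is never loaded
    into memory all at once.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    actual = digest.hexdigest()
    if actual != expected_hex.lower():
        raise RuntimeError(
            f"SHA-256 mismatch for {path}: expected {expected_hex}, got {actual}"
        )


def demo() -> None:
    """Hash a throwaway file, then show that a wrong pin is rejected."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "archive.bin")
        with open(path, "wb") as f:
            f.write(b"example payload")
        pinned = hashlib.sha256(b"example payload").hexdigest()
        verify_sha256(path, pinned)  # passes silently
        try:
            verify_sha256(path, "0" * 64)  # deliberately wrong pin
        except RuntimeError as exc:
            print("rejected:", exc)


if __name__ == "__main__":
    demo()
```

In a download script, a call like this would sit between the download step and the unzip step, so extraction only happens once the hash has matched.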