ENH: Add ExternalDataUpload skill for local developer and AI agent testing content-link upload workflow #6111
|
@greptileai, please review so this can be taken out of draft mode. |
|
@thewtex, FYI: I can't get upload access to work for either of the recommended services. I am using this skill and its resources to help me configure the upload mechanisms, but I keep running into roadblocks on both Pinata and Filebase. |
|
Matt should take a look now. |
|
@hjmjohnson @dzenanz — addressed the Pinata issue in 0e9dd0b.
So contributors without a paid Pinata plan can now configure only |
|
FYI: I cannot get the pinning services to work. On both Pinata and Filebase I get API restrictions requiring a paid account. I was trying to mirror the recent ITKTestingData additions to these external services for redundancy.
┌─[johnsonhj@ENGR-ECE-M030] - [~/src/XXX/ITK_REMOTE_MODULES_STABLE] - [2026-05-01 07:55:29]
└─[0] find . -name "*.md5" cd /Users/johnsonhj/src/XXX/ITK_REMOTE_MODULES_STABLE
# Smoke (no --background to surface auth errors immediately):
ls ~/src/XXX/ITKTestingData/CID | head -5 | while read cid; do
ipfs pin remote add --service=itk-filebase --name="$cid" "$cid"
done
find: cd: unknown primary or operator
Error: reason: "FORBIDDEN", details: "The Pinning Service API requires a paid account": 403 Forbidden
Error: reason: "FORBIDDEN", details: "The Pinning Service API requires a paid account": 403 Forbidden
Error: reason: "FORBIDDEN", details: "The Pinning Service API requires a paid account": 403 Forbidden
Error: reason: "FORBIDDEN", details: "The Pinning Service API requires a paid account": 403 Forbidden
Error: reason: "FORBIDDEN", details: "The Pinning Service API requires a paid account": 403 Forbidden |
Expected to be large.
Adds Utilities/Maintenance/ExternalDataUpload/ with a Claude Code skill that uploads test data to IPFS under the UnixFS v1 2025 profile, pins on the redundant itk-pinata and itk-filebase remote services, optionally mirrors bytes into an ITKTestingData clone at CID/<cid> (with a 50 MB guard for GitHub's per-file push limit), maintains a new Testing/Data/content-links.manifest index, batch-pins every manifest CID, and normalizes existing .md5 / .sha256 / .cid links by fetching through the gateway templates parsed directly from CMake/ITKExternalData.cmake and re-uploading under the current UnixFS profile. Documents the one-time Kubo + IPFS Desktop setup and references the skill from Testing/Data/README.md.
Add `--background` to both `ipfs-upload.sh` and `content-link-normalize.sh` to submit remote pin requests asynchronously via `ipfs pin remote add --background`. The default remains synchronous (surfaces failures immediately, safer for one-off uploads); `--background` is intended for batch runs where waiting for each remote to reach `pinned` (minutes per file on Filebase) would be impractical.
Also deduplicate remote-pin submission: before calling `ipfs pin remote add`, query `ipfs pin remote ls --status=queued,pinning,pinned` for the CID and skip the add if a pin already exists on that service. This avoids Pinata's `DUPLICATE_OBJECT` (400) error on re-runs of previously uploaded content, and prevents Filebase from accumulating duplicate queue entries.
README.md and SKILL.md document the new flag, the synchronous vs. asynchronous tradeoff, and the post-run verification command (`ipfs pin remote ls --status=...`).
Convert the 24 `.md5` content links in
Modules/Filtering/AnisotropicDiffusionLBR/test/{Input,Baseline}/ to
`.cid` links under the UnixFS v1 2025 profile, produced by
`Utilities/Maintenance/ExternalDataUpload/content-link-normalize.sh
--hash-only --background`. Bytes were fetched through the gateway
templates in CMake/ITKExternalData.cmake, verified against each
declared MD5 hash, and re-uploaded; all new CIDs are pinned locally
plus on `itk-pinata` and `itk-filebase`.
Record the 24 new CIDs in Testing/Data/content-links.manifest along
with two additional entries picked up as a `--cid-only` sampling run
(CurvatureAnisotropicDiffusionImageFilter.2.png and warp3D.nii.gz),
both of which re-hashed to identical CIDs — confirming existing `.cid`
links in the tree are already compatible with the 2025 profile.
No test semantics change: `CMake/ITKExternalData.cmake` resolves
`DATA{...}` references by whichever `.md5` / `.sha256` / `.cid` link
sits next to the referenced path, so the filter tests continue to
fetch the same bytes.
In content-link-normalize.sh, the prerequisite warning pre-check was
iterating every sha variant (sha1/224/256/384/512) and requiring GNU
coreutils `*sum` binaries. Two issues:
1. ITK content links in practice are only .md5 (legacy) and .sha512
(current), so warning about missing sha224/sha384 tools was noise.
Narrow the pre-check to md5 and sha512.
2. macOS ships BSD `md5` and `shasum`, not coreutils `md5sum` /
`sha512sum`. Warning on their absence was a false positive for
macOS contributors, and the verification path invoked them by
name ("$tool" "$file") so it would actually fail.
Replace `hash_tool_for_ext` (name-only) with `hash_cmd_for_ext` that
returns a full command line — preferring GNU `md5sum` / `shaNsum`
when present, falling back to `md5 -r` (BSD md5 with md5sum-compatible
output) and `shasum -a NNN` (BSD shasum). `verify_bytes` uses
intentional word-splitting so the multi-word fallback
(e.g. "shasum -a 256") expands to distinct argv entries.
Addresses review at
https://github.com/InsightSoftwareConsortium/ITK/pull/6111/files#r3132434963
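The GNU-first, BSD-fallback selection described in this commit can be sketched as follows. This is a hypothetical Python rendering for illustration only: the helper in the commit is a bash function, and the `which` parameter is an assumption added here so the lookup can be exercised without touching the real `PATH`.

```python
import shutil
from typing import Callable, List, Optional


def hash_cmd_for_ext(
    ext: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[List[str]]:
    """Return an argv list for hashing a file, or None if no tool is found.

    Prefers GNU coreutils (`md5sum`, `shaNsum`) and falls back to the BSD
    tools shipped with macOS (`md5 -r`, `shasum -a N`).
    """
    ext = ext.lstrip(".").lower()
    if ext == "md5":
        if which("md5sum"):
            return ["md5sum"]
        if which("md5"):
            # `-r` makes BSD md5 print md5sum-compatible output.
            return ["md5", "-r"]
        return None
    if ext.startswith("sha") and ext[3:].isdigit():
        bits = ext[3:]
        if which(f"sha{bits}sum"):
            return [f"sha{bits}sum"]
        if which("shasum"):
            # Returned as separate argv entries, so no word-splitting is needed.
            return ["shasum", "-a", bits]
    return None
```

Returning a list rather than a single string sidesteps the intentional word-splitting the bash version relies on: the result can be passed straight to `subprocess.run(cmd + [str(path)])`.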
Rewrite Documentation/docs/contributing/upload_binary_data.md and data.md to describe the new Kubo + pinning-service workflow driven by Utilities/Maintenance/ExternalDataUpload/ipfs-upload.sh, replacing the obsolete web3.storage / w3cli and content-link-upload.itk.org instructions. Document the one-time Kubo + itk-pinata / itk-filebase setup, the upload script's behavior (CIDv1 under the UnixFS v1 2025 profile, synchronous vs. --background pinning, manifest update), the optional --testing-data-repo mirror step with the 50 MB GitHub limit, and the content-link-normalize.sh conversion workflow for legacy .md5 / .sha256 / .sha512 links. Refresh the storage-location list and testing-data figure caption to match the gateways enumerated in CMake/ITKExternalData.cmake, and remove the now-orphaned content-link-upload.png screenshot of the retired web app.
Pinata's `pin remote add` endpoint (the IPFS Pinning Service API) is gated to paid plans: the free plan rejects pin-by-CID with PAID_FEATURE_ONLY (HTTP 403), as reported by @hjmjohnson while exercising the new ExternalDataUpload skill. Filebase's free tier still accepts PSA pin-by-CID, so it remains the baseline pin provider for contributors who don't have a paid Pinata account.
ipfs-upload.sh now splits its remote-pinning configuration into a required list (`itk-filebase`) and an optional list (`itk-pinata`): the script aborts if Filebase isn't registered, but logs an informational notice and continues if Pinata isn't. The remote-pin loop walks the merged ACTIVE_SERVICES list so Pinata is still pinned to whenever it is configured. The reorder also surfaces Filebase first in every user-facing list (storage locations, log lines, manifest-skipped warnings, README setup section, contributor docs) to match the new "required first, optional second" hierarchy.
Documentation in README.md, SKILL.md, Documentation/docs/contributing/upload_binary_data.md, and Documentation/docs/contributing/data.md is updated to reorder Filebase ahead of Pinata, mark Pinata as optional, and explain the paid-plan restriction. README.md gains a troubleshooting entry for the PAID_FEATURE_ONLY error pointing at `ipfs pin remote service rm itk-pinata` as the cleanest fix when no paid plan is available.
Agent-Session-Id: 40f8eba4-dc94-4d4f-94bd-ff3d2fccf04f
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Drops the local Kubo / IPFS-Desktop daemon, the `ipfs config profile
apply unixfs-v1-2025` setup step, the `ipfs pin remote service add`
PSA registrations (`itk-filebase`, `itk-pinata`), and the bash upload
trio (`ipfs-upload.sh`, `content-link-normalize.sh`, `ipfs-pin-all.sh`)
that drove them. The new contributor flow is pure Python on top of a
small pixi environment:
1. `npx ipfs-car pack <file> --no-wrap` builds a CARv1 locally.
ipfs-car v1+ defaults (1 MiB chunks, 1024 children, raw leaves,
CIDv1) match the unixfs-v1-2025 / IPIP-0499 profile, so no extra
flags are needed to produce a reproducible CID.
2. `boto3` PUTs the CAR to a Filebase IPFS bucket through Filebase's
S3-compatible REST API with `x-amz-meta-import: car`. Filebase
imports the CAR server-side and exposes the resulting CID via
`head_object` metadata.
3. The local CID and the CID Filebase reports are compared, and on
success the file is replaced with `<file>.cid`, the manifest at
`Testing/Data/content-links.manifest` is updated, and the optional
`--testing-data-repo` mirror step still copies the bytes into a
local ITKTestingData clone (subject to the same 50 MB GitHub push
limit as before).
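Step 2 above can be sketched as below. This is a minimal illustration, not the code in `upload.py`: the function name `put_car`, the choice of the CAR file name as the object key, and the `https://s3.filebase.com` endpoint in the trailing comment are assumptions. The client is passed in so the function can be exercised with a stub.

```python
from pathlib import Path
from typing import Any, Dict


def put_car(s3_client: Any, bucket: str, car_path: Path) -> Dict[str, Any]:
    """PUT a CAR file and ask Filebase to import it server-side."""
    with car_path.open("rb") as body:
        return s3_client.put_object(
            Bucket=bucket,
            Key=car_path.name,  # assumption: key mirrors the CAR file name
            Body=body,
            # boto3 sends each Metadata key as an `x-amz-meta-<key>` header,
            # so this becomes `x-amz-meta-import: car`.
            Metadata={"import": "car"},
        )


# Assumed client setup, using the env-var contract from this commit:
#
#   import os, boto3
#   client = boto3.client(
#       "s3",
#       endpoint_url="https://s3.filebase.com",  # assumed Filebase S3 endpoint
#       aws_access_key_id=os.environ["FILEBASE_ACCESS_KEY"],
#       aws_secret_access_key=os.environ["FILEBASE_SECRET_KEY"],
#   )
#   response = put_car(client, os.environ["FILEBASE_BUCKET"], Path("file.car"))
```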
Concretely:
- Add `boto3`, `nodejs`, and `requests` to a new
`[tool.pixi.feature.external-data-upload]` feature plus an
`external-data-upload` environment in `pyproject.toml`. Run
`pixi install -e external-data-upload` once, then
`pixi run -e external-data-upload python ...` for every upload.
- New `Utilities/Maintenance/ExternalDataUpload/upload.py` is the
single-file uploader: input validation (in-repo, no whitespace, not
already a content link), CAR build, boto3 put_object with the
`import: car` metadata header, head_object CID round-trip, manifest
update, optional ITKTestingData mirror, and the same `git rm` /
`git add` instructions as before.
- New `Utilities/Maintenance/ExternalDataUpload/normalize.py` parses
`ExternalData_URL_TEMPLATES` from `CMake/ITKExternalData.cmake` with a
paren-aware scanner (the `%(hash)` / `%(algo)` substrings break naive
`re.DOTALL` lazy matching), fetches each `.md5` / `.shaNNN` / `.cid`
link via the gateway templates, verifies the bytes
algorithmically (or via the `/ipfs/` server-side guarantee for
CID links), and re-uploads through `upload.upload_file_to_filebase`.
- `Utilities/Maintenance/ExternalDataUpload/README.md` is rewritten end
to end: pixi setup, Filebase S3-key creation, `FILEBASE_ACCESS_KEY` /
`FILEBASE_SECRET_KEY` / `FILEBASE_BUCKET` env-var contract, new
troubleshooting section (missing npx, missing credentials, Filebase
did not return a CID, CID mismatch).
- `Utilities/Maintenance/ExternalDataUpload/SKILL.md` updated to
describe the same flow for the AI agent: pixi env + Filebase
credentials prerequisites; no Kubo, no PSA service registration.
- `Documentation/docs/contributing/upload_binary_data.md` and
`Documentation/docs/contributing/data.md` rewrite the
one-time-setup, upload-a-file, mirror, and normalize sections
around the pixi + Filebase workflow. The storage-locations list and
testing-data-figure caption are reworded so Filebase appears as the
upload destination and Kubo / Pinata only show up as build-time read
paths (gateways, not pinning targets).
- `Testing/Data/content-links.manifest` header rewritten to credit
`upload.py` as the maintainer (previously named
`ipfs-upload.sh`).
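The paren-aware scan mentioned for `normalize.py` might look roughly like the sketch below. It assumes the templates are appended with `list(APPEND ExternalData_URL_TEMPLATES ...)` and that entries are plain double-quoted strings without escaped quotes. The function name and the sample text in the usage note are hypothetical, and the real parser may differ.

```python
from typing import List


def extract_url_templates(
    cmake_text: str, var: str = "ExternalData_URL_TEMPLATES"
) -> List[str]:
    """Collect quoted strings from every `list(APPEND <var> ...)` call."""
    templates: List[str] = []
    marker = "list(APPEND " + var
    pos = cmake_text.find(marker)
    while pos != -1:
        i = pos + len(marker)
        depth = 1  # already inside the opening `list(` parenthesis
        in_quotes = False
        current: List[str] = []
        while i < len(cmake_text) and depth > 0:
            ch = cmake_text[i]
            if in_quotes:
                if ch == '"':
                    in_quotes = False
                    templates.append("".join(current))
                    current = []
                else:
                    # Parens inside quotes, e.g. `%(hash)`, are literal text.
                    current.append(ch)
            elif ch == '"':
                in_quotes = True
            elif ch == "#":
                # Skip a CMake line comment up to the end of the line.
                newline = cmake_text.find("\n", i)
                i = len(cmake_text) if newline == -1 else newline
                continue
            elif ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            i += 1
        pos = cmake_text.find(marker, i)
    return templates
```

For example, given a block containing `"http://127.0.0.1:8080/ipfs/%(hash)"` and `"https://example.org/%(algo)/%(hash)"` (made-up entries), the scan returns both strings intact, because the closing parenthesis of `%(hash)` is consumed inside the quoted string and never ends the `list(...)` call early.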
The Filebase free tier supports the S3 import-as-CAR path used here,
so the workflow needs no paid subscription, which addresses the original
Pinata `PAID_FEATURE_ONLY` blocker reported by @hjmjohnson. CI
runners can use the same env-var contract via GitHub Actions secrets.
We upload data to Filebase storage. Add their gateway.
The data.kitware.com templates are only there for older external and remote modules that might use MD5 or SHA512 hashes.
normalize.py is meant to convert existing .cid / .md5 / .shaNNN content
links, not to upload raw data files for the first time. The previous
single-file path in `enumerate_links` accepted any regular file and
passed it through to the main loop, where `link.read_text()` decoded
the contents as UTF-8 to extract the embedded hash/CID — so passing a
binary file (e.g. `normalize.py ./cthead1.png`) crashed with
`UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89`.
Validate the extension up front and emit a clear error pointing the
user at `upload.py` for raw uploads:
ERROR: cthead1.png is not a content link
(extension must be one of: cid, md5, sha1, sha224, sha256,
sha384, sha512).
normalize.py converts existing .md5/.shaNNN/.cid links; to
upload a raw file for the first time, use upload.py:
pixi run -e external-data-upload python \
Utilities/Maintenance/ExternalDataUpload/upload.py \
cthead1.png
Directory inputs are already filtered by `CONTENT_LINK_EXTS` via the
rglob branch, so this only changes behavior for the single-file
positional argument.
Agent-Session-Id: 40f8eba4-dc94-4d4f-94bd-ff3d2fccf04f
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
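The up-front extension check described in the normalize.py commit above could be sketched like this. The extension list is taken from the error message quoted in that commit; the function name `require_content_link` and the use of `ValueError` are assumptions made for illustration.

```python
from pathlib import Path

# Extensions listed in the commit's error message.
CONTENT_LINK_EXTS = ("cid", "md5", "sha1", "sha224", "sha256", "sha384", "sha512")


def require_content_link(path: Path) -> Path:
    """Reject anything that is not a content link before reading it as text."""
    ext = path.suffix.lstrip(".").lower()
    if ext not in CONTENT_LINK_EXTS:
        raise ValueError(
            f"{path.name} is not a content link "
            f"(extension must be one of: {', '.join(CONTENT_LINK_EXTS)}); "
            "to upload a raw file for the first time, use upload.py"
        )
    return path
```

Checking the suffix before `read_text()` means a binary file such as a PNG is rejected with a readable message instead of failing later with `UnicodeDecodeError`.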
Filebase echoes the imported root CID via the `x-amz-meta-cid` response
header on the synchronous `put_object` call for sub-5 GB CARs, not via
the `head_object` user-metadata dict the original code consulted. boto3's
`put_object` response model does not declare a `Metadata` field, so
user-metadata headers stay in `ResponseMetadata.HTTPHeaders` rather than
being promoted into a dict — leading to a spurious
`RuntimeError: Filebase did not return a CID for <file>.car` even when
the import had completed successfully.
Introduce `_cid_from_response()` that checks both `Metadata["cid"]` /
`Metadata["CID"]` and the case-insensitive `x-amz-meta-cid` header in
`ResponseMetadata.HTTPHeaders`. `upload_car_to_filebase()` now tries
the `put_object` response first (where Filebase actually puts the CID
synchronously), then falls back to `head_object` for older Filebase
configurations or boto3 client setups that strip the user-metadata
header off the PUT response. On double-miss, both responses'
`Metadata` and `HTTPHeaders` are dumped to stderr so the caller can
diagnose whether the issue is a header-name drift, async-import
delay, or a non-IPFS bucket misconfiguration before the
`RuntimeError("Filebase did not return a CID for ...")` is raised.
Tested manually against an `ipfs.filebase.io` bucket — the previously
failing `upload.py ./cthead1.png` invocation now completes and writes
`./cthead1.png.cid` with the round-trip-verified CID.
Agent-Session-Id: 40f8eba4-dc94-4d4f-94bd-ff3d2fccf04f
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
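A minimal sketch of the lookup order described in the commit above: user metadata first, then the case-insensitive `x-amz-meta-cid` header, with `head_object` consulted only when the PUT response carries no CID. The names `cid_from_response` and `cid_with_fallback` are illustrative, and the response shapes are the generic boto3 dict layout rather than a captured Filebase response.

```python
from typing import Any, Callable, Mapping, Optional


def cid_from_response(response: Mapping[str, Any]) -> Optional[str]:
    """Find a CID in a boto3 response dict, or return None."""
    metadata = response.get("Metadata") or {}
    for key in ("cid", "CID"):
        if metadata.get(key):
            return metadata[key]
    headers = (response.get("ResponseMetadata") or {}).get("HTTPHeaders") or {}
    for name, value in headers.items():
        # Header names may arrive in any case, so compare case-insensitively.
        if name.lower() == "x-amz-meta-cid" and value:
            return value
    return None


def cid_with_fallback(
    put_response: Mapping[str, Any],
    head_object: Callable[[], Mapping[str, Any]],
) -> Optional[str]:
    """Try the PUT response first; call head_object only if that misses."""
    return cid_from_response(put_response) or cid_from_response(head_object())
```

Passing `head_object` as a callable keeps the extra request lazy: it is issued only when the synchronous response did not already report the CID.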
sync.py walks <ITKTestingData>/CID/ and, for each file whose CID (the filename) is absent from Testing/Data/content-links.manifest yet referenced by a .cid content link in the ITK source tree, uploads the bytes to Filebase and appends a manifest entry. Repair tool for orphaned mirror entries; .cid links in the source tree are not modified.
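The selection rule for sync.py can be sketched as a simple filter. This is an illustration under assumptions: the function name is hypothetical, and the manifest CIDs and the CIDs referenced by `.cid` links are assumed to have been collected into sets beforehand.

```python
from pathlib import Path
from typing import List, Set


def cids_to_sync(
    mirror_cid_dir: Path, manifest_cids: Set[str], linked_cids: Set[str]
) -> List[str]:
    """Return mirror CIDs that are linked in the tree but missing from the manifest."""
    return sorted(
        entry.name
        for entry in mirror_cid_dir.iterdir()
        if entry.is_file()
        and entry.name not in manifest_cids  # not yet recorded
        and entry.name in linked_cids  # actually referenced by a .cid link
    )
```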
Add a whitespace guard in update_manifest() since the manifest format uses a single space as the field delimiter and cannot represent paths with whitespace. Six entries sourced from files with space-embedded names in ITKTestingData are excluded. Also sync new content-link entries from ITKTestingData mirror via sync.py, covering datasets for Registration/Montage, MetaIO, and various Filtering modules.
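A whitespace guard of the kind described above might look like the following sketch. The `<cid> <path>` field order and the function name are assumptions; only the single-space delimiter comes from the commit message.

```python
def manifest_line(cid: str, rel_path: str) -> str:
    """Format one manifest entry, refusing paths the format cannot represent."""
    if any(ch.isspace() for ch in rel_path):
        raise ValueError(
            f"cannot record {rel_path!r}: the manifest uses a single space "
            "as the field delimiter"
        )
    # Assumed field order: CID first, then the repository-relative path.
    return f"{cid} {rel_path}"
```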
… tests Sync additional content-link entries discovered by sync.py for Colormap, Convolution, and LabelMap filter test data.
Additional manifest entries discovered by the running sync.py process.
Via:
pixi run -e external-data-upload python \
Utilities/Maintenance/ExternalDataUpload/sync.py \
--testing-data-repo ~/data/ITKTestingData
In the web interface.
Added for the external-data-upload environment.
Path.rename (-> os.rename) refused to move the fetched bytes from /tmp (often tmpfs) into the source tree with EXDEV "Invalid cross-device link". Use shutil.move, which copy-then-unlinks across filesystems.
Path.relative_to(REPO_ROOT) raised ValueError when the user passed a relative target (e.g. "Modules") because the walked link paths were relative but REPO_ROOT is absolute. Call .resolve() first.
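Both fixes can be illustrated with a short sketch. The helper names are hypothetical and the code is not taken from normalize.py; it only shows the two standard-library calls the commit switches to.

```python
import shutil
from pathlib import Path


def install_fetched_file(tmp_file: Path, dest: Path) -> Path:
    """Move a downloaded file into place, even across filesystems."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    # shutil.move falls back to copy-then-delete when a plain rename
    # fails because source and destination are on different devices.
    shutil.move(str(tmp_file), str(dest))
    return dest


def repo_relative(path: Path, repo_root: Path) -> Path:
    """Express a possibly relative path relative to the repository root."""
    # Resolving both sides first avoids the ValueError raised when a
    # relative path is compared against an absolute root.
    return path.resolve().relative_to(repo_root.resolve())
```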
Generated after: pixi run -e external-data-upload python Utilities/Maintenance/ExternalDataUpload/normalize.py Modules
|
@hjmjohnson I rebased, added all new content links from ITKTestingData to Filebase; we should be good now. |
Testing/Data/README.md still referenced the removed ipfs-upload.sh and the multi-service pinning flow. Replace with the upload.py / Filebase invocation so a contributor following the doc actually has a runnable command.
In normalize.py, mirror_to_testing_data re-raised CalledProcessError on git-add failure, aborting the whole batch and leaving the in-flight link in an inconsistent state. Wrap the call in try/except: the Filebase upload already pinned the bytes, so keep writing the new .cid and manifest entry and emit a WARN. sync.py can re-apply the mirror later.
Addresses two greptile P1 review comments on PR 6111.
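The warn-and-continue handling for the mirror step can be sketched as below. The wrapper name and its signature are assumptions; the real `mirror_to_testing_data` call is represented by a zero-argument callable so the pattern can be shown in isolation.

```python
import subprocess
import sys
from typing import Callable


def mirror_or_warn(mirror: Callable[[], None], link_name: str) -> bool:
    """Run the mirror step; on a failed subprocess, warn and keep going."""
    try:
        mirror()
    except subprocess.CalledProcessError as err:
        # The upload already succeeded, so a failed mirror should not
        # abort the batch; sync.py can re-apply the mirror later.
        print(
            f"WARN: mirroring {link_name} failed ({err}); continuing",
            file=sys.stderr,
        )
        return False
    return True
```

The caller can then write the new `.cid` link and manifest entry regardless of the return value, and use it only to report which mirrors need a later sync.py pass.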
Adds Utilities/Maintenance/ExternalDataUpload/ with a Claude Code skill that uploads test data to IPFS under the UnixFS v1 2025 profile, pins on the Filebase remote service, optionally mirrors bytes into an ITKTestingData clone at CID/<cid> (with a 50 MB guard for GitHub's per-file push limit), maintains a new Testing/Data/content-links.manifest index, batch-pins every manifest CID, and normalizes existing .md5 / .sha256 / .cid links by fetching through the gateway templates parsed directly from CMake/ITKExternalData.cmake and re-uploading under the current UnixFS profile. Documents the one-time Kubo + IPFS Desktop setup and references the skill from Testing/Data/README.md.
WIP Todos: