Reduce worker startup latency: fast env-activation entrypoint, lazy ML imports, slimmer bases #148
…lim bases

NimbusImage runs a fresh container per job. `conda run -n worker` in the ENTRYPOINT spawned a second Python process, adding ~0.8-1.0s to every invocation, and heavy ML imports were paid even by the lightweight `interface` request. This cuts both.

Startup mechanism:
- Add workers/base_docker_images/run_worker.sh: activates the worker conda env in-process (PATH + activate.d hooks, preserving GDAL_DATA/PROJ_DATA/GDAL_DRIVER_PATH) and execs the env python, replacing `conda run -n worker`. Auto-detects miniforge (amd64) / miniconda (arm64).
- Bake it into worker-base and image-processing-base as the ENTRYPOINT; the 19 shared-base workers inherit it (per-worker `conda run` removed). Add worker_client to worker-base for parity.
- Convert the remaining 28 production Dockerfiles (13 GPU workers + deconwolf + ai_analysis + blob_random_forest_classifier, incl. _M1 variants) to run_worker.sh.

Measured (arm64, full startup via --help, median of 5): blob_intensity 1.21->0.41s, crop 1.22->0.44s, blob_metrics ~1.2->0.40s, registration ->0.51s.

CPU worker tests pass (blob_metrics 9, registration 20, crop 14, blob_intensity 9, line_scan 7, blob_random_forest 4). GDAL/PROJ env vars verified identical to conda-run.

Import deferral:
- Move heavy ML imports (torch/tensorflow/cellpose/deeptile/stardist/sam2/segment_anything/deepcell/piscis) from module load into compute()/helpers across all GPU workers, so the interface request and startup no longer pay them. Also defer condensatenet's model package and sam_fewshot's torchvision NMS patch.
- annotation_utilities/annotation_tools.py: import matplotlib lazily (~50ms off nearly every worker).

Refactors / cleanup:
- laplacian_of_gaussian, line_scan_worker, blob_colony_two_color_intensity_worker build FROM worker-base instead of replicating the base inline (drops stale Kitware/UPennContrast clone + hardcoded x86_64 miniforge); deleted their obsolete Dockerfile_M1, pinned compose to literal Dockerfile, fixed legacy build script.
- line_scan_worker __main__ aligned to the canonical match-request dispatch.
- Drop unused r-base from both base images; add `conda clean` (worker-base 5.18->4.74GB, image-processing-base 9.65->9.09GB).
- Normalize piscis predict/train entrypoints from CRLF to LF.

GPU worker changes are static-validated (py_compile + usage analysis) only; they require amd64 build-host validation before deploy (no local CUDA/torch).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015xrXEVpvb4c4ScEVjN1VsM
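The import-deferral pattern this commit describes can be sketched as follows. This is a minimal illustration under stated assumptions, not the workers' actual code: `heavy_ml_stub` is a hypothetical stand-in module written to a temporary directory so the example runs without torch or CUDA, and the bodies of `interface()` and `compute()` are invented for the demonstration.

```python
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for a heavy ML library (torch, tensorflow, ...).
_stub_dir = tempfile.mkdtemp()
Path(_stub_dir, "heavy_ml_stub.py").write_text("def predict(x):\n    return x * 2\n")
sys.path.insert(0, _stub_dir)


def interface():
    """Lightweight request: builds a UI description, needs no ML library."""
    return {"Model": {"type": "select", "items": ["default"]}}


def compute(value):
    """Heavy request: the import cost is paid here, on first use only."""
    import heavy_ml_stub  # deferred: not loaded at module import time

    return heavy_ml_stub.predict(value)


ui = interface()
loaded_after_interface = "heavy_ml_stub" in sys.modules
result = compute(21)
loaded_after_compute = "heavy_ml_stub" in sys.modules
```

Checking `sys.modules` before and after each call is one simple way to confirm that the lightweight path never triggers the heavy import.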
Extend TODO-002 with the 2026-06-27 continued log: task (a) rollout of run_worker.sh to all 28 remaining production GPU/self-contained Dockerfiles (+ sam_automatic_mask_generator lazy imports), and task (b) image-size wins (#5 r-base removal, #7 conda clean; bases 5.18->4.74GB / 9.65->9.09GB). Updates the status section and remaining-work list (#6 multi-stage prune and GPU build-host validation still open).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015xrXEVpvb4c4ScEVjN1VsM
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015xrXEVpvb4c4ScEVjN1VsM
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5cd5e96cd1
defaultToolName="Spots"

# Fast env activation in place of `conda run` (see todo/worker-startup-latency.md)
COPY ./workers/base_docker_images/run_worker.sh /usr/local/bin/run_worker.sh
Keep run_worker inside the Piscis predict context
The Piscis compose file builds this Dockerfile with `context: .` from workers/annotations/piscis and `dockerfile: ./predict/Dockerfile`, and Docker resolves COPY sources relative to the build context (Dockerfile docs; Compose build docs). Under that context, ./workers/base_docker_images/run_worker.sh lies outside the build context, so `docker compose build predict` fails before the predict image is built. The rest of this Dockerfile still uses Piscis-local paths like ./environment.yml, confirming it is meant for the nested context.
defaultToolName="Piscis training"

# Fast env activation in place of `conda run` (see todo/worker-startup-latency.md)
COPY ./workers/base_docker_images/run_worker.sh /usr/local/bin/run_worker.sh
Keep run_worker inside the Piscis train context
The Piscis compose file builds this Dockerfile with `context: .` from workers/annotations/piscis and `dockerfile: ./train/Dockerfile`, and Docker resolves COPY sources relative to the build context (Dockerfile docs; Compose build docs). Under that context, ./workers/base_docker_images/run_worker.sh lies outside the build context, so `docker compose build train` fails before the training image is built. The rest of this Dockerfile still uses Piscis-local paths like ./environment.yml, confirming it is meant for the nested context.
PR #148's run_worker.sh rollout added a repo-root-relative `COPY ./workers/base_docker_images/run_worker.sh` to the piscis predict/train Dockerfiles, but piscis is the only GPU worker built via compose with `context: .` (the piscis subdir), so that path did not resolve and both piscis images failed to build. The other 12 GPU workers build from repo-root context, where the identical line works.

- docker-compose.yaml: build both services from repo-root context (context: ../../.. + dockerfile: ./workers/annotations/piscis/...).
- predict/train Dockerfiles: re-root the piscis-local COPYs to ./workers/annotations/piscis/... so they resolve under the new context; the run_worker.sh COPY is correct as-is.
- Install a CUDA-enabled jax (jax[cuda12]) so piscis runs on the GPU instead of the CPU jaxlib that `pip install flax` pulls by default.

Build-context fix verified on a g4dn GPU host (driver 580): all 13 ML/GPU workers build + smoke-test clean. CUDA-jax change verified on a fresh GPU build.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_017ovGUQDgkmBmKFYaJ3FbUW
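The build-context failure and its fix can be modeled with a small check. This is a simplified sketch, not how Docker itself is implemented: `copy_source_resolves` is a hypothetical helper that only tests whether a COPY source exists inside a given context directory, and the directory tree is a throwaway copy of the paths named in the commit message.

```python
import tempfile
from pathlib import Path


def copy_source_resolves(context_dir: Path, source: str) -> bool:
    """Simplified model: a COPY source must exist inside the build context."""
    context = context_dir.resolve()
    candidate = (context / source).resolve()
    return candidate.is_relative_to(context) and candidate.exists()


# Throwaway tree mirroring the paths named in the commit message.
repo = Path(tempfile.mkdtemp())
piscis = repo / "workers" / "annotations" / "piscis"
base = repo / "workers" / "base_docker_images"
piscis.mkdir(parents=True)
base.mkdir(parents=True)
(piscis / "environment.yml").touch()
(base / "run_worker.sh").touch()

run_worker = "./workers/base_docker_images/run_worker.sh"
old_context = copy_source_resolves(piscis, run_worker)  # context: . (piscis subdir)
new_context = copy_source_resolves(repo, run_worker)    # context: repo root
local_unrerooted = copy_source_resolves(repo, "./environment.yml")
local_rerooted = copy_source_resolves(
    repo, "./workers/annotations/piscis/environment.yml"
)
```

The last two checks illustrate why moving the context to the repo root also required re-rooting the piscis-local COPY paths.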
Follow-up to 8748cac. The current piscis is torch-based: the worker entrypoints import only torch, and neither piscis 1.0.0 (PyPI, predict) nor the zjniu/Piscis source (train) declares jax or flax. jax entered the image solely via a leftover `pip install flax` from the old jax-based Piscis.

So the jax[cuda12] GPU install added in 8748cac was making an unused library GPU-capable (~14 GB of CUDA wheels per image for nothing). Remove both it and the `pip install flax` holdover; piscis's GPU work goes through torch (cuda.is_available()=True). Keeps the build-context fix from 8748cac.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_017ovGUQDgkmBmKFYaJ3FbUW
…h; sweep unused imports

Continuation of the worker-startup-latency work (TODO-002).

Deferral (17 production entrypoints): heavy libs used only in compute/helpers (pandas, geopandas, scipy, sklearn, mahotas, large_image, rasterio) moved into those functions, each with a comment, so the lightweight `interface` request no longer loads them. Measured --help proxy: blob_random_forest 0.82->0.32s (-0.50s), crop 0.43->0.34s, connect_to_nearest 0.33s, registration 0.39s.
- registration: only large_image deferred; StackReg kept at module top (its tests patch entrypoint.StackReg and reference StackReg.* class constants).
- stardist/piscis_train (rasterio) + GPU workers: static-validated only; need amd64 build-host runtime validation.

Unused-import sweep: removed 143 unused module-level imports across 41 entrypoints (conservative AST: zero name-references, single-line top-level only, kept used names in multi-name lines, skipped conditional imports). Mostly ~0 startup (stdlib/numpy-shadowed/lazy-skimage); real wins were dead imageio (crop/histogram_matching/registration) and scipy.spatial.distance (point_to_nearest_blob_distance). ai_analysis left untouched (deprecation pending).

All buildable worker-profile test suites pass (incl. deconwolf 38, registration 20, h_and_e 9 with dead hed2rgb removed). Tracking doc updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015xrXEVpvb4c4ScEVjN1VsM
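A conservative AST sweep of the kind described above might look like the sketch below. It is an assumption about the approach, not the script used for this commit: `unused_top_level_imports` is a hypothetical name, it covers only part of the stated rules (top-level statements only, imports nested in `if`/`try` skipped, a name counts as used if it appears anywhere as an identifier), and it does not apply the single-line restriction.

```python
import ast


def unused_top_level_imports(source: str) -> list[str]:
    """Names bound by top-level imports that are never referenced by name."""
    tree = ast.parse(source)
    referenced = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    unused = []
    for stmt in tree.body:  # top level only; imports nested in if/try are skipped
        if not isinstance(stmt, (ast.Import, ast.ImportFrom)):
            continue
        for alias in stmt.names:
            if alias.name == "*":
                continue  # star imports bind unknown names; leave them alone
            bound = alias.asname or alias.name.split(".")[0]
            if bound not in referenced:
                unused.append(bound)
    return unused


example = '''
import os
import imageio
import numpy as np
from pathlib import Path, PurePath

try:
    import optional_dep
except ImportError:
    optional_dep = None

def compute(path):
    return np.zeros(3), Path(path), os.sep
'''

report = unused_top_level_imports(example)  # ['imageio', 'PurePath']
```

Because the source is only parsed and never executed, the sweep does not need any of the scanned libraries installed. Names used solely through strings (for example in `__all__`) would be misreported, so a human review of the report is still advisable.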
…porting piscis)

The piscis `interface` request only builds a UI dict + lists local model filenames, but it was paying the full ~4s torch import. Both entrypoints AND utils.py had a module-level `from piscis.paths import MODELS_DIR`, and `from piscis.paths import X` runs piscis/__init__.py -> `from piscis.core import Piscis` -> `import torch`. Because both entrypoints `import utils` at module level (and interface() calls utils.list_girder_models), the torch import was unavoidable on the interface path even though torch is deferred in compute().

Fix: define MODELS_DIR as a plain pathlib path (= ~/.piscis/models, mirroring piscis.paths.MODELS_DIR) in utils.py WITHOUT importing piscis, and have both entrypoints use utils.MODELS_DIR. compute()'s deferred piscis/torch imports are unchanged.

Verified on a g4dn GPU build (driver 580):
- loading entrypoint.py: torch NOT in sys.modules (was loaded before); importtime shows no torch/jax.
- interface-path import 0.35-0.41s vs ~4.2s for the old piscis.paths import.
- model list unchanged: 20230616/20230709/20230905/20251212/ps_20240419_112256.
- no regression: piscis_predict/train build (built=2 failed=0); compute path still imports torch+piscis and sees the GPU.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_017ovGUQDgkmBmKFYaJ3FbUW
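The shape of this fix can be sketched as below. It is illustrative only: the `MODELS_DIR` value comes from the commit message, while `list_local_models` is a hypothetical helper (the real `utils.list_girder_models` is not shown in this PR) and the demo filenames are placeholders created in a temporary directory.

```python
import sys
import tempfile
from pathlib import Path

# Plain path constant; mirrors the value the commit message gives for
# piscis.paths.MODELS_DIR without executing piscis/__init__.py (and so torch).
MODELS_DIR = Path.home() / ".piscis" / "models"


def list_local_models(models_dir: Path = MODELS_DIR) -> list[str]:
    """Hypothetical helper: sorted model filenames, or [] if the dir is absent."""
    if not models_dir.is_dir():
        return []
    return sorted(p.name for p in models_dir.iterdir() if p.is_file())


# Demonstration against a throwaway directory rather than the real one.
demo_dir = Path(tempfile.mkdtemp())
for name in ("20230709", "20230616"):
    (demo_dir / name).touch()

models = list_local_models(demo_dir)
# On the interface path this is expected to stay False, since nothing
# above imports piscis.
piscis_loaded = "piscis" in sys.modules
```

The trade-off of duplicating the constant is that it can drift if piscis ever changes its models directory, which is presumably why the commit notes that it mirrors `piscis.paths.MODELS_DIR`.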
…GPU validation

All 13 GPU workers were build-host validated on 2026-06-28 (g4dn/driver-580): builds + smoke-clean, lazy-import refactor runtime-clean. Records the three piscis fixes found there: run_worker.sh build-context (8748cac), vestigial jax/flax removal (60afa80), and the interface ~4s torch import (bc70262). Also flips the item-#2 "static-validated only" caveat + the piscis MODELS_DIR/__init__ question to RESOLVED. Remaining: a real end-to-end compute run vs live data.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_017ovGUQDgkmBmKFYaJ3FbUW
…n summary

The parenthetical and the detailed 2026-06-28 log both list three fixes (run_worker.sh build-context error, vestigial jax/flax dependency, torch import on the interface path); the summary line said "Two". Doc-only consistency fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015xrXEVpvb4c4ScEVjN1VsM
Why
NimbusImage runs each worker as a fresh `docker run --rm` container per job (images are already local, so registry pull is not the cost). Two avoidable costs were paid on every invocation:

- `conda run -n worker …` in the ENTRYPOINT spawns a second Python process (the `conda` CLI) just to activate the env and exec the target: ~0.8–1.0 s of pure overhead per job (~70% of a light worker's startup).
- The `interface` request (the interactive, user-facing path) paid multi-second imports it never used.

Full audit, measurements, and methodology: `todo/worker-startup-latency.md` (TODO-002).

What changed
Fast entrypoint. New `workers/base_docker_images/run_worker.sh` activates the conda env in-process (PATH + `activate.d` hooks, preserving `GDAL_DATA`/`PROJ_DATA`/`GDAL_DRIVER_PATH`) and execs the env python, with no second interpreter. Auto-detects miniforge (amd64) / miniconda (arm64).

- Baked into `worker-base` + `image-processing-base` as the ENTRYPOINT; the 19 shared-base workers inherit it. `worker_client` added to `worker-base` for parity.
- Converted the remaining 28 production Dockerfiles (13 GPU workers + `deconwolf` + `ai_analysis` + `blob_random_forest_classifier`, incl. `_M1`).

Lazy imports. Heavy ML libs deferred into `compute()`/helpers across all GPU workers (+ `sam_automatic_mask_generator`); `matplotlib` deferred in the shared `annotation_tools.py`.

Refactors / cleanup. `laplacian_of_gaussian`, `line_scan_worker`, `blob_colony_two_color_intensity_worker` now build `FROM worker-base` (dropping inline base duplication, a stale `Kitware/UPennContrast` clone, and hardcoded x86_64 miniforge); obsolete `_M1` files deleted. `line_scan` `__main__` aligned to the canonical dispatch. Dropped unused `r-base` + added `conda clean` to both bases.

Results (arm64, full startup via `--help`, median of 5)

Per worker, as recorded in the first commit message: blob_intensity 1.21 → 0.41 s, crop 1.22 → 0.44 s, blob_metrics ~1.2 → 0.40 s, registration → 0.51 s.

Base images: worker-base 5.18 → 4.74 GB, image-processing-base 9.65 → 9.09 GB (disk only; does not affect startup).
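A "median of 5" startup measurement can be reproduced with a harness along these lines. This is a sketch, not the methodology recorded in `todo/worker-startup-latency.md`: `median_startup_seconds` is a hypothetical helper, and the command timed here is a trivial Python process so the example runs without Docker.

```python
import statistics
import subprocess
import sys
import time


def median_startup_seconds(command: list[str], runs: int = 5) -> float:
    """Median wall-clock time of `command` over `runs` fresh processes."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(command, check=True, capture_output=True)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in command so the sketch runs anywhere; for a worker one would time
# something like ["docker", "run", "--rm", "<image>", "--help"] instead,
# where <image> is a placeholder for the worker image name.
baseline = median_startup_seconds([sys.executable, "-c", "pass"])
```

Using the median rather than the mean keeps a single slow run (cold page cache, a busy host) from skewing the reported figure.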
Validation
- … `conda run` reference.
- GPU workers are static-validated only (`py_compile` + import-usage analysis), with no local CUDA/torch. They must be built on the amd64 build host and smoke-tested (imports resolve at runtime, a `compute` run works) before deploy.

Deferred (not in this PR)

- … `build-essential`/`git`/dev headers (riskier image-size win).
- … `$BASE_IMAGE`-ARG workers left on `conda run` (not in the production manifest).
- … `tifffile`/`girder_client`).

🤖 Generated with Claude Code
https://claude.ai/code/session_015xrXEVpvb4c4ScEVjN1VsM