Reduce worker startup latency: fast env-activation entrypoint, lazy ML imports, slimmer bases by arjunrajlab · Pull Request #148 · arjunrajlaboratory/ImageAnalysisProject

arjunrajlab · 2026-06-27T20:10:32Z

Why

NimbusImage runs each worker as a fresh docker run --rm container per job (images are already local — registry pull is not the cost). Two avoidable costs were paid on every invocation:

conda run -n worker … in the ENTRYPOINT spawns a second Python process (the conda CLI) just to activate the env and exec the target — ~0.8–1.0 s of pure overhead per job (~70% of a light worker's startup).
Heavy ML imports (torch/TF/cellpose/sam2/…) ran at module load, so even the lightweight interface request (the interactive, user-facing path) paid multi-second imports it never used.

Full audit, measurements, and methodology: todo/worker-startup-latency.md (TODO-002).

What changed

Fast entrypoint. New workers/base_docker_images/run_worker.sh activates the conda env in-process (PATH + activate.d hooks, preserving GDAL_DATA/PROJ_DATA/GDAL_DRIVER_PATH) and execs the env python — no second interpreter. Auto-detects miniforge (amd64) / miniconda (arm64).

Baked into worker-base + image-processing-base as the ENTRYPOINT; the 19 shared-base workers inherit it. worker_client added to worker-base for parity.
Rolled out to the 28 remaining production Dockerfiles (13 GPU workers + deconwolf + ai_analysis + blob_random_forest_classifier, incl. _M1).

Lazy imports. Heavy ML libs deferred into compute()/helpers across all GPU workers (+ sam_automatic_mask_generator); matplotlib deferred in the shared annotation_tools.py.
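
A minimal sketch of the deferral pattern described above, assuming an entrypoint with an `interface()` path and a `compute()` path. The function bodies and the UI dict shape are hypothetical, and the stdlib `statistics` module stands in for a heavy library such as torch or cellpose so the sketch runs anywhere.

```python
import sys


def interface() -> dict:
    """Lightweight path: builds a UI description and touches no heavy library."""
    # Hypothetical UI dict; the real workers' interface schema is not shown here.
    return {"Threshold": {"type": "number", "default": 0.5}}


def compute(values: list[float]) -> float:
    """Heavy path: the import runs on the first call, then is cached in sys.modules."""
    import statistics  # stand-in for torch / cellpose / sam2 (deferred import)

    return statistics.median(values)


if __name__ == "__main__":
    interface()
    # Reports whether the stand-in has been imported yet on the interface path.
    print("loaded after interface():", "statistics" in sys.modules)
    compute([3.0, 1.0, 2.0])
    print("loaded after compute():", "statistics" in sys.modules)
```

The import cost moves from module load to the first `compute()` call; later calls reuse the cached module, so the deferral adds no repeated overhead.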

Refactors / cleanup. laplacian_of_gaussian, line_scan_worker, blob_colony_two_color_intensity_worker now build FROM worker-base (dropping inline base duplication, a stale Kitware/UPennContrast clone, and hardcoded x86_64 miniforge); obsolete _M1 files deleted. line_scan __main__ aligned to the canonical dispatch. Dropped unused r-base + added conda clean to both bases.

Results (arm64, full startup via `--help`, median of 5)

Worker	Before	After
blob_intensity	1.21 s	0.41 s
crop	1.22 s	0.44 s
blob_metrics	~1.2 s	0.40 s
registration	~1.2 s	0.51 s

Base images: worker-base 5.18 → 4.74 GB, image-processing-base 9.65 → 9.09 GB (disk only; does not affect startup).
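
The "median of 5" figures above could be reproduced with a harness along these lines. This is a sketch only: the actual methodology lives in todo/worker-startup-latency.md, which is not shown here, and `median_startup` is a hypothetical helper. To time a worker, the stand-in command would be replaced by the `docker run --rm … --help` invocation for that image.

```python
import statistics
import subprocess
import sys
import time


def median_startup(cmd: list[str], runs: int = 5) -> float:
    """Return the median wall-clock seconds for `cmd` to run to completion."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        # Output is discarded; only elapsed time matters. A failing command
        # raises CalledProcessError instead of contributing a bogus sample.
        subprocess.run(
            cmd,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


if __name__ == "__main__":
    # Stand-in command (a bare interpreter start) so the sketch runs without Docker.
    cmd = [sys.executable, "-c", "pass"]
    print(f"median of 5: {median_startup(cmd):.2f} s")
```

The median is used rather than the mean so that a single slow run (cold page cache, scheduler noise) does not skew the figure.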

Validation

CPU workers built + tested: blob_metrics 9, registration 20, crop 14, blob_intensity 9, line_scan 7, blob_random_forest 4 — all pass. GDAL/PROJ env vars verified identical to the conda run reference.
⚠️ GPU workers are static-validated only (py_compile + import-usage analysis) — no local CUDA/torch. They must be built on the amd64 build host and smoke-tested (imports resolve at runtime, a compute run works) before deploy.

Deferred (not in this PR)

Multi-stage prune of build-essential/git/dev headers (riskier image-size win).
Deprecated / $BASE_IMAGE-ARG workers left on conda run (not in the production manifest).
Trimming other always-on imports (tifffile/girder_client).

🤖 Generated with Claude Code

https://claude.ai/code/session_015xrXEVpvb4c4ScEVjN1VsM

…lim bases

NimbusImage runs a fresh container per job. `conda run -n worker` in the ENTRYPOINT spawned a second Python process, adding ~0.8-1.0s to every invocation, and heavy ML imports were paid even by the lightweight `interface` request. This cuts both.

Startup mechanism:
- Add workers/base_docker_images/run_worker.sh: activates the worker conda env in-process (PATH + activate.d hooks, preserving GDAL_DATA/PROJ_DATA/GDAL_DRIVER_PATH) and execs the env python, replacing `conda run -n worker`. Auto-detects miniforge (amd64) / miniconda (arm64).
- Bake it into worker-base and image-processing-base as the ENTRYPOINT; the 19 shared-base workers inherit it (per-worker `conda run` removed). Add worker_client to worker-base for parity.
- Convert the remaining 28 production Dockerfiles (13 GPU workers + deconwolf + ai_analysis + blob_random_forest_classifier, incl. _M1 variants) to run_worker.sh.

Measured (arm64, full startup via --help, median of 5): blob_intensity 1.21->0.41s, crop 1.22->0.44s, blob_metrics ~1.2->0.40s, registration ->0.51s. CPU worker tests pass (blob_metrics 9, registration 20, crop 14, blob_intensity 9, line_scan 7, blob_random_forest 4). GDAL/PROJ env vars verified identical to conda-run.

Import deferral:
- Move heavy ML imports (torch/tensorflow/cellpose/deeptile/stardist/sam2/segment_anything/deepcell/piscis) from module load into compute()/helpers across all GPU workers, so the interface request and startup no longer pay them. Also defer condensatenet's model package and sam_fewshot's torchvision NMS patch.
- annotation_utilities/annotation_tools.py: import matplotlib lazily (~50ms off nearly every worker).

Refactors / cleanup:
- laplacian_of_gaussian, line_scan_worker, blob_colony_two_color_intensity_worker build FROM worker-base instead of replicating the base inline (drops stale Kitware/UPennContrast clone + hardcoded x86_64 miniforge); deleted their obsolete Dockerfile_M1, pinned compose to literal Dockerfile, fixed legacy build script.
- line_scan_worker __main__ aligned to the canonical match-request dispatch.
- Drop unused r-base from both base images; add `conda clean` (worker-base 5.18->4.74GB, image-processing-base 9.65->9.09GB).
- Normalize piscis predict/train entrypoints from CRLF to LF.

GPU worker changes are static-validated (py_compile + usage analysis) only; they require amd64 build-host validation before deploy (no local CUDA/torch).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015xrXEVpvb4c4ScEVjN1VsM

Extend TODO-002 with the 2026-06-27 continued log:
- task (a): rollout of run_worker.sh to all 28 remaining production GPU/self-contained Dockerfiles (+ sam_automatic_mask_generator lazy imports);
- task (b): image-size wins (#5 r-base removal, #7 conda clean; bases 5.18->4.74GB / 9.65->9.09GB).

Updates the status section and remaining-work list (#6 multi-stage prune and GPU build-host validation still open).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015xrXEVpvb4c4ScEVjN1VsM

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_015xrXEVpvb4c4ScEVjN1VsM

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5cd5e96cd1


chatgpt-codex-connector · 2026-06-27T20:17:30Z

+      defaultToolName="Spots"
+
+# Fast env activation in place of `conda run` (see todo/worker-startup-latency.md)
+COPY ./workers/base_docker_images/run_worker.sh /usr/local/bin/run_worker.sh


Keep run_worker inside the Piscis predict context

The Piscis compose file builds this Dockerfile with context: . from workers/annotations/piscis and dockerfile: ./predict/Dockerfile, while Docker COPY sources are resolved from the build context (Dockerfile docs; Compose build docs). Under that context ./workers/base_docker_images/run_worker.sh is absent/outside the build context, so docker compose build predict fails before the predict image is built; the rest of this Dockerfile still uses Piscis-local paths like ./environment.yml, confirming it is meant for the nested context.


chatgpt-codex-connector · 2026-06-27T20:17:30Z

+      defaultToolName="Piscis training"
+
+# Fast env activation in place of `conda run` (see todo/worker-startup-latency.md)
+COPY ./workers/base_docker_images/run_worker.sh /usr/local/bin/run_worker.sh


Keep run_worker inside the Piscis train context

The Piscis compose file builds this Dockerfile with context: . from workers/annotations/piscis and dockerfile: ./train/Dockerfile, while Docker COPY sources are resolved from the build context (Dockerfile docs; Compose build docs). Under that context ./workers/base_docker_images/run_worker.sh is absent/outside the build context, so docker compose build train fails before the training image is built; the rest of this Dockerfile still uses Piscis-local paths like ./environment.yml, confirming it is meant for the nested context.


PR #148's run_worker.sh rollout added a repo-root-relative `COPY ./workers/base_docker_images/run_worker.sh` to the piscis predict/train Dockerfiles, but piscis is the only GPU worker built via compose with `context: .` (the piscis subdir), so that path did not resolve and both piscis images failed to build. The other 12 GPU workers build from repo-root context, where the identical line works.

- docker-compose.yaml: build both services from repo-root context (context: ../../.. + dockerfile: ./workers/annotations/piscis/...).
- predict/train Dockerfiles: re-root the piscis-local COPYs to ./workers/annotations/piscis/... so they resolve under the new context; the run_worker.sh COPY is correct as-is.
- Install a CUDA-enabled jax (jax[cuda12]) so piscis runs on the GPU instead of the CPU jaxlib that `pip install flax` pulls by default.

Build-context fix verified on a g4dn GPU host (driver 580): all 13 ML/GPU workers build + smoke-test clean. CUDA-jax change verified on a fresh GPU build.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_017ovGUQDgkmBmKFYaJ3FbUW

Follow-up to 8748cac.

The current piscis is torch-based: the worker entrypoints import only torch, and neither piscis 1.0.0 (PyPI, predict) nor the zjniu/Piscis source (train) declares jax or flax. jax entered the image solely via a leftover `pip install flax` from the old jax-based Piscis. So the jax[cuda12] GPU install added in 8748cac was making an unused library GPU-capable (~14 GB of CUDA wheels per image for nothing).

Remove both it and the `pip install flax` holdover; piscis's GPU work goes through torch (cuda.is_available()=True). Keeps the build-context fix from 8748cac.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_017ovGUQDgkmBmKFYaJ3FbUW

…h; sweep unused imports

Continuation of the worker-startup-latency work (TODO-002).

Deferral (17 production entrypoints): heavy libs used only in compute/helpers (pandas, geopandas, scipy, sklearn, mahotas, large_image, rasterio) moved into those functions, each with a comment, so the lightweight `interface` request no longer loads them. Measured --help proxy: blob_random_forest 0.82->0.32s (-0.50s), crop 0.43->0.34s, connect_to_nearest 0.33s, registration 0.39s.
- registration: only large_image deferred; StackReg kept at module top (its tests patch entrypoint.StackReg and reference StackReg.* class constants).
- stardist/piscis_train (rasterio) + GPU workers: static-validated only; need amd64 build-host runtime validation.

Unused-import sweep: removed 143 unused module-level imports across 41 entrypoints (conservative AST: zero name-references, single-line top-level only, kept used names in multi-name lines, skipped conditional imports). Mostly ~0 startup (stdlib/numpy-shadowed/lazy-skimage); real wins were dead imageio (crop/histogram_matching/registration) and scipy.spatial.distance (point_to_nearest_blob_distance). ai_analysis left untouched (deprecation pending).

All buildable worker-profile test suites pass (incl. deconwolf 38, registration 20, h_and_e 9 with dead hed2rgb removed). Tracking doc updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015xrXEVpvb4c4ScEVjN1VsM
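
The conservative AST criteria listed in this commit message (zero name-references, single-line top-level only, used names kept in multi-name lines, conditional imports skipped) can be sketched roughly as follows. This is not the script used for the sweep, which is not shown; `unused_top_level_imports` is a hypothetical helper, and it ignores cases such as names re-exported via `__all__` or imports kept for side effects.

```python
import ast


def unused_top_level_imports(source: str) -> list[str]:
    """List names bound by single-line top-level imports that are never referenced."""
    tree = ast.parse(source)
    # Every bare identifier in the module; `np.zeros` contributes the root name `np`.
    used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}

    unused = []
    # Only direct children of the module are inspected, so imports nested in
    # if/try blocks (conditional imports) are skipped.
    for node in tree.body:
        if not isinstance(node, (ast.Import, ast.ImportFrom)):
            continue
        if node.end_lineno != node.lineno:
            continue  # multi-line (parenthesised) import: left alone
        if isinstance(node, ast.ImportFrom) and node.module == "__future__":
            continue  # compiler directive, not a real binding
        for alias in node.names:
            # `import a.b` binds `a`; `import a.b as c` binds `c`.
            bound = alias.asname or alias.name.split(".")[0]
            if bound != "*" and bound not in used:
                unused.append(bound)
    return unused


if __name__ == "__main__":
    sample = "import os\nimport sys\nfrom math import sqrt, floor\nprint(sys.argv, sqrt(4))\n"
    print(unused_top_level_imports(sample))
```

Reporting per name rather than per line is what allows a used name to be kept when it shares a multi-name import line with an unused one.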

…porting piscis)

The piscis `interface` request only builds a UI dict + lists local model filenames, but it was paying the full ~4s torch import. Both entrypoints AND utils.py had a module-level `from piscis.paths import MODELS_DIR`, and `from piscis.paths import X` runs piscis/__init__.py -> `from piscis.core import Piscis` -> `import torch`. Because both entrypoints `import utils` at module level (and interface() calls utils.list_girder_models), the torch import was unavoidable on the interface path even though torch is deferred in compute().

Fix: define MODELS_DIR as a plain pathlib path (= ~/.piscis/models, mirroring piscis.paths.MODELS_DIR) in utils.py WITHOUT importing piscis, and have both entrypoints use utils.MODELS_DIR. compute()'s deferred piscis/torch imports are unchanged.

Verified on a g4dn GPU build (driver 580):
- loading entrypoint.py: torch NOT in sys.modules (was loaded before); importtime shows no torch/jax.
- interface-path import 0.35-0.41s vs ~4.2s for the old piscis.paths import.
- model list unchanged: 20230616/20230709/20230905/20251212/ps_20240419_112256.
- no regression: piscis_predict/train build (built=2 failed=0); compute path still imports torch+piscis and sees the GPU.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_017ovGUQDgkmBmKFYaJ3FbUW
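
A minimal sketch of the fix this commit describes, assuming the `~/.piscis/models` location given in the message. `list_local_models` and `heavy_modules_loaded` are hypothetical helpers for illustration; the real `utils.list_girder_models` is not shown in this PR page, and whether model entries are files or directories is an assumption left open here.

```python
import sys
from pathlib import Path

# Plain path constant: mirrors piscis.paths.MODELS_DIR (per the commit message)
# without importing piscis, whose __init__ would pull in torch.
MODELS_DIR = Path.home() / ".piscis" / "models"


def list_local_models(models_dir: Path = MODELS_DIR) -> list[str]:
    """Return sorted entry names in the models directory (empty if it is missing)."""
    if not models_dir.is_dir():
        return []
    return sorted(entry.name for entry in models_dir.iterdir())


def heavy_modules_loaded(names: tuple[str, ...] = ("torch", "piscis")) -> list[str]:
    """Report which of the named heavy modules are already in sys.modules."""
    return [name for name in names if name in sys.modules]


if __name__ == "__main__":
    print("models:", list_local_models())
    print("heavy modules loaded on this path:", heavy_modules_loaded())
```

Checking `sys.modules` after loading the entrypoint is the same style of verification the commit reports ("torch NOT in sys.modules"); running Python with `-X importtime` gives the per-module timing view.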

…GPU validation

All 13 GPU workers were build-host validated on 2026-06-28 (g4dn/driver-580): builds + smoke-clean, lazy-import refactor runtime-clean.

Records the three piscis fixes found there:
- run_worker.sh build-context (8748cac)
- vestigial jax/flax removal (60afa80)
- the interface ~4s torch import (bc70262)

Also flips the item-#2 "static-validated only" caveat + the piscis MODELS_DIR/__init__ question to RESOLVED. Remaining: a real end-to-end compute run vs live data.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_017ovGUQDgkmBmKFYaJ3FbUW

…n summary

The parenthetical and the detailed 2026-06-28 log both list three fixes (run_worker.sh build-context error, vestigial jax/flax dependency, torch import on the interface path); the summary line said "Two". Doc-only consistency fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015xrXEVpvb4c4ScEVjN1VsM

arjunrajlab and others added 3 commits June 27, 2026 15:47

docs(todo): link TODO-002 to PR #148

a5733bd

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_015xrXEVpvb4c4ScEVjN1VsM

chatgpt-codex-connector Bot reviewed Jun 27, 2026


arjunrajlab and others added 6 commits June 28, 2026 00:54

arjunrajlab merged commit 4d5c42e into master Jun 29, 2026
1 check passed

arjunrajlab merged 9 commits into master from claude/worker-startup-latency
