Reduce worker startup latency: fast env-activation entrypoint, lazy ML imports, slimmer bases#148

Merged
arjunrajlab merged 9 commits into master from claude/worker-startup-latency on Jun 29, 2026
Conversation

@arjunrajlab

Collaborator

Why

NimbusImage runs each worker as a fresh docker run --rm container per job (images are already local — registry pull is not the cost). Two avoidable costs were paid on every invocation:

  1. conda run -n worker … in the ENTRYPOINT spawns a second Python process (the conda CLI) just to activate the env and exec the target — ~0.8–1.0 s of pure overhead per job (~70% of a light worker's startup).
  2. Heavy ML imports (torch/TF/cellpose/sam2/…) ran at module load, so even the lightweight interface request (the interactive, user-facing path) paid multi-second imports it never used.

Full audit, measurements, and methodology: todo/worker-startup-latency.md (TODO-002).

What changed

Fast entrypoint. New workers/base_docker_images/run_worker.sh activates the conda env in-process (PATH + activate.d hooks, preserving GDAL_DATA/PROJ_DATA/GDAL_DRIVER_PATH) and execs the env python — no second interpreter. Auto-detects miniforge (amd64) / miniconda (arm64).

  • Baked into worker-base + image-processing-base as the ENTRYPOINT; the 19 shared-base workers inherit it. worker_client added to worker-base for parity.
  • Rolled out to the 28 remaining production Dockerfiles (13 GPU workers + deconwolf + ai_analysis + blob_random_forest_classifier, incl. _M1).
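
To make concrete why no second interpreter is needed: the PATH part of activation amounts to a handful of environment variables. Below is a minimal Python sketch of that portion only, assuming a conventional env layout (`<prefix>/bin`). The helper `activated_env` and the `/opt/conda/envs/worker` prefix are hypothetical, and the `activate.d` hooks that run_worker.sh also handles are shell scripts this sketch does not run.

```python
import os


def activated_env(prefix: str, base_env: dict) -> dict:
    """Return a copy of base_env with the env at `prefix` first on PATH.

    Covers only the variable-setting part of activation. activate.d hook
    scripts (which typically set GDAL_DATA, PROJ_DATA, ...) are NOT run here.
    """
    env = dict(base_env)
    env_bin = os.path.join(prefix, "bin")
    env["PATH"] = env_bin + os.pathsep + env.get("PATH", "")
    env["CONDA_PREFIX"] = prefix
    env["CONDA_DEFAULT_ENV"] = os.path.basename(prefix)
    return env


# Hypothetical prefix, for illustration only; the real script auto-detects
# the miniforge / miniconda install location.
example = activated_env("/opt/conda/envs/worker", {"PATH": "/usr/bin"})
```

With an environment like this, a launcher could replace itself with the env's interpreter via `os.execve(python_path, argv, env)`, so only one Python process ever starts.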

Lazy imports. Heavy ML libs deferred into compute()/helpers across all GPU workers (+ sam_automatic_mask_generator); matplotlib deferred in the shared annotation_tools.py.
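
The deferral pattern can be sketched as follows. This is an illustration, not the workers' actual code: `statistics` stands in for a heavy library such as torch, and the shape of the dict returned by `interface()` is an assumption.

```python
def interface():
    """Lightweight path: builds a UI description and imports nothing heavy."""
    # Hypothetical interface payload; real workers define their own fields.
    return {"Model": {"type": "select", "items": ["a", "b"]}}


def compute(values):
    """Heavy path: the expensive import is paid only when compute runs."""
    # Deferred import (stand-in for `import torch`, `import cellpose`, ...).
    # Python caches modules in sys.modules, so repeat calls cost a dict lookup.
    import statistics

    return statistics.median(values)
```

Calling `interface()` never touches the deferred module, so a container started only to answer an interface request skips that import cost entirely.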

Refactors / cleanup. laplacian_of_gaussian, line_scan_worker, blob_colony_two_color_intensity_worker now build FROM worker-base (dropping inline base duplication, a stale Kitware/UPennContrast clone, and hardcoded x86_64 miniforge); obsolete _M1 files deleted. line_scan __main__ aligned to the canonical dispatch. Dropped unused r-base + added conda clean to both bases.

Results (arm64, full startup via --help, median of 5)

| Worker | Before | After |
| --- | --- | --- |
| blob_intensity | 1.21 s | 0.41 s |
| crop | 1.22 s | 0.44 s |
| blob_metrics | ~1.2 s | 0.40 s |
| registration | ~1.2 s | 0.51 s |

Base images: worker-base 5.18 → 4.74 GB, image-processing-base 9.65 → 9.09 GB (disk only; does not affect startup).
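
A "median of 5" startup measurement could be reproduced with a small harness like the one below. This is a sketch under assumptions: the methodology actually used is documented in todo/worker-startup-latency.md and is not reproduced here, and `median_startup` and `DEMO_CMD` are hypothetical names.

```python
import statistics
import subprocess
import sys
import time


def median_startup(cmd, runs=5):
    """Run `cmd` `runs` times and return the median wall-clock time in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            cmd,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in command so the sketch runs anywhere. For a worker, the command
# would be something like ["docker", "run", "--rm", "<image>", "--help"],
# where <image> is a placeholder for a locally built worker image.
DEMO_CMD = [sys.executable, "-c", "pass"]
```

The median is less sensitive than the mean to a single slow run, for example the first invocation after a build when the page cache is cold.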

Validation

  • CPU workers built + tested: blob_metrics 9, registration 20, crop 14, blob_intensity 9, line_scan 7, blob_random_forest 4 — all pass. GDAL/PROJ env vars verified identical to the conda run reference.
  • ⚠️ GPU workers are static-validated only (py_compile + import-usage analysis) — no local CUDA/torch. They must be built on the amd64 build host and smoke-tested (imports resolve at runtime, a compute run works) before deploy.

Deferred (not in this PR)

  • Multi-stage prune of build-essential/git/dev headers (riskier image-size win).
  • Deprecated / $BASE_IMAGE-ARG workers left on conda run (not in the production manifest).
  • Trimming other always-on imports (tifffile/girder_client).

🤖 Generated with Claude Code

https://claude.ai/code/session_015xrXEVpvb4c4ScEVjN1VsM

arjunrajlab and others added 3 commits June 27, 2026 15:47
…lim bases

NimbusImage runs a fresh container per job. `conda run -n worker` in the
ENTRYPOINT spawned a second Python process, adding ~0.8-1.0s to every
invocation, and heavy ML imports were paid even by the lightweight `interface`
request. This cuts both.

Startup mechanism:
- Add workers/base_docker_images/run_worker.sh: activates the worker conda env
  in-process (PATH + activate.d hooks, preserving GDAL_DATA/PROJ_DATA/
  GDAL_DRIVER_PATH) and execs the env python, replacing `conda run -n worker`.
  Auto-detects miniforge (amd64) / miniconda (arm64).
- Bake it into worker-base and image-processing-base as the ENTRYPOINT; the 19
  shared-base workers inherit it (per-worker `conda run` removed). Add
  worker_client to worker-base for parity.
- Convert the remaining 28 production Dockerfiles (13 GPU workers + deconwolf +
  ai_analysis + blob_random_forest_classifier, incl. _M1 variants) to run_worker.sh.

Measured (arm64, full startup via --help, median of 5): blob_intensity 1.21->0.41s,
crop 1.22->0.44s, blob_metrics ~1.2->0.40s, registration ~1.2->0.51s. CPU worker tests
pass (blob_metrics 9, registration 20, crop 14, blob_intensity 9, line_scan 7,
blob_random_forest 4). GDAL/PROJ env vars verified identical to conda-run.

Import deferral:
- Move heavy ML imports (torch/tensorflow/cellpose/deeptile/stardist/sam2/
  segment_anything/deepcell/piscis) from module load into compute()/helpers across
  all GPU workers, so the interface request and startup no longer pay them. Also
  defer condensatenet's model package and sam_fewshot's torchvision NMS patch.
- annotation_utilities/annotation_tools.py: import matplotlib lazily (~50ms off
  nearly every worker).

Refactors / cleanup:
- laplacian_of_gaussian, line_scan_worker, blob_colony_two_color_intensity_worker
  build FROM worker-base instead of replicating the base inline (drops stale
  Kitware/UPennContrast clone + hardcoded x86_64 miniforge); deleted their obsolete
  Dockerfile_M1, pinned compose to literal Dockerfile, fixed legacy build script.
- line_scan_worker __main__ aligned to the canonical match-request dispatch.
- Drop unused r-base from both base images; add `conda clean` (worker-base
  5.18->4.74GB, image-processing-base 9.65->9.09GB).
- Normalize piscis predict/train entrypoints from CRLF to LF.

GPU worker changes are static-validated (py_compile + usage analysis) only; they
require amd64 build-host validation before deploy (no local CUDA/torch).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015xrXEVpvb4c4ScEVjN1VsM
Extend TODO-002 with the 2026-06-27 continued log: task (a) rollout of
run_worker.sh to all 28 remaining production GPU/self-contained Dockerfiles
(+ sam_automatic_mask_generator lazy imports), and task (b) image-size wins
(#5 r-base removal, #7 conda clean; bases 5.18->4.74GB / 9.65->9.09GB).
Updates the status section and remaining-work list (#6 multi-stage prune and
GPU build-host validation still open).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015xrXEVpvb4c4ScEVjN1VsM
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015xrXEVpvb4c4ScEVjN1VsM

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5cd5e96cd1

defaultToolName="Spots"

# Fast env activation in place of `conda run` (see todo/worker-startup-latency.md)
COPY ./workers/base_docker_images/run_worker.sh /usr/local/bin/run_worker.sh

P1 Badge Keep run_worker inside the Piscis predict context

The Piscis compose file builds this Dockerfile with context: . from workers/annotations/piscis and dockerfile: ./predict/Dockerfile, while Docker COPY sources are resolved from the build context (Dockerfile docs; Compose build docs). Under that context ./workers/base_docker_images/run_worker.sh is absent/outside the build context, so docker compose build predict fails before the predict image is built; the rest of this Dockerfile still uses Piscis-local paths like ./environment.yml, confirming it is meant for the nested context.

defaultToolName="Piscis training"

# Fast env activation in place of `conda run` (see todo/worker-startup-latency.md)
COPY ./workers/base_docker_images/run_worker.sh /usr/local/bin/run_worker.sh

P1 Badge Keep run_worker inside the Piscis train context

The Piscis compose file builds this Dockerfile with context: . from workers/annotations/piscis and dockerfile: ./train/Dockerfile, while Docker COPY sources are resolved from the build context (Dockerfile docs; Compose build docs). Under that context ./workers/base_docker_images/run_worker.sh is absent/outside the build context, so docker compose build train fails before the training image is built; the rest of this Dockerfile still uses Piscis-local paths like ./environment.yml, confirming it is meant for the nested context.

arjunrajlab and others added 6 commits June 28, 2026 00:54
PR #148's run_worker.sh rollout added a repo-root-relative
`COPY ./workers/base_docker_images/run_worker.sh` to the piscis
predict/train Dockerfiles, but piscis is the only GPU worker built via
compose with `context: .` (the piscis subdir), so that path did not
resolve and both piscis images failed to build. The other 12 GPU
workers build from repo-root context, where the identical line works.
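
The failure mode can be illustrated with a simplified model of how a COPY source is resolved: the path is taken relative to the build context and must exist inside it. The helper `copy_source_resolves` is hypothetical and ignores details such as `.dockerignore`; the temporary tree only mirrors the paths named above.

```python
import tempfile
from pathlib import Path


def copy_source_resolves(context: Path, source: str) -> bool:
    """Simplified check: does `source` exist inside the build context?"""
    root = context.resolve()
    candidate = (root / source).resolve()
    return candidate.is_relative_to(root) and candidate.exists()


# Throwaway tree mirroring the paths named in this commit message.
repo = Path(tempfile.mkdtemp())
piscis = repo / "workers" / "annotations" / "piscis"
script = repo / "workers" / "base_docker_images" / "run_worker.sh"
piscis.mkdir(parents=True)
script.parent.mkdir(parents=True)
script.touch()
(piscis / "environment.yml").touch()

SRC = "./workers/base_docker_images/run_worker.sh"
from_root = copy_source_resolves(repo, SRC)      # repo-root context
from_piscis = copy_source_resolves(piscis, SRC)  # nested piscis context
```

Here `from_root` is true and `from_piscis` is false, which matches the observation that the same COPY line works for workers built from the repo root but not for a build whose context is the piscis subdirectory.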

- docker-compose.yaml: build both services from repo-root context
  (context: ../../.. + dockerfile: ./workers/annotations/piscis/...).
- predict/train Dockerfiles: re-root the piscis-local COPYs to
  ./workers/annotations/piscis/... so they resolve under the new
  context; the run_worker.sh COPY is correct as-is.
- Install a CUDA-enabled jax (jax[cuda12]) so piscis runs on the GPU
  instead of the CPU jaxlib that `pip install flax` pulls by default.

Build-context fix verified on a g4dn GPU host (driver 580): all 13
ML/GPU workers build + smoke-test clean. CUDA-jax change verified on a
fresh GPU build.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_017ovGUQDgkmBmKFYaJ3FbUW
Follow-up to 8748cac. The current piscis is torch-based: the worker
entrypoints import only torch, and neither piscis 1.0.0 (PyPI, predict)
nor the zjniu/Piscis source (train) declares jax or flax. jax entered
the image solely via a leftover `pip install flax` from the old
jax-based Piscis.

So the jax[cuda12] GPU install added in 8748cac was making an unused
library GPU-capable (~14 GB of CUDA wheels per image for nothing).
Remove both it and the `pip install flax` holdover; piscis's GPU work
goes through torch (cuda.is_available()=True). Keeps the build-context
fix from 8748cac.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_017ovGUQDgkmBmKFYaJ3FbUW
…h; sweep unused imports

Continuation of the worker-startup-latency work (TODO-002).

Deferral (17 production entrypoints): heavy libs used only in compute/helpers
(pandas, geopandas, scipy, sklearn, mahotas, large_image, rasterio) moved into those
functions, each with a comment, so the lightweight `interface` request no longer
loads them. Measured --help proxy: blob_random_forest 0.82->0.32s (-0.50s),
crop 0.43->0.34s, connect_to_nearest 0.33s, registration 0.39s.
- registration: only large_image deferred; StackReg kept at module top (its tests
  patch entrypoint.StackReg and reference StackReg.* class constants).
- stardist/piscis_train (rasterio) + GPU workers: static-validated only; need amd64
  build-host runtime validation.

Unused-import sweep: removed 143 unused module-level imports across 41 entrypoints
(conservative AST: zero name-references, single-line top-level only, kept used names
in multi-name lines, skipped conditional imports). Mostly ~0 startup
(stdlib/numpy-shadowed/lazy-skimage); real wins were dead imageio
(crop/histogram_matching/registration) and scipy.spatial.distance
(point_to_nearest_blob_distance). ai_analysis left untouched (deprecation pending).
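
A conservative sweep of this kind can be sketched with the standard `ast` module. This is an illustration of the stated criteria (top-level, single-line imports whose bound name is never referenced), not the script used for this commit; `unused_toplevel_imports` is a hypothetical name.

```python
import ast


def unused_toplevel_imports(source: str) -> list:
    """Return names bound by top-level, single-line imports that are never referenced."""
    tree = ast.parse(source)
    # Every bare identifier in the module. For `a.b.c`, the root name `a`
    # appears as an ast.Name, so attribute access counts as a use of `a`.
    used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    unused = []
    # Only direct children of the module: imports nested in if/try/def
    # (conditional imports) are skipped on purpose.
    for stmt in tree.body:
        if not isinstance(stmt, (ast.Import, ast.ImportFrom)):
            continue
        if stmt.end_lineno != stmt.lineno:
            continue  # multi-line import: leave for manual review
        if isinstance(stmt, ast.ImportFrom) and stmt.module == "__future__":
            continue  # compiler directive, not a name binding
        for alias in stmt.names:
            if alias.name == "*":
                continue
            bound = alias.asname or alias.name.split(".")[0]
            if bound not in used:
                unused.append(bound)
    return unused


SAMPLE = "import os\nimport sys\nfrom math import pi, tau\nprint(sys.argv, pi)\n"
```

For `SAMPLE`, the function reports `os` and `tau` while keeping `sys` and `pi`. A check like this cannot see names used only through strings (for example `__all__` re-exports) or imports kept for side effects, so removals still need a test run.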

All buildable worker-profile test suites pass (incl. deconwolf 38, registration 20,
h_and_e 9 with dead hed2rgb removed). Tracking doc updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015xrXEVpvb4c4ScEVjN1VsM
…porting piscis)

The piscis `interface` request only builds a UI dict + lists local model
filenames, but it was paying the full ~4s torch import. Both entrypoints AND
utils.py had a module-level `from piscis.paths import MODELS_DIR`, and
`from piscis.paths import X` runs piscis/__init__.py -> `from piscis.core import
Piscis` -> `import torch`. Because both entrypoints `import utils` at module
level (and interface() calls utils.list_girder_models), the torch import was
unavoidable on the interface path even though torch is deferred in compute().

Fix: define MODELS_DIR as a plain pathlib path (= ~/.piscis/models, mirroring
piscis.paths.MODELS_DIR) in utils.py WITHOUT importing piscis, and have both
entrypoints use utils.MODELS_DIR. compute()'s deferred piscis/torch imports
are unchanged.
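
A minimal sketch of the idea, assuming the `~/.piscis/models` location stated above; `list_local_models` is a hypothetical helper and not the contents of the real utils.py:

```python
from pathlib import Path

# Mirrors the location this commit message gives for piscis.paths.MODELS_DIR,
# written out by hand so that importing this module never imports piscis
# (and therefore never imports torch).
MODELS_DIR = Path.home() / ".piscis" / "models"


def list_local_models(models_dir: Path = MODELS_DIR) -> list:
    """Return the sorted entry names in models_dir, or [] if it does not exist."""
    if not models_dir.is_dir():
        return []
    return sorted(p.name for p in models_dir.iterdir())
```

The trade-off is that the path is duplicated: if a future piscis release moved its models directory, this constant would need to be updated by hand.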

Verified on a g4dn GPU build (driver 580):
- loading entrypoint.py: torch NOT in sys.modules (was loaded before); importtime
  shows no torch/jax.
- interface-path import 0.35-0.41s vs ~4.2s for the old piscis.paths import.
- model list unchanged: 20230616/20230709/20230905/20251212/ps_20240419_112256.
- no regression: piscis_predict/train build (built=2 failed=0); compute path still
  imports torch+piscis and sees the GPU.
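
The "torch NOT in sys.modules" check can be automated along these lines. This is a sketch, not the verification script used here; `import_pulls_in` is a hypothetical helper, and it runs the probe in a fresh interpreter so modules already loaded by the caller do not mask the result.

```python
import subprocess
import sys


def import_pulls_in(entry_module: str, heavy_module: str) -> bool:
    """True if importing entry_module in a fresh interpreter loads heavy_module."""
    probe = (
        "import importlib, sys; "
        f"importlib.import_module({entry_module!r}); "
        f"print({heavy_module!r} in sys.modules)"
    )
    result = subprocess.run(
        [sys.executable, "-c", probe],
        check=True,
        capture_output=True,
        text=True,
    )
    # The probe prints its verdict last, after any output from the import itself.
    return result.stdout.strip().endswith("True")
```

In a worker image this would be called as `import_pulls_in("entrypoint", "torch")` from the directory containing the entrypoint. Running `python -X importtime` gives a complementary per-module timing breakdown.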

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_017ovGUQDgkmBmKFYaJ3FbUW
…GPU validation

All 13 GPU workers were build-host validated on 2026-06-28 (g4dn/driver-580):
builds + smoke-clean, lazy-import refactor runtime-clean. Records the three
piscis fixes found there — run_worker.sh build-context (8748cac), vestigial
jax/flax removal (60afa80), and the interface ~4s torch import (bc70262) — and
flips the item-#2 "static-validated only" caveat + the piscis MODELS_DIR/__init__
question to RESOLVED. Remaining: a real end-to-end compute run vs live data.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_017ovGUQDgkmBmKFYaJ3FbUW
…n summary

The parenthetical and the detailed 2026-06-28 log both list three fixes
(run_worker.sh build-context error, vestigial jax/flax dependency, torch import
on the interface path); the summary line said "Two". Doc-only consistency fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015xrXEVpvb4c4ScEVjN1VsM
@arjunrajlab arjunrajlab merged commit 4d5c42e into master Jun 29, 2026
1 check passed