Fix Dockerfile CMD for provider=runpod (endless re-provisioning loop)#14
Fix Dockerfile CMD for provider=runpod (endless re-provisioning loop)#14nataliasocaity wants to merge 9 commits into
Conversation
* fix: runpod router uses latest job result * dev: improved scan and build * fix: scan flow was detecting ml when there was not
The rendered Dockerfile used `CMD uvicorn service:app` for every provider,
but APIPod(provider="runpod") returns a SocaityRunpodRouter that is not an
ASGI application. uvicorn imports it, the container looks healthy from the
outside, then every request 500s with:
TypeError: 'SocaityRunpodRouter' object is not callable
In RunPod's deploy flow this manifests as endless re-provisioning: the
orchestrator sees the failing requests, tears the container down, retries.
Switch the CMD to `python <service>.py --rp_api_host 0.0.0.0 --rp_api_port 8000`
when provider=runpod, so app.start() hands off to runpod's worker harness
(verified locally: `Starting Serverless Worker | Version 1.9.0`).
uvicorn behavior is preserved for all other providers.
|
This works and is correct.. please checkout face2face how it was done there and if it was done there in a generalized way. Apipod built should create the docker file correctly though based on -- settings. Needs a check |
|
Thanks for the pointers, I went through the three things before touching anything else. Wanted to check with you that I'm reading this right. 1. face2face I cloned it and looked at 2. "apipod build should create the Dockerfile correctly based on the settings" I think I see what you mean. I set up a small test project and ran
I think the reason is in config_data["orchestrator"] = args.orchestrator
config_data["compute"] = args.compute
config_data["provider"] = args.providerSince I also checked the tests folder — there are tests for 3. Issues from yesterday I didn't find new issues in GitHub for APIPod / qwen-models / face2face. I assumed you meant the frictions list I've been keeping in What I think I should do next (want to confirm before pushing anything) If I'm reading you right, this PR needs two more bits:
Is that what you had in mind, or am I going further than you wanted for this PR? |
w4hns1nn
left a comment
There was a problem hiding this comment.
In general seems reasonable, but without testing or IDE I can't tell if that code actually makes sense
| "found_config": False | ||
| "title": "apipod-service", | ||
| "found_config": False, | ||
| "orchestrator": "local", |
There was a problem hiding this comment.
Why was the python bookworm image been introduced like a default?
There was a problem hiding this comment.
Good catch. Those defaults were already in DeploymentConfig and in the scanner's .get(default) calls, so I was duplicating. Dropped them from the detector dict and added a comment so we don't add them back later.
There was a problem hiding this comment.
What is a starter file and readme?
There was a problem hiding this comment.
Fair, the name was vague. Renamed to _write_deploy_dir_helpers and added a docstring explaining the two files it drops (README from the starter template and .dockerignore) and that user-created versions are preserved.
…lag is omitted
A plain `apipod --build` was overwriting the values from apipod.json with
argparse defaults ("localhost"/"dedicated"/"local"), so a service configured
with provider: runpod was rendering uvicorn as CMD in the Dockerfile. The
CMD conditional in docker_template.j2 already routed correctly when the
provider arrived as "runpod", but it never did because the CLI replaced it
before the template was rendered.
Fix: drop the argparse defaults to None and only merge into config_data when
the user actually passed the flag. run_start keeps the historical defaults
("local"/"dedicated"/"localhost") as a fallback so `apipod --start` without
flags still behaves the same.
Tests cover render_dockerfile against both provider values plus the minimal
profile, where the CMD is hardcoded but the ENV must still propagate.
|
Hey, I looked at it again and the confusion comes from the table; between what you will experience on local development vs what will happen on apipod deploy. If you are locally debugging your code and no setting is set, the service will start as fastapi service as it should, and you will test your service by navigating to the /docs route on ýour server. If you want to experience how the service would look like deployed on socaity you will set the --localhost flag to emulate the service with the queue. We need to find a clean way to seperate "build/deploy" in the dockerfile and local "testing" for developery to avoid this confusion. @nataliasocaity @K4rlosReyes Then update the PR and I will have another look at it. |
| "python:3.11-slim", | ||
| "python:3.10-slim", | ||
| "nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04", # Standard CUDA Runtime | ||
| "ghcr.io/astral-sh/uv:python3.12-bookworm-slim", |
There was a problem hiding this comment.
This change remains mysterical its probably an error
There was a problem hiding this comment.
Removed it from DEFAULT_IMAGES and from the preferred list in _load_images. Heads up: profile.py:173 still returns the astral image for PROFILE_SERVERLESS_MINIMAL because docker_template_minimal.j2 uses uv. If you want to drop that path entirely, let me know and I'll open a separate PR for it.
| content = json.load(handle) | ||
| if isinstance(content, dict): | ||
| keys = content.keys() | ||
| model_keys = { |
There was a problem hiding this comment.
Some comments like before help to explain whats going on
There was a problem hiding this comment.
Added a docstring to _is_model_json walking through the three-layer check (filename whitelist, blocklist, content sniff) and an inline comment on the HF model_keys set.
| if profile == PROFILE_SERVERLESS_MINIMAL: | ||
| version = str(config.get("python_version") or "3.12") | ||
| for img in self.images: | ||
| if f"python{version}" in img and "astral-sh/uv" in img: |
There was a problem hiding this comment.
Removed in the same commit as the DEFAULT_IMAGES change. See the note in the line 22 thread about the minimal-profile path that still depends on uv.
| print(f"Dockerfile created at {dockerfile_path}") | ||
| return dockerfile_path | ||
|
|
||
| def write_project_dockerignore(self) -> Path: |
There was a problem hiding this comment.
Good addon!
Note, that when we add the standardized method to load models we should include the model in the docker container for faster boot times.
There was a problem hiding this comment.
Thanks. Added a TODO next to COPY . . in docker_template.j2 so we don't lose track of baking model files into the image once the standardized model-loading hook lands.
There was a problem hiding this comment.
This was the bug you experienced,
please see the global comment I made about this; and workout a clean strategy for doing it in code and describing it not confusing to the user...
There was a problem hiding this comment.
Shortened the rationale (the 5-line block was overkill), kept the one-liner about runpod returning a non-ASGI router. The clean strategy for the bigger build/deploy/dev separation will go in the separate PR I outlined in the top-level comment.
|
Hey Matthias, my proposal is to split what the CLI does today into three clean cases instead of mixing them:
The CLI override fix from this PR stays as is (flags don't overwrite the json unless passed). On the 7 review comments you left: I'll address those here in this PR so it merges clean. The bigger architectural change ( One thing I'm not sure about: is "ask the first time, then remember" the right UX for the build target, or do you want a hard default of socaity-runpod with no question? @K4rlosReyes how do you see this? Anything I'm missing from the runtime side? |
- entrypoint.py: drop duplicated orchestrator/compute/provider defaults from the detector result. The DeploymentConfig dataclass and the scanner .get(default) calls already own those, so the detector only carries fields it actually populates from user code. - scanner.py: rename _write_starter_files -> _write_deploy_dir_helpers, add docstring explaining the two files it drops (README.md from the starter template, and .dockerignore) and that user-created versions are preserved. - framework.py: add docstring + inline note to _is_model_json so the three-layer heuristic (filename whitelist, blocklist, content sniff) is readable without tracing the function. - docker_factory.py: remove the astral-sh/uv base image from DEFAULT_IMAGES and from the _load_images preferred list, and drop the astral-favoured branch in recommend_image's minimal-profile lookup. The minimal Jinja template still depends on uv so recommend_base_image in profile.py keeps returning the astral image for that profile only. - docker_template.j2: shorten the CMD comment block (the 5-line runpod rationale was verbose), and add a TODO near COPY . . for baking model files into the image once APIPod ships a standardized model-loading hook (per Matthias's note on factory:147). Tests: test_render_dockerfile.py + test_deploy_profile.py, 8/8 passing.
Summary
The Dockerfile generated by
apipod --buildusedCMD ["uvicorn", "service:app", ...]for every provider. But when
APIPod(provider="runpod", ...)is used (the standardRunPod serverless config), the factory returns a
SocaityRunpodRouterthat is notASGI-callable. uvicorn imports it, the container starts cleanly, then every request
500s with:
In RunPod this manifests as endless re-provisioning — the orchestrator sees the
failing health checks, tears the container down, retries forever. This affects every
serverless RunPod service built with
apipod --build, not just Qwen.Fix
The service's own
main()already does the right thing: it callsapp.start(port, host),which for
provider=runpodhands off to runpod's serverless worker harness. So theCMD just needs to launch the script directly when
provider=runpod:This matches the pattern already used in
docker_template_minimal.j2. uvicorn ispreserved for all other providers.
Changes
apipod/deploy/docker_template.j2— CMD is now conditional onproviderapipod/deploy/docker_factory.py— passesentrypoint_scriptto the template contextVerified locally
Rebuilt the qwen-models container with the patched Dockerfile:
Before (uvicorn):
After (python script):
The worker boots into the correct runpod harness; in production it picks up jobs from
the orchestrator instead of looking for a local
test_input.json.Test plan
apipod --buildon a service withprovider=runpodand confirm the CMD ispython-basedapipod --buildon a service withprovider=localhost/scalewayand confirm CMD is still uvicorn