Fix Dockerfile CMD for provider=runpod (endless re-provisioning loop) by nataliasocaity · Pull Request #14 · SocAIty/APIPod

nataliasocaity · 2026-06-01T10:18:25Z

Summary

The Dockerfile generated by apipod --build used CMD ["uvicorn", "service:app", ...]
for every provider. But when APIPod(provider="runpod", ...) is used (the standard
RunPod serverless config), the factory returns a SocaityRunpodRouter that is not
ASGI-callable. uvicorn imports it, the container starts cleanly, then every request
500s with:

TypeError: 'SocaityRunpodRouter' object is not callable

In RunPod this manifests as endless re-provisioning — the orchestrator sees the
failing health checks, tears the container down, retries forever. This affects every
serverless RunPod service built with apipod --build, not just Qwen.
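
The failure mode can be reproduced without uvicorn or RunPod. The following is a minimal sketch, assuming only that the router class defines no `__call__`; `FakeRunpodRouter` and `call_like_asgi_server` are hypothetical stand-ins, not APIPod code.

```python
import asyncio


class FakeRunpodRouter:
    """Hypothetical stand-in for a router object that defines no __call__."""

    def start(self, port: int = 8000, host: str = "0.0.0.0") -> str:
        # In a real service this is where a worker harness would take over.
        return f"worker started on {host}:{port}"


async def call_like_asgi_server(app) -> str:
    """Invoke `app` the way an ASGI server does: app(scope, receive, send)."""
    scope = {"type": "http", "path": "/"}
    try:
        await app(scope, None, None)
    except TypeError as exc:
        return str(exc)
    return "ok"


router = FakeRunpodRouter()
# Importing and holding the object is fine; only the call at request time fails.
is_callable = callable(router)
error = asyncio.run(call_like_asgi_server(router))
print(is_callable, error)
```

Nothing fails at import time, which is consistent with the container starting cleanly and only breaking once requests arrive.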

Fix

The service's own main() already does the right thing: it calls app.start(port, host),
which for provider=runpod hands off to runpod's serverless worker harness. So the
CMD just needs to launch the script directly when provider=runpod:

CMD ["python", "service.py", "--rp_api_host", "0.0.0.0", "--rp_api_port", "8000"]

This matches the pattern already used in docker_template_minimal.j2. uvicorn is
preserved for all other providers.
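
The conditional can be sketched in plain Python as follows. This is an illustration of the branching only, not the actual Jinja template: `render_cmd` is a hypothetical helper, and the uvicorn arguments elided in the original CMD are left out here as well.

```python
import json
from pathlib import Path


def render_cmd(provider: str, entrypoint_script: str = "service.py") -> str:
    """Return a Dockerfile CMD line for the given provider (illustrative only)."""
    if provider == "runpod":
        # Launch the script directly so its main() can hand off to the worker harness.
        argv = ["python", entrypoint_script,
                "--rp_api_host", "0.0.0.0", "--rp_api_port", "8000"]
    else:
        # Other providers keep uvicorn. Further uvicorn arguments are elided
        # in the PR description and therefore omitted in this sketch.
        module = Path(entrypoint_script).stem
        argv = ["uvicorn", f"{module}:app"]
    # json.dumps produces the exec-form array syntax that CMD expects.
    return "CMD " + json.dumps(argv)


runpod_cmd = render_cmd("runpod")
localhost_cmd = render_cmd("localhost")
print(runpod_cmd)
print(localhost_cmd)
```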

Changes

apipod/deploy/docker_template.j2 — CMD is now conditional on provider
apipod/deploy/docker_factory.py — passes entrypoint_script to the template context

Verified locally

Rebuilt the qwen-models container with the patched Dockerfile:

Before (uvicorn):

INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000
ERROR:    Exception in ASGI application
TypeError: 'SocaityRunpodRouter' object is not callable

After (python script):

--- Starting Serverless Worker |  Version 1.9.0 ---
WARN   | test_input.json not found, exiting.

The worker boots into the correct runpod harness; in production it picks up jobs from
the orchestrator instead of looking for a local test_input.json.

Test plan

Run apipod --build on a service with provider=runpod and confirm the CMD is python-based
Run apipod --build on a service with provider=localhost/scaleway and confirm CMD is still uvicorn
Deploy on RunPod and confirm no more re-provisioning loop

Dev

* fix: runpod router uses latest job result
* dev: improved scan and build
* fix: scan flow was detecting ml when there was not

The rendered Dockerfile used `CMD uvicorn service:app` for every provider, but APIPod(provider="runpod") returns a SocaityRunpodRouter that is not an ASGI application. uvicorn imports it, the container looks healthy from the outside, then every request 500s with:

TypeError: 'SocaityRunpodRouter' object is not callable

In RunPod's deploy flow this manifests as endless re-provisioning: the orchestrator sees the failing requests, tears the container down, retries.

Switch the CMD to `python <service>.py --rp_api_host 0.0.0.0 --rp_api_port 8000` when provider=runpod, so app.start() hands off to runpod's worker harness (verified locally: `Starting Serverless Worker | Version 1.9.0`). uvicorn behavior is preserved for all other providers.

w4hns1nn · 2026-06-01T13:34:43Z

This works and is correct. Please check out how it was done in face2face, and whether it was done there in a generalized way.

apipod build should still create the Dockerfile correctly based on the -- settings, though. This needs a check.

nataliasocaity · 2026-06-02T08:39:52Z

Thanks for the pointers, I went through the three things before touching anything else. Wanted to check with you that I'm reading this right.

1. face2face

I cloned it and looked at apipod-deploy/Dockerfile and apipod.json. It doesn't look generalized to me — the Dockerfile is hand-written, the paths are Windows-style (face2face\server.py), and it ends with CMD ["uvicorn", "face2face\server:app", ...], which would hit the same issue Qwen had if it got rebuilt today. So I don't think there's anything to copy from there. Let me know if I'm missing something.

2. "apipod build should create the Dockerfile correctly based on the settings"

I think I see what you mean. I set up a small test project and ran apipod --build three times to compare:

| `apipod.json` says | CLI flag I passed | What ended up in the Dockerfile |
| --- | --- | --- |
| `provider: runpod` | (nothing) | `ENV APIPOD_PROVIDER="localhost"` + `CMD uvicorn` |
| `provider: runpod` (minimal profile) | (nothing) | `ENV APIPOD_PROVIDER="localhost"` + `CMD python ...` (CMD ended up right by accident, ENVs still wrong) |
| `provider: runpod` | `--provider runpod` | `ENV APIPOD_PROVIDER="runpod"` + `CMD python ...` (this is what we want) |

I think the reason is in apipod/cli.py:133-135:

config_data["orchestrator"] = args.orchestrator
config_data["compute"]      = args.compute
config_data["provider"]     = args.provider

Since --provider, --compute and --orchestrator default to "localhost" / "dedicated" / "local" in argparse, they always overwrite whatever was in apipod.json, even when I don't pass them on the command line. So the CMD fix in this PR works, but only if I remember to pass --provider runpod. With a plain apipod --build, the provider has already been replaced with localhost by the time the template renders, so my conditional never fires.
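
To make the overwrite concrete, here is a small stand-alone reproduction. It is a sketch that assumes only the argparse defaults and the three assignments quoted above; `build_parser` and `merge_unconditionally` are hypothetical names, not functions from cli.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="apipod")
    # Non-None defaults: argparse fills these in even when the flag is absent.
    parser.add_argument("--provider", default="localhost")
    parser.add_argument("--compute", default="dedicated")
    parser.add_argument("--orchestrator", default="local")
    return parser


def merge_unconditionally(config_data: dict, args: argparse.Namespace) -> dict:
    """Mirror the three assignments above on a copy of the config."""
    merged = dict(config_data)
    merged["orchestrator"] = args.orchestrator
    merged["compute"] = args.compute
    merged["provider"] = args.provider
    return merged


from_json = {"provider": "runpod"}          # what apipod.json says
args = build_parser().parse_args([])        # plain `apipod --build`, no flags
merged = merge_unconditionally(from_json, args)
print(merged["provider"])
```

Because `args.provider` can never be distinguished from "not passed", the value from the JSON file is lost before any template sees it.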

I also checked the tests folder — there are tests for infer_profile() and for the APIPod() runtime factory, but I didn't find anything that checks what comes out of render_dockerfile() from a CLI run. That might be why this wasn't caught.

3. Issues from yesterday

I didn't find new issues in GitHub for APIPod / qwen-models / face2face. I assumed you meant the frictions list I've been keeping in qwen-models/FRICTIONS.md — the CMD one is #14 there. Should I add the CLI-overriding one as #15?

What I think I should do next (want to confirm before pushing anything)

If I'm reading you right, this PR needs two more bits:

In cli.py: change the defaults of --provider / --compute / --orchestrator to None, and only overwrite config_data when the flag was actually passed.
Add a small test that builds a fake project with apipod.json saying provider: runpod and checks the rendered Dockerfile has APIPOD_PROVIDER="runpod" and the python CMD.
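
A minimal sketch of the first point, together with the kind of assertion the second point asks for. `FLAGS`, `build_parser` and `merge_passed_flags` are hypothetical names for illustration; the real change would live in cli.py.

```python
import argparse

FLAGS = ("provider", "compute", "orchestrator")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="apipod")
    for name in FLAGS:
        # default=None lets us tell "flag omitted" apart from "flag passed".
        parser.add_argument(f"--{name}", default=None)
    return parser


def merge_passed_flags(config_data: dict, args: argparse.Namespace) -> dict:
    """Overwrite config values only for flags the user actually passed."""
    merged = dict(config_data)
    for name in FLAGS:
        value = getattr(args, name)
        if value is not None:
            merged[name] = value
    return merged


from_json = {"provider": "runpod"}
parser = build_parser()
plain_build = merge_passed_flags(from_json, parser.parse_args([]))
explicit = merge_passed_flags(from_json, parser.parse_args(["--provider", "localhost"]))
print(plain_build, explicit)
```

With this shape, the value from apipod.json survives a plain build, while an explicit flag still wins.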

Is that what you had in mind, or am I going further than you wanted for this PR?

w4hns1nn

In general this seems reasonable, but without testing or an IDE I can't tell whether that code actually makes sense.

w4hns1nn · 2026-06-04T11:40:51Z

-            "found_config": False
+            "title": "apipod-service",
+            "found_config": False,
+            "orchestrator": "local",


Why was the python bookworm image introduced as a default?

Good catch. Those defaults were already in DeploymentConfig and in the scanner's .get(default) calls, so I was duplicating. Dropped them from the detector dict and added a comment so we don't add them back later.

w4hns1nn · 2026-06-04T12:13:24Z

What is a starter file and readme?

Fair, the name was vague. Renamed to _write_deploy_dir_helpers and added a docstring explaining the two files it drops (README from the starter template and .dockerignore) and that user-created versions are preserved.
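
The "user-created versions are preserved" behaviour can be sketched like this. The helper names and file contents below are hypothetical; only the two file names and the preserve-if-present rule come from the comment above.

```python
import tempfile
from pathlib import Path


def write_if_missing(path: Path, content: str) -> bool:
    """Write `content` to `path` unless a file is already there."""
    if path.exists():
        return False  # keep the user's version untouched
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content, encoding="utf-8")
    return True


def write_deploy_dir_helpers(deploy_dir: Path) -> dict:
    """Drop a README and a .dockerignore; report which ones were written."""
    return {
        "README.md": write_if_missing(deploy_dir / "README.md", "# Deploy directory\n"),
        ".dockerignore": write_if_missing(deploy_dir / ".dockerignore", "__pycache__/\n.git/\n"),
    }


with tempfile.TemporaryDirectory() as tmp:
    deploy_dir = Path(tmp) / "apipod-deploy"
    deploy_dir.mkdir()
    # Simulate a README the user wrote before the scan ran.
    (deploy_dir / "README.md").write_text("my own notes\n", encoding="utf-8")
    written = write_deploy_dir_helpers(deploy_dir)
    readme_after = (deploy_dir / "README.md").read_text(encoding="utf-8")
    dockerignore_exists = (deploy_dir / ".dockerignore").exists()

print(written)
```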

…lag is omitted

A plain `apipod --build` was overwriting the values from apipod.json with argparse defaults ("localhost"/"dedicated"/"local"), so a service configured with provider: runpod was rendering uvicorn as CMD in the Dockerfile. The CMD conditional in docker_template.j2 already routed correctly when the provider arrived as "runpod", but it never did because the CLI replaced it before the template was rendered.

Fix: drop the argparse defaults to None and only merge into config_data when the user actually passed the flag. run_start keeps the historical defaults ("local"/"dedicated"/"localhost") as a fallback so `apipod --start` without flags still behaves the same.

Tests cover render_dockerfile against both provider values plus the minimal profile, where the CMD is hardcoded but the ENV must still propagate.
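
The run_start fallback described in this commit message might look roughly like the sketch below. `resolve_start_settings` is a hypothetical name, and whether apipod.json is also consulted at this point is not stated in the message, so the sketch only covers "flag if passed, otherwise the historical default".

```python
HISTORICAL_START_DEFAULTS = {
    "orchestrator": "local",
    "compute": "dedicated",
    "provider": "localhost",
}


def resolve_start_settings(cli_flags: dict) -> dict:
    """Use a flag when it was passed (not None), else the historical default."""
    resolved = {}
    for key, fallback in HISTORICAL_START_DEFAULTS.items():
        value = cli_flags.get(key)
        resolved[key] = fallback if value is None else value
    return resolved


# `apipod --start` without flags: every value falls back.
no_flags = resolve_start_settings({"provider": None, "compute": None, "orchestrator": None})
# An explicit flag overrides only its own setting.
with_flag = resolve_start_settings({"provider": "runpod"})
print(no_flags, with_flag)
```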

w4hns1nn · 2026-06-08T07:06:30Z

Hey,

I looked at it again, and the confusion comes from the table, namely the difference between what you experience in local development and what happens on apipod deploy.

If you are debugging your code locally and no setting is set, the service starts as a FastAPI service, as it should, and you test it by navigating to the /docs route on your server. If you want to see what the service would look like deployed on socaity, you set the --localhost flag to emulate the service with the queue.
The table does not show what the Dockerfile looks like without any settings, and this must differ from local debugging. If you do not specify any settings and use apipod --build for socaity, it must default to runpod --serverless and also change the run command in the Dockerfile, because socaity deploys to runpod serverless by default.

We need to find a clean way to separate "build/deploy" in the Dockerfile from local "testing" for developers, to avoid this confusion.

@nataliasocaity @K4rlosReyes
Please work out a solution that keeps things clean for developers who want to write their code and debug locally, as opposed to what happens on the apipod build, scan and deploy commands.
Note: on apipod --deploy we also need to overwrite other settings (when deployed via socaity).

Then update the PR and I will have another look at it.

w4hns1nn · 2026-06-08T07:14:10Z

+        "python:3.11-slim",
        "python:3.10-slim",
-        "nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04", # Standard CUDA Runtime
+        "ghcr.io/astral-sh/uv:python3.12-bookworm-slim",


This change remains mysterious; it's probably an error.

Removed it from DEFAULT_IMAGES and from the preferred list in _load_images. Heads up: profile.py:173 still returns the astral image for PROFILE_SERVERLESS_MINIMAL because docker_template_minimal.j2 uses uv. If you want to drop that path entirely, let me know and I'll open a separate PR for it.

w4hns1nn · 2026-06-08T07:14:41Z

+                content = json.load(handle)
+            if isinstance(content, dict):
+                keys = content.keys()
+                model_keys = {


Some comments, like before, would help explain what's going on.

Added a docstring to _is_model_json walking through the three-layer check (filename whitelist, blocklist, content sniff) and an inline comment on the HF model_keys set.
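
For readers without the diff at hand, the three-layer check could be sketched as follows. This is not the code from framework.py: the function name, the three example sets, and the rule that the whitelist is checked before the blocklist are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical example sets; the real lists are not shown in this thread.
KNOWN_MODEL_FILENAMES = {"config.json", "generation_config.json"}
IGNORED_FILENAMES = {"package.json", "tsconfig.json", "apipod.json"}
MODEL_KEYS = {"model_type", "architectures"}  # keys found in Hugging Face configs


def looks_like_model_json(path: Path) -> bool:
    name = path.name.lower()
    # Layer 1: filename whitelist, accept without opening the file.
    if name in KNOWN_MODEL_FILENAMES:
        return True
    # Layer 2: filename blocklist, reject without opening the file.
    if name in IGNORED_FILENAMES:
        return False
    # Layer 3: content sniff, look for model-specific top-level keys.
    try:
        with path.open(encoding="utf-8") as handle:
            content = json.load(handle)
    except (OSError, ValueError):  # unreadable or not valid JSON
        return False
    if isinstance(content, dict):
        return any(key in content for key in MODEL_KEYS)
    return False


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "weights_meta.json").write_text(json.dumps({"model_type": "qwen2"}), encoding="utf-8")
    (root / "package.json").write_text(json.dumps({"model_type": "not-a-model"}), encoding="utf-8")
    (root / "records.json").write_text(json.dumps([1, 2, 3]), encoding="utf-8")
    names = ("weights_meta.json", "package.json", "records.json")
    results = {name: looks_like_model_json(root / name) for name in names}

print(results)
```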

w4hns1nn · 2026-06-08T07:16:03Z

+        if profile == PROFILE_SERVERLESS_MINIMAL:
+            version = str(config.get("python_version") or "3.12")
+            for img in self.images:
+                if f"python{version}" in img and "astral-sh/uv" in img:


same astral image thing

Removed in the same commit as the DEFAULT_IMAGES change. See the note in the line 22 thread about the minimal-profile path that still depends on uv.

w4hns1nn · 2026-06-08T07:17:09Z

        print(f"Dockerfile created at {dockerfile_path}")
        return dockerfile_path

+    def write_project_dockerignore(self) -> Path:


Good addon!

Note that when we add the standardized method to load models, we should include the model in the Docker container for faster boot times.

Thanks. Added a TODO next to COPY . . in docker_template.j2 so we don't lose track of baking model files into the image once the standardized model-loading hook lands.

w4hns1nn · 2026-06-08T07:18:08Z

This was the bug you experienced.
Please see the global comment I made about this, and work out a clean strategy for doing it in code and for describing it in a way that does not confuse the user.

Shortened the rationale (the 5-line block was overkill), kept the one-liner about runpod returning a non-ASGI router. The clean strategy for the bigger build/deploy/dev separation will go in the separate PR I outlined in the top-level comment.

nataliasocaity · 2026-06-08T08:55:14Z

Hey Matthias, my proposal is to split what the CLI does today into four clean cases instead of mixing them:

python service.py or apipod --start with nothing: local dev mode, plain FastAPI on /docs. Same defaults the scanner writes today (local/dedicated/localhost). This already works, we don't touch it.
apipod --start --emulate (new shortcut): run locally with the queue, so the dev can see how it'll behave on socaity before building anything.
apipod --build: this is where it breaks today. The scanner defaults are wrong for build because they describe local dev, not deploy. My idea: the first time the dev runs --build without a target set, the CLI asks them which target (socaity-runpod / self-hosted runpod / local docker) and saves it. Next runs are silent. No magic defaults that flip their config behind their back.
apipod --deploy (doesn't exist yet): build + push + override the socaity-managed settings the dev shouldn't have to know. Same command for external devs and for us, no separate flow.
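
The "ask the first time, then remember" idea for --build could be sketched as below. Everything here is hypothetical: the `build_target` key, the target identifiers, and the `resolve_build_target` helper do not exist in APIPod today. The prompt is injected as a callable so the flow can be exercised without a terminal.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable

# Hypothetical identifiers for the three targets named above.
BUILD_TARGETS = ("socaity-runpod", "self-hosted-runpod", "local-docker")


def resolve_build_target(config_path: Path, ask: Callable[[], str]) -> str:
    """Return the saved build target, prompting and saving only when missing."""
    config = {}
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    target = config.get("build_target")
    if target in BUILD_TARGETS:
        return target  # silent on later runs
    target = ask()
    if target not in BUILD_TARGETS:
        raise ValueError(f"unknown build target: {target!r}")
    config["build_target"] = target
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return target


prompts = []


def fake_prompt() -> str:
    prompts.append("asked")
    return "socaity-runpod"


with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "apipod.json"
    config_path.write_text(json.dumps({"provider": "runpod"}), encoding="utf-8")
    first = resolve_build_target(config_path, fake_prompt)
    second = resolve_build_target(config_path, fake_prompt)
    saved = json.loads(config_path.read_text(encoding="utf-8"))

print(first, second, len(prompts))
```

A hard default of socaity-runpod would amount to replacing the `ask` callable with a function that returns that constant.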

The CLI override fix from this PR stays as is (flags don't overwrite the json unless passed).

On the 7 review comments you left: I'll address those here in this PR so it merges clean. The bigger architectural change (--emulate, --build target prompt, --deploy) will go in a separate new PR so they don't get tangled.

One thing I'm not sure about: is "ask the first time, then remember" the right UX for the build target, or do you want a hard default of socaity-runpod with no question?

@K4rlosReyes how do you see this? Anything I'm missing from the runtime side?

- entrypoint.py: drop duplicated orchestrator/compute/provider defaults from the detector result. The DeploymentConfig dataclass and the scanner .get(default) calls already own those, so the detector only carries fields it actually populates from user code.
- scanner.py: rename _write_starter_files -> _write_deploy_dir_helpers, add docstring explaining the two files it drops (README.md from the starter template, and .dockerignore) and that user-created versions are preserved.
- framework.py: add docstring + inline note to _is_model_json so the three-layer heuristic (filename whitelist, blocklist, content sniff) is readable without tracing the function.
- docker_factory.py: remove the astral-sh/uv base image from DEFAULT_IMAGES and from the _load_images preferred list, and drop the astral-favoured branch in recommend_image's minimal-profile lookup. The minimal Jinja template still depends on uv so recommend_base_image in profile.py keeps returning the astral image for that profile only.
- docker_template.j2: shorten the CMD comment block (the 5-line runpod rationale was verbose), and add a TODO near COPY . . for baking model files into the image once APIPod ships a standardized model-loading hook (per Matthias's note on factory:147).

Tests: test_render_dockerfile.py + test_deploy_profile.py, 8/8 passing.

w4hns1nn and others added 7 commits April 8, 2026 18:47

Merge pull request #10 from SocAIty/dev

e078851

Dev

bump version to 1.0.5 [skip ci]

dd7d25a

Merge pull request #12 from SocAIty/dev

ae051bd

Dev

bump version to 1.0.6 [skip ci]

76a2d18

Fix/runpod and cli (#13)

3fb820b

* fix: runpod router uses latest job result
* dev: improved scan and build
* fix: scan flow was detecting ml when there was not

bump version to 1.0.7 [skip ci]

a87d571

nataliasocaity requested a review from K4rlosReyes June 1, 2026 10:18

nataliasocaity requested a review from w4hns1nn June 2, 2026 08:46
