Fix Dockerfile CMD for provider=runpod (endless re-provisioning loop) #14

Open
nataliasocaity wants to merge 9 commits into dev from fix/dockerfile-cmd-runpod

Conversation

@nataliasocaity

Summary

The Dockerfile generated by apipod --build used CMD ["uvicorn", "service:app", ...]
for every provider. But when APIPod(provider="runpod", ...) is used (the standard
RunPod serverless config), the factory returns a SocaityRunpodRouter that is not
ASGI-callable. uvicorn imports it, the container starts cleanly, then every request
500s with:

TypeError: 'SocaityRunpodRouter' object is not callable

In RunPod this manifests as endless re-provisioning — the orchestrator sees the
failing health checks, tears the container down, retries forever. This affects every
serverless RunPod service built with apipod --build, not just Qwen.
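
The failure mode can be reproduced without uvicorn or RunPod. Below is a minimal sketch, assuming a stand-in class: FakeRunpodRouter is hypothetical and only mimics the one property described above, namely that the object defines no __call__. An ASGI server invokes the app as app(scope, receive, send) on each request, which is why import succeeds and the error only shows up at request time:

```python
class FakeRunpodRouter:
    """Hypothetical stand-in for the router: it exposes start() but no __call__."""

    def start(self, port: int = 8000, host: str = "0.0.0.0") -> str:
        # The real router would hand off to the serverless worker harness here.
        return f"worker harness started on {host}:{port}"


def serve_one_request(app) -> str:
    """Mimic the per-request call an ASGI server makes: app(scope, receive, send)."""
    scope = {"type": "http", "path": "/"}
    try:
        app(scope, None, None)
    except TypeError as exc:
        # This is the point where the server would answer with a 500.
        return str(exc)
    return "ok"


# Creating the object (the "import" step) works fine; only the call fails.
router = FakeRunpodRouter()
error = serve_one_request(router)
print(error)
```

Calling router.start() in the same sketch works, which mirrors why launching the script directly avoids the problem.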

Fix

The service's own main() already does the right thing: it calls app.start(port, host),
which for provider=runpod hands off to runpod's serverless worker harness. So the
CMD just needs to launch the script directly when provider=runpod:

CMD ["python", "service.py", "--rp_api_host", "0.0.0.0", "--rp_api_port", "8000"]

This matches the pattern already used in docker_template_minimal.j2. uvicorn is
preserved for all other providers.
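
The conditional can be sketched in plain Python to make the branching explicit. This is an illustration only, not the template code: render_cmd is a hypothetical helper, the real logic lives in the Jinja template, and the uvicorn arguments shown (--host, --port) are assumed because the original CMD elides them.

```python
import json


def render_cmd(provider: str, entrypoint_script: str = "service.py",
               app_ref: str = "service:app", port: int = 8000) -> str:
    """Return a Dockerfile CMD line for the given provider (hypothetical helper)."""
    if provider == "runpod":
        # Launch the script itself so its main() can hand off to the worker harness.
        argv = ["python", entrypoint_script,
                "--rp_api_host", "0.0.0.0", "--rp_api_port", str(port)]
    else:
        # Assumed uvicorn arguments; the original CMD elides them with "...".
        argv = ["uvicorn", app_ref, "--host", "0.0.0.0", "--port", str(port)]
    # json.dumps yields the double-quoted exec form that Dockerfiles expect.
    return "CMD " + json.dumps(argv)


print(render_cmd("runpod"))
print(render_cmd("localhost"))
```

Using the exec (JSON array) form keeps the process as PID 1 without a shell wrapper, which is the same form both CMD variants in this PR use.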

Changes

  • apipod/deploy/docker_template.j2 — CMD is now conditional on provider
  • apipod/deploy/docker_factory.py — passes entrypoint_script to the template context

Verified locally

Rebuilt the qwen-models container with the patched Dockerfile:

Before (uvicorn):

INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000
ERROR:    Exception in ASGI application
TypeError: 'SocaityRunpodRouter' object is not callable

After (python script):

--- Starting Serverless Worker |  Version 1.9.0 ---
WARN   | test_input.json not found, exiting.

The worker boots into the correct runpod harness; in production it picks up jobs from
the orchestrator instead of looking for a local test_input.json.

Test plan

  • Run apipod --build on a service with provider=runpod and confirm the CMD is python-based
  • Run apipod --build on a service with provider=localhost/scaleway and confirm CMD is still uvicorn
  • Deploy on RunPod and confirm no more re-provisioning loop

w4hns1nn and others added 7 commits April 8, 2026 18:47
* fix: runpod router uses latest job result

* dev: improved scan and build

* fix: scan flow was detecting ml when there was not
The rendered Dockerfile used `CMD uvicorn service:app` for every provider,
but APIPod(provider="runpod") returns a SocaityRunpodRouter that is not an
ASGI application. uvicorn imports it, the container looks healthy from the
outside, then every request 500s with:

    TypeError: 'SocaityRunpodRouter' object is not callable

In RunPod's deploy flow this manifests as endless re-provisioning: the
orchestrator sees the failing requests, tears the container down, retries.

Switch the CMD to `python <service>.py --rp_api_host 0.0.0.0 --rp_api_port 8000`
when provider=runpod, so app.start() hands off to runpod's worker harness
(verified locally: `Starting Serverless Worker | Version 1.9.0`).

uvicorn behavior is preserved for all other providers.
@w4hns1nn
Contributor

w4hns1nn commented Jun 1, 2026

This works and is correct. Please check out face2face to see how it was done there, and whether it was done in a generalized way.

apipod build should still create the Dockerfile correctly based on the -- settings, though. This needs a check.

@nataliasocaity
Author

nataliasocaity commented Jun 2, 2026

Thanks for the pointers, I went through the three things before touching anything else. Wanted to check with you that I'm reading this right.

1. face2face

I cloned it and looked at apipod-deploy/Dockerfile and apipod.json. It doesn't look generalized to me — the Dockerfile is hand-written, the paths are Windows-style (face2face\server.py), and it ends with CMD ["uvicorn", "face2face\server:app", ...], which would hit the same issue Qwen had if it got rebuilt today. So I don't think there's anything to copy from there. Let me know if I'm missing something.

2. "apipod build should create the Dockerfile correctly based on the settings"

I think I see what you mean. I set up a small test project and ran apipod --build three times to compare:

| apipod.json says | CLI flag I passed | What ended up in the Dockerfile |
| --- | --- | --- |
| provider: runpod | (nothing) | ENV APIPOD_PROVIDER="localhost" + CMD uvicorn |
| provider: runpod (minimal profile) | (nothing) | ENV APIPOD_PROVIDER="localhost" + CMD python ... (CMD ended up right by accident, ENVs still wrong) |
| provider: runpod | --provider runpod | ENV APIPOD_PROVIDER="runpod" + CMD python ... (this is what we want) |

I think the reason is in apipod/cli.py:133-135:

config_data["orchestrator"] = args.orchestrator
config_data["compute"]      = args.compute
config_data["provider"]     = args.provider

Since --provider, --compute and --orchestrator default to "localhost" / "dedicated" / "local" in argparse, they always overwrite whatever was in apipod.json, even when I don't pass them on the command line. So the CMD fix in this PR works, but only if I remember to pass --provider runpod. With a plain apipod --build, the provider has already been replaced with localhost by the time the template renders, so my conditional never fires.
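
A small standalone sketch of the pattern that avoids this, using only argparse from the standard library. The parser and the merge_cli_overrides helper are hypothetical and not the code in cli.py; the "compute": "serverless" value is an illustrative assumption. The point is that default=None makes "flag omitted" distinguishable from "flag explicitly set to localhost":

```python
import argparse

OVERRIDABLE_KEYS = ("provider", "compute", "orchestrator")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="apipod")
    for key in OVERRIDABLE_KEYS:
        # default=None means "the user did not pass this flag".
        parser.add_argument(f"--{key}", default=None)
    return parser


def merge_cli_overrides(config_data: dict, args: argparse.Namespace) -> dict:
    """Copy config_data and overwrite only the keys whose flags were passed."""
    merged = dict(config_data)
    for key in OVERRIDABLE_KEYS:
        value = getattr(args, key)
        if value is not None:
            merged[key] = value
    return merged


# Stand-in for what was loaded from apipod.json (values are illustrative).
config_from_json = {"provider": "runpod", "compute": "serverless"}

plain = merge_cli_overrides(config_from_json, build_parser().parse_args([]))
explicit = merge_cli_overrides(
    config_from_json, build_parser().parse_args(["--provider", "localhost"])
)
print(plain["provider"], explicit["provider"])
```

With string defaults in argparse, the first call would have silently replaced "runpod"; with None, the file value survives unless the flag is given.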

I also checked the tests folder — there are tests for infer_profile() and for the APIPod() runtime factory, but I didn't find anything that checks what comes out of render_dockerfile() from a CLI run. That might be why this wasn't caught.

3. Issues from yesterday

I didn't find new issues in GitHub for APIPod / qwen-models / face2face. I assumed you meant the frictions list I've been keeping in qwen-models/FRICTIONS.md — the CMD one is #14 there. Should I add the CLI-overriding one as #15?

What I think I should do next (want to confirm before pushing anything)

If I'm reading you right, this PR needs two more bits:

  • In cli.py: change the defaults of --provider / --compute / --orchestrator to None, and only overwrite config_data when the flag was actually passed.
  • Add a small test that builds a fake project with apipod.json saying provider: runpod and checks the rendered Dockerfile has APIPOD_PROVIDER="runpod" and the python CMD.

Is that what you had in mind, or am I going further than you wanted for this PR?

@nataliasocaity nataliasocaity requested a review from w4hns1nn June 2, 2026 08:46
Contributor

@w4hns1nn w4hns1nn left a comment

In general this seems reasonable, but without testing or an IDE I can't tell whether the code actually makes sense.

Comment thread apipod/deploy/detectors/entrypoint.py Outdated
"found_config": False
"title": "apipod-service",
"found_config": False,
"orchestrator": "local",
Contributor

Why has the python bookworm image been introduced as a default?

Author

Good catch. Those defaults were already in DeploymentConfig and in the scanner's .get(default) calls, so I was duplicating. Dropped them from the detector dict and added a comment so we don't add them back later.

Comment thread apipod/deploy/scanner.py
Contributor

What is a starter file and readme?

Author

Fair, the name was vague. Renamed to _write_deploy_dir_helpers and added a docstring explaining the two files it drops (README from the starter template and .dockerignore) and that user-created versions are preserved.
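
The "user-created versions are preserved" behaviour comes down to a write-if-missing check. A minimal sketch under stated assumptions: both function names and the file contents below are hypothetical, and only the two file names (README.md and .dockerignore) come from the comment above.

```python
import tempfile
from pathlib import Path


def write_if_missing(path: Path, content: str) -> bool:
    """Write content to path unless the file already exists. Returns True if written."""
    if path.exists():
        return False  # never clobber a file the user created or edited
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content, encoding="utf-8")
    return True


def write_deploy_dir_helpers(deploy_dir: Path) -> dict:
    """Drop the two helper files and report which ones were actually written."""
    return {
        "README.md": write_if_missing(deploy_dir / "README.md", "# Deployment files\n"),
        ".dockerignore": write_if_missing(deploy_dir / ".dockerignore", ".git\n__pycache__/\n"),
    }


with tempfile.TemporaryDirectory() as tmp:
    deploy_dir = Path(tmp) / "apipod-deploy"
    deploy_dir.mkdir()
    # Simulate a README the user already customised.
    (deploy_dir / "README.md").write_text("my notes\n", encoding="utf-8")
    written = write_deploy_dir_helpers(deploy_dir)
    readme_after = (deploy_dir / "README.md").read_text(encoding="utf-8")
    print(written, repr(readme_after))
```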

…lag is omitted

A plain `apipod --build` was overwriting the values from apipod.json with
argparse defaults ("localhost"/"dedicated"/"local"), so a service configured
with provider: runpod was rendering uvicorn as CMD in the Dockerfile. The
CMD conditional in docker_template.j2 already routed correctly when the
provider arrived as "runpod", but it never did because the CLI replaced it
before the template was rendered.

Fix: drop the argparse defaults to None and only merge into config_data when
the user actually passed the flag. run_start keeps the historical defaults
("local"/"dedicated"/"localhost") as a fallback so `apipod --start` without
flags still behaves the same.

Tests cover render_dockerfile against both provider values plus the minimal
profile, where the CMD is hardcoded but the ENV must still propagate.
@w4hns1nn
Contributor

w4hns1nn commented Jun 8, 2026

Hey,

I looked at it again, and the confusion comes from the table: the difference between what you experience in local development and what happens on apipod deploy.

If you are debugging your code locally and no setting is set, the service starts as a FastAPI service, as it should, and you test it by navigating to the /docs route on your server. If you want to see how the service would behave when deployed on socaity, you set the --localhost flag to emulate the service with the queue.
The table does not show what the Dockerfile looks like without any settings, and this must differ from local debugging. If you do not specify any settings and use apipod --build for socaity, it must default to runpod --serverless and also change the run command in the Dockerfile, because socaity deploys to runpod serverless by default.

We need to find a clean way to separate "build/deploy" in the Dockerfile from local "testing" for developers, to avoid this confusion.

@nataliasocaity @K4rlosReyes
Please work out a solution that keeps things clean for developers who want to write and debug their code locally, versus what happens on the apipod build, scan and deploy commands.
Note that on apipod --deploy we also need to overwrite other settings (when deployed via socaity).

Then update the PR and I will have another look at it.

Comment thread apipod/deploy/docker_factory.py Outdated
"python:3.11-slim",
"python:3.10-slim",
"nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04", # Standard CUDA Runtime
"ghcr.io/astral-sh/uv:python3.12-bookworm-slim",
Contributor

This change remains mysterious; it's probably an error.

Author

Removed it from DEFAULT_IMAGES and from the preferred list in _load_images. Heads up: profile.py:173 still returns the astral image for PROFILE_SERVERLESS_MINIMAL because docker_template_minimal.j2 uses uv. If you want to drop that path entirely, let me know and I'll open a separate PR for it.

content = json.load(handle)
if isinstance(content, dict):
keys = content.keys()
model_keys = {
Contributor

Some comments, like before, would help explain what's going on.

Author

Added a docstring to _is_model_json walking through the three-layer check (filename whitelist, blocklist, content sniff) and an inline comment on the HF model_keys set.
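
For readers without the file open, a three-layer check of this kind might look roughly like the sketch below. It is an assumption-heavy illustration, not the code in framework.py: the three sets are made-up examples, and how the real function orders or combines the layers is not shown in this thread.

```python
import json
import tempfile
from pathlib import Path

# All three sets are illustrative assumptions, not the project's real lists.
KNOWN_MODEL_FILENAMES = {"generation_config.json"}
NON_MODEL_FILENAMES = {"package.json", "tsconfig.json", "apipod.json"}
MODEL_KEYS = {"model_type", "architectures", "hidden_size"}


def is_model_json(path: Path) -> bool:
    """Guess whether a JSON file describes an ML model (hypothetical sketch)."""
    name = path.name.lower()
    # Layer 1: filename whitelist.
    if name in KNOWN_MODEL_FILENAMES:
        return True
    # Layer 2: filename blocklist.
    if name in NON_MODEL_FILENAMES:
        return False
    # Layer 3: content sniff, looking for typical model-config keys.
    try:
        with path.open(encoding="utf-8") as handle:
            content = json.load(handle)
    except (OSError, ValueError):  # unreadable file or invalid JSON
        return False
    if isinstance(content, dict):
        return not MODEL_KEYS.isdisjoint(content)
    return False


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "config.json").write_text(json.dumps({"model_type": "qwen2"}), encoding="utf-8")
    (root / "package.json").write_text(json.dumps({"model_type": "oops"}), encoding="utf-8")
    (root / "settings.json").write_text(json.dumps({"debug": True}), encoding="utf-8")
    results = {p.name: is_model_json(p) for p in sorted(root.iterdir())}
    print(results)
```

The blocklist entry wins over the content sniff here, which is one way to avoid the false positive described in the "scan flow was detecting ml" commit; whether the real function resolves conflicts the same way is an assumption.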

Comment thread apipod/deploy/docker_factory.py Outdated
if profile == PROFILE_SERVERLESS_MINIMAL:
version = str(config.get("python_version") or "3.12")
for img in self.images:
if f"python{version}" in img and "astral-sh/uv" in img:
Contributor

same astral image thing

Author

Removed in the same commit as the DEFAULT_IMAGES change. See the note in the line 22 thread about the minimal-profile path that still depends on uv.

print(f"Dockerfile created at {dockerfile_path}")
return dockerfile_path

def write_project_dockerignore(self) -> Path:
Contributor

Good add-on!

Note that when we add the standardized method to load models, we should include the model in the Docker container for faster boot times.

Author

Thanks. Added a TODO next to COPY . . in docker_template.j2 so we don't lose track of baking model files into the image once the standardized model-loading hook lands.

Contributor

This was the bug you experienced.
Please see the global comment I made about this, and work out a clean strategy for doing it in code and for describing it to the user without confusion.

Author

Shortened the rationale (the 5-line block was overkill), kept the one-liner about runpod returning a non-ASGI router. The clean strategy for the bigger build/deploy/dev separation will go in the separate PR I outlined in the top-level comment.

@nataliasocaity
Author

Hey Matthias, my proposal is to split what the CLI does today into three clean cases instead of mixing them:

  • python service.py or apipod --start with nothing: local dev mode, plain FastAPI on /docs. Same defaults the scanner writes today (local/dedicated/localhost). This already works, we don't touch it.
  • apipod --start --emulate (new shortcut): run locally with the queue, so the dev can see how it'll behave on socaity before building anything.
  • apipod --build: this is where it breaks today. The scanner defaults are wrong for build because they describe local dev, not deploy. My idea: the first time the dev runs --build without a target set, the CLI asks them which target (socaity-runpod / self-hosted runpod / local docker) and saves it. Next runs are silent. No magic defaults that flip their config behind their back.
  • apipod --deploy (doesn't exist yet): build + push + override the socaity-managed settings the dev shouldn't have to know. Same command for external devs and for us, no separate flow.

The CLI override fix from this PR stays as is (flags don't overwrite the json unless passed).

On the 7 review comments you left: I'll address those here in this PR so it merges clean. The bigger architectural change (--emulate, --build target prompt, --deploy) will go in a separate new PR so they don't get tangled.

One thing I'm not sure about: is "ask the first time, then remember" the right UX for the build target, or do you want a hard default of socaity-runpod with no question?
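
To make the "ask the first time, then remember" option concrete, here is a rough sketch. Everything in it is hypothetical: the build_target key, the target identifiers, and the resolve_build_target function do not exist in the CLI today, and the prompt is injected as a callable so the behaviour can be exercised without a terminal.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable

# Hypothetical identifiers for the three targets named in the proposal.
BUILD_TARGETS = ("socaity-runpod", "self-hosted-runpod", "local-docker")


def resolve_build_target(config_path: Path, ask: Callable[[str], str] = input) -> str:
    """Return the saved build target, prompting and saving it on the first run."""
    config = {}
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    saved = config.get("build_target")
    if saved in BUILD_TARGETS:
        return saved  # later runs are silent
    answer = ask(f"Build target {list(BUILD_TARGETS)}: ").strip()
    if answer not in BUILD_TARGETS:
        raise ValueError(f"unknown build target: {answer!r}")
    config["build_target"] = answer
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return answer


def must_not_ask(prompt: str) -> str:
    raise AssertionError("prompt shown although a target was already saved")


with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "apipod.json"
    cfg.write_text(json.dumps({"provider": "runpod"}), encoding="utf-8")
    first = resolve_build_target(cfg, ask=lambda prompt: "socaity-runpod")
    second = resolve_build_target(cfg, ask=must_not_ask)
    saved_config = json.loads(cfg.read_text(encoding="utf-8"))
    print(first, second, saved_config)
```

A hard default of socaity-runpod would replace the prompt branch with a constant; the saved-value lookup would stay the same either way.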

@K4rlosReyes how do you see this? Anything I'm missing from the runtime side?

- entrypoint.py: drop duplicated orchestrator/compute/provider defaults
  from the detector result. The DeploymentConfig dataclass and the scanner
  .get(default) calls already own those, so the detector only carries
  fields it actually populates from user code.
- scanner.py: rename _write_starter_files -> _write_deploy_dir_helpers,
  add docstring explaining the two files it drops (README.md from the
  starter template, and .dockerignore) and that user-created versions
  are preserved.
- framework.py: add docstring + inline note to _is_model_json so the
  three-layer heuristic (filename whitelist, blocklist, content sniff)
  is readable without tracing the function.
- docker_factory.py: remove the astral-sh/uv base image from
  DEFAULT_IMAGES and from the _load_images preferred list, and drop the
  astral-favoured branch in recommend_image's minimal-profile lookup.
  The minimal Jinja template still depends on uv so recommend_base_image
  in profile.py keeps returning the astral image for that profile only.
- docker_template.j2: shorten the CMD comment block (the 5-line runpod
  rationale was verbose), and add a TODO near COPY . . for baking model
  files into the image once APIPod ships a standardized model-loading
  hook (per Matthias's note on factory:147).

Tests: test_render_dockerfile.py + test_deploy_profile.py, 8/8 passing.
3 participants