feat: add Phase 1 Python client package over the server #411

Merged: inureyes merged 4 commits into main from feature/issue-407-python-client on Jun 23, 2026
@inureyes (Member)

Summary

Phase 1 of Python integration: a pure-Python client package (mlxcel) under a new top-level python/ directory that drives the existing OpenAI-compatible mlxcel serve server. It spawns and supervises a local server process (managed mode) or connects to a running one (connect mode), auto-discovers the served model id, and exposes the raw openai client as an escape hatch for the full API surface. There are zero changes to the Rust inference core; this is Python, a CI workflow, and docs only.

What changed

  • python/src/mlxcel/_server.py: ManagedServer does binary discovery (binary= / MLXCEL_BIN / PATH), transport selection (Unix domain socket by default on POSIX with a short sun_path under /tmp, TCP elsewhere or when host=/port= is given), subprocess spawn, authoritative /health readiness polling with backoff and child-process liveness checks (so a first-run weight download does not fail spuriously and an early exit raises MlxcelServerError with the captured stderr tail), stderr forwarding to the mlxcel.server logger on a daemon thread, and graceful SIGTERM-then-SIGKILL shutdown with socket cleanup, atexit, and a __del__ finalizer.
  • python/src/mlxcel/_client.py (sync LLM) and python/src/mlxcel/_async_client.py (AsyncLLM): wrap the OpenAI SDK over a single TCP-or-UDS httpx transport path with explicit timeouts (required when injecting a custom http_client). Shared mode selection, base-URL normalization, and message-type narrowing live in python/src/mlxcel/_common.py. Public methods: generate, stream, chat, chat_stream, models, tokenize, detokenize, plus model and openai_client properties and close(); tokenize/detokenize call the native /tokenize and /detokenize routes (no /v1 prefix) through the underlying httpx client. The model id is discovered once from /v1/models and cached. Mode rules: a model selects managed mode (with socket= as the optional bind path), base_url= or socket= without a model selects connect mode, and a model plus a base_url/transport connect target is an error.
  • python/src/mlxcel/_sampling.py: maps Python kwargs to OpenAI request fields and routes server-specific knobs (top_k, min_p, repetition_penalty, DRY) and any unknown keys through extra_body, with response_format passthrough; a caller-supplied extra_body= wins on conflict.
  • python/src/mlxcel/errors.py: MlxcelError, MlxcelServerError (carries the stderr tail), MlxcelTimeoutError. HTTP and API errors propagate as native openai SDK exceptions rather than being hidden.
  • python/pyproject.toml: hatchling backend, src layout, requires-python >=3.9, deps openai>=1.40 and httpx>=0.27, a dev extra with pytest/ruff/mypy, ruff/mypy config, and the e2e pytest marker. Ships py.typed. Distribution and import name mlxcel, version 0.1.0.
  • python/tests/: fake_server.py is a stdlib-only HTTP server (binds UDS or TCP, serves /health 503-then-200, /v1/models, /v1/completions and /v1/chat/completions including SSE streaming variants, /tokenize, /detokenize). test_client_mock.py uses httpx.MockTransport (no real server) to assert generate/stream/chat/tokenize behavior, sampling mapping, model auto-discovery, and error mapping. test_lifecycle.py spawns fake_server.py via ManagedServer and exercises discovery, spawn, health-poll, ready, log capture, graceful shutdown, the early-exit failure path, and connect-mode-to-a-running-UDS-server. test_e2e.py is marked @pytest.mark.e2e and skipped unless MLXCEL_BIN is set.
  • python/examples/: quickstart.py, streaming.py, structured_output.py. python/README.md: install and usage. python/.gitignore: venv, caches, build artifacts.
  • .github/workflows/python.yml: runs ruff check, ruff format --check, mypy, and pytest (unit + lifecycle; e2e skipped) on ubuntu-latest, triggered only on python/** and the workflow file, independent of the Rust CI.
  • Docs: new docs/python-client.md (both modes, streaming, chat, structured output, the openai_client escape hatch, async usage, troubleshooting incl. the socket-path-length note), linked from docs/README.md and the root README.md, with an English nav entry in mkdocs.yml. The Korean nav and translation are left to the finalizer.
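
The mode rules listed for the clients can be sketched as a small pure function. This is an illustration only, not the code in `_common.py`: the name `select_mode`, the `Mode` enum, and the use of `ValueError` are assumptions made for the example.

```python
from __future__ import annotations

from enum import Enum
from typing import Optional


class Mode(Enum):
    MANAGED = "managed"
    CONNECT = "connect"


def select_mode(
    model: Optional[str] = None,
    base_url: Optional[str] = None,
    socket: Optional[str] = None,
) -> Mode:
    """Hypothetical helper mirroring the mode rules described above."""
    if model is not None:
        if base_url is not None:
            # A model (managed mode) plus a connect target is ambiguous.
            raise ValueError("pass either a model or base_url=, not both")
        # In managed mode, socket= is only the optional bind path.
        return Mode.MANAGED
    if base_url is not None or socket is not None:
        # No model: connect to an already running server.
        return Mode.CONNECT
    raise ValueError("pass a model, base_url=, or socket=")
```

With this sketch, `select_mode(model="m", socket="/tmp/s.sock")` is managed mode, while `select_mode(socket="/tmp/s.sock")` is connect mode.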

Test plan

Verified in a clean venv created outside the repo (/tmp/mlxcel-venv) with pip install -e python[dev]. The Rust binary was not built and the e2e test was not run (this host is Linux + CUDA; mlxcel targets Apple Silicon), so the real-binary E2E path stays marked @pytest.mark.e2e and skipped.

  • ruff check python -> All checks passed
  • ruff format --check python -> 14 files already formatted
  • mypy python/src -> Success: no issues found in 7 source files
  • pytest python/tests -m "not e2e" -> 29 passed, 2 deselected
  • import mlxcel works; mlxcel.__version__ == "0.1.0"; LLM/AsyncLLM/error types exported
  • No .rs or Cargo.* files changed

Closes #407

Add a pure-Python client package under python/ that drives the existing OpenAI-compatible mlxcel server. It either spawns and supervises a local `mlxcel serve` process (managed mode) or connects to a running one (connect mode), auto-discovers the served model id from /v1/models, and exposes the raw openai client as an escape hatch. No changes to the Rust inference core.

Package contents:

- src/mlxcel/_server.py: ManagedServer handles binary discovery (binary= / MLXCEL_BIN / PATH), transport selection (Unix socket default on POSIX with a short sun_path under /tmp, TCP elsewhere or on request), subprocess spawn, /health readiness polling with backoff and child-liveness checks, stderr forwarding to the mlxcel.server logger, and graceful SIGTERM-then-SIGKILL shutdown with socket cleanup, atexit, and a finalizer.
- src/mlxcel/_client.py and src/mlxcel/_async_client.py: the synchronous LLM and asynchronous AsyncLLM, each wrapping the OpenAI SDK over a TCP or UDS httpx transport with explicit timeouts. Shared mode-selection, base-URL handling, and message-type narrowing live in src/mlxcel/_common.py. Methods: generate, stream, chat, chat_stream, models, tokenize, detokenize, plus model and openai_client properties and close(). tokenize/detokenize call the native /tokenize and /detokenize routes through the underlying httpx client.
- src/mlxcel/_sampling.py: maps Python kwargs to OpenAI request fields and routes server-specific knobs (top_k, min_p, repetition_penalty, DRY) and unknown keys through extra_body, with response_format passthrough.
- src/mlxcel/errors.py: MlxcelError, MlxcelServerError (carries stderr tail), MlxcelTimeoutError. HTTP and API errors propagate as native openai exceptions.
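
The readiness loop described for ManagedServer (poll /health with backoff, but check child liveness on every iteration) can be sketched with the standard library alone. The names `wait_until_ready` and `ServerExited`, the delay values, and the injected `probe` callable are assumptions for illustration; the real code probes /health over httpx and raises MlxcelServerError with the stderr tail.

```python
from __future__ import annotations

import subprocess
import sys
import time
from typing import Callable


class ServerExited(RuntimeError):
    """Stand-in for MlxcelServerError in this sketch."""


def wait_until_ready(
    proc: subprocess.Popen,
    probe: Callable[[], bool],
    timeout: float = 60.0,
    initial_delay: float = 0.05,
    max_delay: float = 1.0,
) -> None:
    """Block until probe() succeeds, the child exits, or the timeout elapses."""
    deadline = time.monotonic() + timeout
    delay = initial_delay
    while True:
        # Liveness first: an early exit should fail fast, not wait for the timeout.
        code = proc.poll()
        if code is not None:
            raise ServerExited(f"server exited early with code {code}")
        if probe():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError("server did not become ready in time")
        time.sleep(delay)
        # Exponential backoff, capped so a slow weight download is still polled.
        delay = min(delay * 2, max_delay)


if __name__ == "__main__":
    # A child that exits immediately stands in for a server that failed to start.
    child = subprocess.Popen([sys.executable, "-c", "raise SystemExit(3)"])
    try:
        wait_until_ready(child, probe=lambda: False, timeout=10.0)
    except ServerExited as exc:
        print(exc)
    finally:
        child.wait()
```

Checking `proc.poll()` before each probe is what lets a long first-run download keep polling while a crashed child is reported immediately.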

Tests, CI, and docs:

- tests/: stdlib-only fake_server.py (UDS or TCP, /health 503-then-200, canned /v1/* incl. SSE, /tokenize, /detokenize); test_client_mock.py uses httpx.MockTransport; test_lifecycle.py spawns the fake server via ManagedServer; test_e2e.py is marked e2e and skipped unless MLXCEL_BIN is set.
- .github/workflows/python.yml runs ruff, ruff format --check, mypy, and pytest (unit + lifecycle) on ubuntu-latest, independent of the Rust CI and triggered only on python/** changes.
- docs/python-client.md documents both modes, streaming, chat, structured output, the openai_client escape hatch, async usage, and the socket-path-length note; linked from README.md, docs/README.md, and the mkdocs nav.
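
The "/health 503-then-200" behavior of the fake server can be approximated with a stdlib handler like the one below. This is a TCP-only sketch, not the actual fake_server.py: the JSON bodies, the `fake-model` id, and the helper names `start_fake_server` and `get` are assumptions.

```python
from __future__ import annotations

import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Tuple


class FakeHandler(BaseHTTPRequestHandler):
    # Number of /health requests that still answer 503 before turning healthy.
    health_failures_left = 1

    def do_GET(self) -> None:
        if self.path == "/health":
            cls = type(self)
            if cls.health_failures_left > 0:
                cls.health_failures_left -= 1
                self._send(503, {"status": "loading"})  # body shape is assumed
            else:
                self._send(200, {"status": "ok"})
        elif self.path == "/v1/models":
            self._send(200, {"object": "list", "data": [{"id": "fake-model"}]})
        else:
            self._send(404, {"error": "not found"})

    def _send(self, status: int, payload: dict) -> None:
        body = json.dumps(payload).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args: object) -> None:
        pass  # keep test output quiet


def start_fake_server(failures: int = 1) -> Tuple[ThreadingHTTPServer, int]:
    """Serve on an ephemeral localhost port in a daemon thread."""
    # A fresh subclass gives each server its own failure counter.
    handler = type("Handler", (FakeHandler,), {"health_failures_left": failures})
    server = ThreadingHTTPServer(("127.0.0.1", 0), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, server.server_address[1]


def get(port: int, path: str) -> Tuple[int, bytes]:
    """Issue one GET and return (status, body)."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.status, resp.read()
    finally:
        conn.close()
```

A lifecycle test can then assert that the first `get(port, "/health")` returns 503 and the next returns 200, which exercises the polling path without a real model.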

inureyes added the status:review, type:enhancement, and priority:high labels on Jun 23, 2026
@inureyes (Member, Author)

Implementation Review Summary

Intent

Phase 1 pure-Python mlxcel client over the existing OpenAI-compatible server: managed (spawn mlxcel serve) and connect modes, UDS/TCP transport, model auto-discovery, openai_client escape hatch. Zero Rust changes.

Findings Addressed (auto-fixed on this branch, not yet committed)

  • Native /tokenize and /detokenize sent no Authorization header, so they returned 401 whenever api_key was set. Now attach Bearer <key> on the raw httpx requests in both LLM and AsyncLLM, and omit the header entirely when no key is configured (no-auth path preserved). (HIGH)
  • response_format lived in the shared OpenAI-field list, so generate(..., response_format=...) raised TypeError (the SDK's completions.create rejects it as a top-level field). build_params is now endpoint-aware: chat keeps it top-level, and completions routes it through extra_body. (HIGH)
  • The API key was passed to the child via --api-key <secret> on argv (visible in ps / /proc/<pid>/cmdline, and logged at DEBUG). It is now passed through the LLAMA_API_KEY environment variable, and the launch log no longer carries the secret. (HIGH, security)
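
The endpoint-aware routing from the second finding can be sketched as follows. This is not the actual build_params: the signature, the `endpoint` strings, and the `OPENAI_FIELDS` subset are assumptions chosen to show the rule that chat keeps response_format top-level, completions moves it into extra_body, unknown keys go to extra_body, and a caller-supplied extra_body wins on conflict.

```python
from __future__ import annotations

from typing import Any, Dict, Optional

# Assumed subset of fields that both endpoints accept as top-level arguments.
OPENAI_FIELDS = {"temperature", "top_p", "max_tokens", "stop", "seed"}


def build_params(
    endpoint: str,
    extra_body: Optional[Dict[str, Any]] = None,
    **kwargs: Any,
) -> Dict[str, Any]:
    """Split kwargs into top-level request fields and extra_body (sketch)."""
    if endpoint not in ("chat", "completions"):
        raise ValueError(f"unknown endpoint: {endpoint!r}")
    params: Dict[str, Any] = {}
    extra: Dict[str, Any] = {}
    for key, value in kwargs.items():
        if key in OPENAI_FIELDS:
            params[key] = value
        elif key == "response_format" and endpoint == "chat":
            # Chat accepts response_format as a top-level field.
            params[key] = value
        else:
            # Server-specific knobs, unknown keys, and response_format on the
            # plain completions endpoint all travel in extra_body.
            extra[key] = value
    if extra_body:
        extra.update(extra_body)  # caller-supplied values win on conflict
    if extra:
        params["extra_body"] = extra
    return params
```

For example, `build_params("completions", top_k=40, response_format={"type": "json_object"})` puts both keys under `extra_body`, so nothing unexpected reaches the SDK call as a keyword argument.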

Remaining Items (report only, no code change)

  • mkdocs nav adds user-guide/python-client.md, but with docs_dir: docs/en that resolves to docs/en/user-guide/python-client.md, which has no matching source file. The whole docs/en/ tree is untracked in git (every existing nav sibling is too, maintained out-of-band), and there is no docs CI or --strict build, so no automated gate breaks. The tracked GitHub-facing page docs/python-client.md and its README.md / docs/README.md links are correct. Finalizer should sync the page into docs/en/user-guide/ alongside the Korean nav/translation. (LOW)
  • _server._probe_once catches ConnectError/ConnectTimeout/ReadError but not RemoteProtocolError/ReadTimeout; bounded by the 5s probe timeout and the per-iteration child-liveness check, so impact is minimal. Optional hardening. (LOW)

Verification

  • All stated requirements implemented (sync + async, all listed methods, sampling + extra_body + response_format, error types, mock/lifecycle/e2e tests, CI workflow, docs, examples, py.typed, .gitignore)
  • No placeholder/mock/orphaned code; every module imported and wired through __init__
  • Integrated into the package code flow (clients use _server/_common/_sampling/errors)
  • Project conventions followed (3.9-compatible typing, Google docstrings, ruff/mypy strict clean)
  • Existing modules reused (_common shared by both clients; OpenAI SDK + httpx, no reinvention)
  • No unintended structural changes; zero .rs / Cargo.* changes
  • Tests pass: ruff check clean, ruff format --check clean, mypy python/src clean, pytest -m "not e2e" 37 passed / 2 deselected (e2e correctly gated)

Fixes are staged on feature/issue-407-python-client and not committed; commit/push left to the maintainer per review policy.

inureyes added 3 commits June 24, 2026 06:33
Pass the server API key through the LLAMA_API_KEY environment variable instead of argv so it is not exposed via ps or /proc/<pid>/cmdline, and attach a Bearer header on the native /tokenize and /detokenize posts that bypass the OpenAI SDK auth injection. Route response_format through extra_body for the plain completions endpoint (the SDK rejects it as a top-level field there) while keeping it top-level for chat. Add regression tests for auth-header presence and absence, response_format routing, and the API key staying out of argv.
Wrap the resource-creating section of LLM.__init__ and AsyncLLM.__init__ so a failure after the http client is built (or the managed subprocess is spawned), for example an empty /v1/models discovery response, deterministically tears those resources down instead of leaking them until garbage collection. The sync client reuses its idempotent close(); the async client does synchronous cleanup because it cannot await in __init__. Add a best-effort __del__ to AsyncLLM mirroring the sync client so a never-awaited close() still stops the managed server and drops the async http client pool. await close() and async-with remain the correct API.
…y failure

AsyncLLM.__init__ called is_managed() before self._closed = False, so __del__ raised AttributeError when the constructor failed at argument validation (ambiguous-args path). Move _closed initialization first.

Add tests for async chat_stream, models(), openai_client escape hatch, model property before resolution, and mode-validation errors on AsyncLLM.
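
The first commit above moves the API key off argv and adds a Bearer header on the native routes. A minimal sketch of both ideas follows; `build_launch` and `auth_headers` are hypothetical helper names, and the argv layout beyond `serve` is deliberately left to the caller because the exact CLI flags are not shown here. Only the LLAMA_API_KEY variable name comes from the PR.

```python
from __future__ import annotations

import os
from typing import Dict, List, Mapping, Optional, Sequence, Tuple


def build_launch(
    binary: str,
    serve_args: Sequence[str] = (),
    api_key: Optional[str] = None,
    base_env: Optional[Mapping[str, str]] = None,
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for subprocess.Popen with the secret kept out of argv."""
    argv = [binary, "serve", *serve_args]
    env = dict(os.environ if base_env is None else base_env)
    if api_key is not None:
        # The environment is not visible through ps or /proc/<pid>/cmdline.
        env["LLAMA_API_KEY"] = api_key
    return argv, env


def auth_headers(api_key: Optional[str]) -> Dict[str, str]:
    """Headers for raw httpx calls that bypass the OpenAI SDK's auth injection."""
    if not api_key:
        return {}  # no-auth path: send no Authorization header at all
    return {"Authorization": f"Bearer {api_key}"}
```

Because argv no longer carries the secret, logging `argv` at DEBUG is safe; the `env` mapping should still be kept out of logs.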
@inureyes (Member, Author)

PR Finalization Complete

Summary

Tests: Added 6 tests covering previously untested async paths:

  • test_async_chat_stream (async chat_stream was the only unexercised generation method)
  • test_async_models (async models() list)
  • test_async_openai_client_escape_hatch (async openai_client property type check)
  • test_async_model_property_raises_before_resolution (AsyncLLM.model before any request)
  • test_async_ambiguous_args_is_error and test_async_no_args_is_error (mode validation on AsyncLLM)

Bug fix: AsyncLLM.__init__ called is_managed() before self._closed = False, so __del__ raised AttributeError when the constructor failed at argument validation. Moved _closed initialization to the top of __init__. The new mode-validation tests caught this and now pass cleanly.
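
The ordering bug and the constructor cleanup from the earlier commit can be shown together in a toy class. `SketchClient`, its `discovered` parameter, and the `RELEASED` list are invented for the demonstration; the point is only that teardown state is assigned before anything that can raise, and that a failure after resources exist triggers an idempotent close().

```python
from __future__ import annotations

from typing import List, Optional

RELEASED: List[str] = []  # records teardown order for the demonstration


class SketchClient:
    def __init__(
        self,
        model: Optional[str] = None,
        base_url: Optional[str] = None,
        *,
        discovered: Optional[str] = "demo-model",
    ) -> None:
        # Teardown state comes first so __del__ is safe even if validation raises.
        self._closed = False
        self._resources: List[str] = []
        if (model is None) == (base_url is None):
            raise ValueError("pass exactly one of model= or base_url=")
        try:
            self._resources.append("http-client")
            if model is not None:
                self._resources.append("managed-server")
            if not discovered:
                raise RuntimeError("empty /v1/models response")
            self.model = discovered
        except BaseException:
            # Release what was created instead of waiting for garbage collection.
            self.close()
            raise

    def close(self) -> None:
        if self._closed:
            return  # idempotent
        self._closed = True
        while self._resources:
            RELEASED.append(self._resources.pop())

    def __del__(self) -> None:
        try:
            self.close()
        except Exception:
            pass  # best effort only; explicit close() remains the real API
```

If `self._closed = False` were moved below the validation check, the `ValueError` path would leave a half-built object whose `__del__` hits `AttributeError`, which is the failure the new mode-validation tests exposed.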

Docs (GitHub-facing): Added a "Security: multi-user hosts" section to docs/python-client.md between the Connect Mode and Streaming sections, recommending an explicit socket= path under $XDG_RUNTIME_DIR on shared machines.
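
One way to follow that recommendation is to derive the socket= path from $XDG_RUNTIME_DIR and check it against the sun_path limit before use. The helper below is a sketch: `pick_socket_path`, the file-name scheme, and the /tmp fallback are assumptions, and the limit uses the smaller macOS value (104 bytes, versus 108 on Linux) to stay portable.

```python
from __future__ import annotations

import os
import uuid
from typing import Mapping, Optional

# sockaddr_un.sun_path is 108 bytes on Linux and 104 on macOS, including the
# terminating NUL, so the conservative bound is used here.
SUN_PATH_LIMIT = 104


def pick_socket_path(
    env: Optional[Mapping[str, str]] = None,
    name: Optional[str] = None,
) -> str:
    """Prefer the per-user runtime dir over a shared /tmp (hypothetical helper)."""
    env = os.environ if env is None else env
    name = name or f"mlxcel-{uuid.uuid4().hex[:8]}.sock"
    base = env.get("XDG_RUNTIME_DIR") or "/tmp"
    path = os.path.join(base, name)
    if len(os.fsencode(path)) >= SUN_PATH_LIMIT:
        raise ValueError(f"socket path too long for sun_path: {path!r}")
    return path
```

The resulting string would then be passed as `socket=` to the client; $XDG_RUNTIME_DIR is normally a per-user directory with restrictive permissions, which is the property the docs section relies on.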

Docs (MkDocs): Created docs/en/user-guide/python-client.md and docs/ko/user-guide/python-client.md in the internal docs tree (not tracked in the public repo, matching the established pattern). Added the Korean nav entry Python 클라이언트: user-guide/python-client.md to mkdocs.ko.yml.

Lint/Format: All checks pass. No Rust files touched.

Final gate results

ruff check python    -> All checks passed
ruff format --check  -> 14 files already formatted
mypy python/src      -> Success: no issues found in 7 source files
pytest (not e2e)     -> 43 passed, 2 deselected

Ready for merge.

inureyes added the status:done label and removed the status:review label on Jun 23, 2026
inureyes merged commit 8e5d426 into main on Jun 23, 2026 (6 checks passed)
inureyes deleted the feature/issue-407-python-client branch on June 23, 2026 22:04

Linked issue: feat: Phase 1 Python integration via a thin mlxcel client package over the OpenAI-compatible server