feat: add Phase 1 Python client package over the server #411
Add a pure-Python client package under python/ that drives the existing OpenAI-compatible mlxcel server. It either spawns and supervises a local `mlxcel serve` process (managed mode) or connects to a running one (connect mode), auto-discovers the served model id from /v1/models, and exposes the raw openai client as an escape hatch. No changes to the Rust inference core.

Package contents:

- src/mlxcel/_server.py: ManagedServer handles binary discovery (binary= / MLXCEL_BIN / PATH), transport selection (Unix socket default on POSIX with a short sun_path under /tmp, TCP elsewhere or on request), subprocess spawn, /health readiness polling with backoff and child-liveness checks, stderr forwarding to the mlxcel.server logger, and graceful SIGTERM-then-SIGKILL shutdown with socket cleanup, atexit, and a finalizer.
- src/mlxcel/_client.py and src/mlxcel/_async_client.py: the synchronous LLM and asynchronous AsyncLLM, each wrapping the OpenAI SDK over a TCP or UDS httpx transport with explicit timeouts. Shared mode-selection, base-URL handling, and message-type narrowing live in src/mlxcel/_common.py. Methods: generate, stream, chat, chat_stream, models, tokenize, detokenize, plus model and openai_client properties and close(). tokenize/detokenize call the native /tokenize and /detokenize routes through the underlying httpx client.
- src/mlxcel/_sampling.py: maps Python kwargs to OpenAI request fields and routes server-specific knobs (top_k, min_p, repetition_penalty, DRY) and unknown keys through extra_body, with response_format passthrough.
- src/mlxcel/errors.py: MlxcelError, MlxcelServerError (carries stderr tail), MlxcelTimeoutError. HTTP and API errors propagate as native openai exceptions.

Tests, CI, and docs:

- tests/: stdlib-only fake_server.py (UDS or TCP, /health 503-then-200, canned /v1/* incl. SSE, /tokenize, /detokenize); test_client_mock.py uses httpx.MockTransport; test_lifecycle.py spawns the fake server via ManagedServer; test_e2e.py is marked e2e and skipped unless MLXCEL_BIN is set.
- .github/workflows/python.yml runs ruff, ruff format --check, mypy, and pytest (unit + lifecycle) on ubuntu-latest, independent of the Rust CI and triggered only on python/** changes.
- docs/python-client.md documents both modes, streaming, chat, structured output, the openai_client escape hatch, async usage, and the socket-path-length note; linked from README.md, docs/README.md, and the mkdocs nav.
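The kwarg routing described for `_sampling.py` can be illustrated with a minimal sketch. This is not the package's actual code: the function name `build_request_kwargs` and the exact contents of `STANDARD_FIELDS` are assumptions made for illustration. It only shows the idea of passing known OpenAI fields at the top level and sending everything else through `extra_body`, with a caller-supplied `extra_body` taking precedence as stated in the PR description.

```python
from __future__ import annotations

from typing import Any, Dict

# Assumed set of fields passed as top-level request arguments (illustrative only).
STANDARD_FIELDS = {
    "temperature",
    "top_p",
    "max_tokens",
    "stop",
    "seed",
    "presence_penalty",
    "frequency_penalty",
}


def build_request_kwargs(**kwargs: Any) -> Dict[str, Any]:
    """Split kwargs into top-level fields and an extra_body dict (hypothetical helper)."""
    caller_extra = kwargs.pop("extra_body", None) or {}
    request: Dict[str, Any] = {}
    extra: Dict[str, Any] = {}
    for key, value in kwargs.items():
        if key in STANDARD_FIELDS:
            request[key] = value
        else:
            # Server-specific knobs (top_k, min_p, ...) and unknown keys go here.
            extra[key] = value
    # A caller-supplied extra_body wins on conflict.
    extra.update(caller_extra)
    if extra:
        request["extra_body"] = extra
    return request
```

For example, `build_request_kwargs(temperature=0.2, top_k=40)` keeps `temperature` at the top level and places `top_k` under `extra_body`.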
Implementation Review Summary
Intent
Findings Addressed (auto-fixed on this branch, not yet committed)
Remaining Items (report only, no code change)
Verification
Fixes are staged on |
Pass the server API key through the LLAMA_API_KEY environment variable instead of argv so it is not exposed via ps or /proc/<pid>/cmdline, and attach a Bearer header on the native /tokenize and /detokenize posts that bypass the OpenAI SDK auth injection. Route response_format through extra_body for the plain completions endpoint (the SDK rejects it as a top-level field there) while keeping it top-level for chat. Add regression tests for auth-header presence and absence, response_format routing, and the API key staying out of argv.
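A minimal sketch of the two ideas in this commit, under stated assumptions: the child process below is a stand-in Python one-liner rather than the real server, `--example-flag` is a placeholder argument, and the helper names `spawn_with_key` and `auth_headers` are hypothetical. It shows a secret travelling through the `LLAMA_API_KEY` environment variable instead of argv, and a Bearer header built only when a key is present.

```python
from __future__ import annotations

import os
import subprocess
import sys
from typing import Dict, Optional

# Stand-in child: prints the arguments it received, then the key it sees in its environment.
CHILD = "import os, sys; print(sys.argv[1:]); print(os.environ.get('LLAMA_API_KEY'))"


def spawn_with_key(api_key: str) -> str:
    """Run the stand-in child with the key in its environment, never in argv."""
    env = dict(os.environ)
    env["LLAMA_API_KEY"] = api_key
    result = subprocess.run(
        [sys.executable, "-c", CHILD, "--example-flag"],  # no secret on the command line
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def auth_headers(api_key: Optional[str]) -> Dict[str, str]:
    """Headers for requests that bypass the SDK's own auth injection."""
    return {"Authorization": f"Bearer {api_key}"} if api_key else {}
```

The child still receives the key, but a process listing of its command line shows only `--example-flag`.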
Wrap the resource-creating section of LLM.__init__ and AsyncLLM.__init__ so a failure after the http client is built (or the managed subprocess is spawned), for example an empty /v1/models discovery response, deterministically tears those resources down instead of leaking them until garbage collection. The sync client reuses its idempotent close(); the async client does synchronous cleanup because it cannot await in __init__. Add a best-effort __del__ to AsyncLLM mirroring the sync client so a never-awaited close() still stops the managed server and drops the async http client pool. await close() and async-with remain the correct API.
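The constructor-cleanup pattern described above might look roughly like the following sketch. The `Resource` and `ClientSketch` classes are hypothetical stand-ins (the real `LLM` internals are not shown in this PR text); the point is only that a failure after resources exist triggers the same idempotent `close()` before the exception propagates.

```python
from __future__ import annotations

from typing import List, Optional


class Resource:
    """Stand-in for the http client or the managed subprocess."""

    def __init__(self, name: str, log: List[str]) -> None:
        self.name = name
        self.log = log
        log.append(f"open {name}")

    def close(self) -> None:
        self.log.append(f"close {self.name}")


class ClientSketch:
    def __init__(self, log: List[str], fail_discovery: bool = False) -> None:
        self._closed = False
        self._server: Optional[Resource] = None
        self._http: Optional[Resource] = None
        try:
            self._server = Resource("server", log)
            self._http = Resource("http", log)
            if fail_discovery:
                # Simulates e.g. an empty /v1/models discovery response.
                raise RuntimeError("model discovery returned no models")
        except BaseException:
            self.close()  # tear down whatever was created, then re-raise
            raise

    def close(self) -> None:
        if self._closed:  # idempotent: a second call is a no-op
            return
        self._closed = True
        for res in (self._http, self._server):
            if res is not None:
                res.close()
```

Closing in reverse creation order is a choice made for this sketch, not something the commit message specifies.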
…y failure

AsyncLLM.__init__ called is_managed() before self._closed = False, so __del__ raised AttributeError when the constructor failed at argument validation (ambiguous-args path). Move _closed initialization first. Add tests for async chat_stream, models(), openai_client escape hatch, model property before resolution, and mode-validation errors on AsyncLLM.
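The ordering issue can be reproduced in isolation. The class below is a hypothetical stand-in, not the real `AsyncLLM`: it sets `_closed` before any validation that can raise, so the finalizer can safely run on a half-constructed object.

```python
from __future__ import annotations

from typing import List, Optional


class AsyncLLMSketch:
    released: List[str] = []  # records cleanup for demonstration purposes

    def __init__(self, model: Optional[str] = None, base_url: Optional[str] = None) -> None:
        # Initialize the flag first so __del__ works even if validation fails below.
        self._closed = False
        if model is not None and base_url is not None:
            raise ValueError("pass either model= or base_url=, not both")
        self._resource = "http-pool"

    def _cleanup(self) -> None:
        if self._closed:
            return
        self._closed = True
        # The resource may not exist if the constructor failed early.
        resource = getattr(self, "_resource", None)
        if resource is not None:
            AsyncLLMSketch.released.append(resource)

    def __del__(self) -> None:
        # Best-effort synchronous cleanup; an explicit close remains the intended path.
        self._cleanup()
```

If `self._closed = False` were placed after the validation, the finalizer would hit `AttributeError` on the failed-construction path, which is the bug this commit describes.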
PR Finalization Complete
Summary
Tests: Added 7 tests covering previously untested async paths:
Bug fix:
Docs (GitHub-facing): Added a "Security: multi-user hosts" section to
Docs (MkDocs): Created
Lint/Format: All checks pass. No Rust files touched.
Final gate results
Ready for merge.
Summary
Phase 1 of Python integration: a pure-Python client package (`mlxcel`) under a new top-level `python/` directory that drives the existing OpenAI-compatible `mlxcel serve` server. It spawns and supervises a local server process (managed mode) or connects to a running one (connect mode), auto-discovers the served model id, and exposes the raw `openai` client as an escape hatch for the full API surface. There are zero changes to the Rust inference core; this is Python, a CI workflow, and docs only.

What changed
- `python/src/mlxcel/_server.py`: `ManagedServer` does binary discovery (`binary=` / `MLXCEL_BIN` / `PATH`), transport selection (Unix domain socket by default on POSIX with a short `sun_path` under `/tmp`, TCP elsewhere or when `host=`/`port=` is given), subprocess spawn, authoritative `/health` readiness polling with backoff and child-process liveness checks (so a first-run weight download does not fail spuriously and an early exit raises `MlxcelServerError` with the captured stderr tail), stderr forwarding to the `mlxcel.server` logger on a daemon thread, and graceful SIGTERM-then-SIGKILL shutdown with socket cleanup, `atexit`, and a `__del__` finalizer.
- `python/src/mlxcel/_client.py` (sync `LLM`) and `python/src/mlxcel/_async_client.py` (`AsyncLLM`): wrap the OpenAI SDK over a single TCP-or-UDS httpx transport path with explicit timeouts (required when injecting a custom `http_client`). Shared mode selection, base-URL normalization, and message-type narrowing live in `python/src/mlxcel/_common.py`. Public methods: `generate`, `stream`, `chat`, `chat_stream`, `models`, `tokenize`, `detokenize`, plus `model` and `openai_client` properties and `close()`; `tokenize`/`detokenize` call the native `/tokenize` and `/detokenize` routes (no `/v1` prefix) through the underlying httpx client. The model id is discovered once from `/v1/models` and cached. Mode rules: a `model` selects managed mode (with `socket=` as the optional bind path), `base_url=` or `socket=` without a model selects connect mode, and a `model` plus a `base_url`/`transport` connect target is an error.
- `python/src/mlxcel/_sampling.py`: maps Python kwargs to OpenAI request fields and routes server-specific knobs (`top_k`, `min_p`, `repetition_penalty`, DRY) and any unknown keys through `extra_body`, with `response_format` passthrough; a caller-supplied `extra_body=` wins on conflict.
- `python/src/mlxcel/errors.py`: `MlxcelError`, `MlxcelServerError` (carries the stderr tail), `MlxcelTimeoutError`. HTTP and API errors propagate as native `openai` SDK exceptions rather than being hidden.
- `python/pyproject.toml`: hatchling backend, src layout, `requires-python >=3.9`, deps `openai>=1.40` and `httpx>=0.27`, a `dev` extra with pytest/ruff/mypy, `ruff`/`mypy` config, and the `e2e` pytest marker. Ships `py.typed`. Distribution and import name `mlxcel`, version `0.1.0`.
- `python/tests/`: `fake_server.py` is a stdlib-only HTTP server (binds UDS or TCP, serves `/health` 503-then-200, `/v1/models`, `/v1/completions` and `/v1/chat/completions` including SSE streaming variants, `/tokenize`, `/detokenize`). `test_client_mock.py` uses `httpx.MockTransport` (no real server) to assert generate/stream/chat/tokenize behavior, sampling mapping, model auto-discovery, and error mapping. `test_lifecycle.py` spawns `fake_server.py` via `ManagedServer` and exercises discovery, spawn, health-poll, ready, log capture, graceful shutdown, the early-exit failure path, and connect-mode-to-a-running-UDS-server. `test_e2e.py` is marked `@pytest.mark.e2e` and skipped unless `MLXCEL_BIN` is set.
- `python/examples/`: `quickstart.py`, `streaming.py`, `structured_output.py`.
- `python/README.md`: install and usage.
- `python/.gitignore`: venv, caches, build artifacts.
- `.github/workflows/python.yml`: runs `ruff check`, `ruff format --check`, `mypy`, and `pytest` (unit + lifecycle; e2e skipped) on `ubuntu-latest`, triggered only on `python/**` and the workflow file, independent of the Rust CI.
- `docs/python-client.md` (both modes, streaming, chat, structured output, the `openai_client` escape hatch, async usage, troubleshooting incl. the socket-path-length note), linked from `docs/README.md` and the root `README.md`, with an English nav entry in `mkdocs.yml`. The Korean nav and translation are left to the finalizer.

Test plan
Verified in a clean venv created outside the repo (`/tmp/mlxcel-venv`) with `pip install -e python[dev]`. The Rust binary was not built and the e2e test was not run (this host is Linux + CUDA; mlxcel targets Apple Silicon), so the real-binary E2E path stays marked `@pytest.mark.e2e` and skipped.

- `ruff check python` -> All checks passed
- `ruff format --check python` -> 14 files already formatted
- `mypy python/src` -> Success: no issues found in 7 source files
- `pytest python/tests -m "not e2e"` -> 29 passed, 2 deselected
- `import mlxcel` works; `mlxcel.__version__ == "0.1.0"`; `LLM` / `AsyncLLM` / error types exported
- No `.rs` or `Cargo.*` files changed

Closes #407
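The mode rules listed under "What changed" can be restated as a small pure function. This is an illustrative sketch only: the name `select_mode`, the string return values, the use of `ValueError`, and the behavior when no argument is given are assumptions, since the PR text does not show the actual code in `_common.py`.

```python
from __future__ import annotations

from typing import Optional


def select_mode(
    model: Optional[str] = None,
    base_url: Optional[str] = None,
    socket: Optional[str] = None,
) -> str:
    """Return "managed" or "connect" following the rules in the PR description."""
    if model is not None:
        if base_url is not None:
            # A model plus a connect target is ambiguous and rejected.
            raise ValueError("pass either model= (managed) or base_url= (connect), not both")
        # With a model, socket= is only the optional bind path for the spawned server.
        return "managed"
    if base_url is not None or socket is not None:
        return "connect"
    # Assumed behavior: nothing to spawn and nothing to connect to.
    raise ValueError("pass model=, base_url=, or socket=")
```

Under these assumptions, `select_mode(model="some-model", socket="/tmp/x.sock")` is managed mode, while `select_mode(socket="/tmp/x.sock")` is connect mode.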