Optimize Docker Build Layers and Add Sudo Privileges for Fast-LLM Container#2
Conversation
…reserve compiled C++ artifacts, and add sudo for runtime adjustments
…reserve compiled C++ artifacts, and add sudo for runtime adjustments
|
|
||
| # Copy the main source code for Fast-LLM and install in editable mode | ||
| COPY --exclude=./fast_llm/csrc/ --chown=fast_llm ./fast_llm/ fast_llm/ | ||
| RUN PIP_NO_INPUT=1 pip3 install --no-deps -e . |
There was a problem hiding this comment.
Why the need for another install here? What was wrong with the previous version?
There was a problem hiding this comment.
This is the most important change: In the previous version, dependencies and code were installed at the same time. Now we first install the dependencies (see above) and only make the editable install at the very end. It's also ensured this way that pip can find and link all code in the fast-llm folder.
There was a problem hiding this comment.
Sure, but what's the difference in practice? All setuptools really does is add a symlink to the fast_llm directory so it shouldn't make any real difference
There was a problem hiding this comment.
ok, looks like you are right.
| COPY --chown=fast_llm fast_llm/csrc/ ./fast_llm/csrc/ | ||
| RUN make -C ./fast_llm/csrc/ | ||
| # Copy the dependency files and install dependencies | ||
| COPY --chown=fast_llm setup.py setup.cfg pyproject.toml ./ |
There was a problem hiding this comment.
Why adding the toml? It's not used for the installation.
There was a problem hiding this comment.
it appeared to me that the setup.* files and pyproject.toml fulfil similar purposes and could be grouped together. it is not uncommon to specify dependencies in the pyproject.toml file, because this is how setup tools usually works.
There was a problem hiding this comment.
We don't do that in fast-llm though, the toml file is just there because black needs it.
There was a problem hiding this comment.
ok, I'll split it up then and copy pyproject.toml somewhere else
There was a problem hiding this comment.
I don't really think it's worth copying at all... But if we do keep it let's keep it here so we don't add another line
|
|
||
| # Add a user for Fast-LLM with sudo privileges for runtime adjustments | ||
| ARG FAST_LLM_USER_ID=1000 | ||
| RUN useradd -m -u $FAST_LLM_USER_ID -s /bin/bash fast_llm \ |
Not sure I'm following here, was it not the case already? |
…-LLM into tscholak/tune_dockerfile
People were telling me it was not. I never checked those claims, I just reworked the Dockerfile such that it was clear and sure that we wouldn't always rebuild everything on small code changes. Looks like it wasn't truly necessary. I removed those changes. |
jlamypoirier
left a comment
There was a problem hiding this comment.
I double-checked the build times, it was ~100-500 ms before this PR even with code changes. I suspect the complains about rebuilding happened because of the multiple recent changes to dependencies, etc., that do force a re-install.
The rest of this PR looks useful though, I have one minor comment and then we can merge.
|
|
||
| # Copy the dependency files and install dependencies | ||
| COPY --chown=fast_llm setup.py setup.cfg pyproject.toml ./ | ||
| RUN PIP_NO_INPUT=1 pip3 install --no-cache-dir -e ".[CORE,OPTIONAL,DEV]" |
There was a problem hiding this comment.
Let's move this back above the crsc compile, since this installation is much longer.
- Add `_sdp_dim`/`_sdp_active` to `LanguageModelLoss.__init__` so GSPO's SDP branch doesn't AttributeError on the first non-test call. - Replace `document_index.max().item()` (and the SDP MAX all-reduce) with `len(kwargs[BlockKwargs.lengths])`: CPU-side, identical across SDP ranks, removes two GPU→CPU syncs per microbatch. - Decorate `fused_gspo_loss_forward_backward` with `@torch.compile` for parity with GRPO. The `num_segments == 1` test case skips on CPU since torch._inductor's CPU codegen mishandles `index_add_` into a size-1 buffer (atomic_add scatter). - Make `divisor` a required arg on `fused_gspo_loss_forward_backward`: the wrapper always overrides it with the global document count, and the previous local-rank default would silently mis-normalize under SDP. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Add `sp_group` arg to fused_gspo_loss_forward_backward and all-reduce the three segment buffers over it when sequence-parallel shards the sequence across the TP group; otherwise per-segment ratios use partial sums and produce silent corruption under SP. Wrapper passes `self._parallel_dim.group` when `_sequence_parallel` is active. - Wire `num_labels_in_seq` through the GSPO test and assert `new_logprobs_fused` against the reference. Required aligning the reference to use scaled logits for new_logprobs (reusing `target_log_probabilities`), matching the kernel's behavior of reporting the loss-path log-probs. - Drop the unreachable `max(num_segments, 1)` guard in the GSPO reference and the matching `divisor=max(num_segments, 1)` at the test call site. SDP all-reduce branch coverage (review item 3) deferred to a follow-up adding a `gspo_loss` flag to `tests/layers/test_lm_head.py` alongside the existing GRPO config, with an SDP distributed variant. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
I'd like to refine the
Dockerfileslightly to improve build efficiency and add runtime flexibility for the Fast-LLM container. The changes are small but impactful, focusing on two main improvements:setup.py,setup.cfg,pyproject.toml) is done first.--exclude=option enabled by Dockerfile syntax version1.7-labs.fast_llmuser. This addition allows system adjustments (e.g., modifying system limits or adjusting host settings) directly from within the container.ulimit) that do not persist across container restarts.Here's a breakdown of the build time: