Optimize Docker Build Layers and Add Sudo Privileges for Fast-LLM Container by tscholak · Pull Request #2 · ServiceNow/Fast-LLM

tscholak · 2024-10-16T14:12:10Z

I'd like to refine the Dockerfile slightly to improve build efficiency and add runtime flexibility for the Fast-LLM container. The changes are small but impactful, focusing on two main improvements:

Improved Build Layering for Faster Rebuilds:
- The build process is now split into two distinct stages:
  1. Dependency installation (based on setup.py, setup.cfg, pyproject.toml) is done first.
  2. Fast-LLM code installation is done last, using the COPY --exclude= option available with Dockerfile syntax version 1.7-labs.
- With this change the dependencies don't need to be reinstalled when the Fast-LLM source code changes. That can reduce rebuild times significantly since code changes land in different Docker image layers than dependencies.
Added Sudo Privileges for Fast-LLM User:
- Granted password-less sudo privileges to the fast_llm user. This allows system adjustments (e.g., modifying system limits or adjusting host settings) to be made directly from within the container.
- I found this very useful in bare Kubernetes environments (like LambdaLabs), where I frequently needed to change system settings (such as those controlled with ulimit) that do not persist across container restarts.

Here's a breakdown of the build time:

docker build --platform linux/amd64 -t torstenscholak663/fast-llm:latest --build-arg FAST_LLM_USER_ID=1000 .
[+] Building 48.3s (23/23) FINISHED  docker:desktop-linux
 => [internal] load build definition from Dockerfile  0.0s
 => => transferring dockerfile: 1.48kB  0.0s
 => resolve image config for docker-image://docker.io/docker/dockerfile:1.7-labs  0.7s
 => [auth] docker/dockerfile:pull token for registry-1.docker.io  0.0s
 => CACHED docker-image://docker.io/docker/dockerfile:1.7-labs@sha256:b99fecfe00268a8b556fad7d9c37ee25d716ae08a5d7320e6d51c4dd83246894  0.0s
 => [internal] load metadata for nvcr.io/nvidia/pytorch:24.07-py3  1.0s
 => [internal] load .dockerignore  0.0s
 => => transferring context: 163B  0.0s
 => [ 1/15] FROM nvcr.io/nvidia/pytorch:24.07-py3@sha256:f47441c102a810a27758b0b6274d46012ac15fd467119b2e1f0467be82bc8af3  0.0s
 => [internal] load build context  0.0s
 => => transferring context: 12.73kB  0.0s
 => CACHED [ 2/15] RUN apt-get update && apt-get install --no-install-recommends -y git-lfs sudo util-linux && rm -rf /var/lib/apt/lists/* && git lfs install  0.0s
 => CACHED [ 3/15] RUN useradd -m -u 1000 -s /bin/bash fast_llm && echo 'fast_llm ALL=(ALL) NOPASSWD: ALL' >> /etc/sudoers  0.0s
 => CACHED [ 4/15] WORKDIR /app  0.0s
 => [ 5/15] COPY --chown=fast_llm ./fast_llm/csrc/ fast_llm/csrc/  0.1s
 => [ 6/15] RUN make -C ./fast_llm/csrc/  4.9s
 => [ 7/15] COPY --chown=fast_llm setup.py setup.cfg ./  0.0s
 => [ 8/15] RUN PIP_NO_INPUT=1 pip3 install --no-cache-dir ".[CORE,OPTIONAL,DEV]"  35.7s
 => [ 9/15] COPY --chown=fast_llm ./Megatron-LM Megatron-LM  0.0s
 => [10/15] COPY --chown=fast_llm ./examples examples  0.0s
 => [11/15] COPY --chown=fast_llm ./tests tests  0.0s
 => [12/15] COPY --chown=fast_llm ./tools tools  0.0s
 => [13/15] COPY --exclude=./fast_llm/csrc/ --chown=fast_llm ./fast_llm/ fast_llm/  0.0s
 => [14/15] COPY --chown=fast_llm pyproject.toml ./  0.0s
 => [15/15] RUN PIP_NO_INPUT=1 pip3 install --no-deps -e .  4.6s
 => exporting to image  1.0s
 => => exporting layers  1.0s
 => => writing image sha256:f9b20cc3ca3c99ad8d3788cb6eacf5f48d518f48aec3bc3250f8d1d0d7cedeb3  0.0s
 => => naming to docker.io/torstenscholak663/fast-llm:latest  0.0s
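
For readers who want to see the layering in one place, here is a rough sketch of a Dockerfile that would produce the fifteen steps in the log above. It is reconstructed from the log lines only, so it is an approximation of the file at this point in the PR and not necessarily what was merged. The comments, and the reading of why --exclude is used, are my own assumptions.

```dockerfile
# syntax=docker/dockerfile:1.7-labs
# The labs syntax line above is what makes COPY --exclude available.
FROM nvcr.io/nvidia/pytorch:24.07-py3

# System packages, including sudo (package list as shown in the log).
RUN apt-get update \
    && apt-get install --no-install-recommends -y git-lfs sudo util-linux \
    && rm -rf /var/lib/apt/lists/* \
    && git lfs install

# Non-root user with password-less sudo for runtime adjustments.
ARG FAST_LLM_USER_ID=1000
RUN useradd -m -u $FAST_LLM_USER_ID -s /bin/bash fast_llm \
    && echo 'fast_llm ALL=(ALL) NOPASSWD: ALL' >> /etc/sudoers

WORKDIR /app

# Compile the C++ sources in their own layer.
COPY --chown=fast_llm ./fast_llm/csrc/ fast_llm/csrc/
RUN make -C ./fast_llm/csrc/

# Dependencies: only the packaging files are copied, so this layer is
# reused until setup.py or setup.cfg change.
COPY --chown=fast_llm setup.py setup.cfg ./
RUN PIP_NO_INPUT=1 pip3 install --no-cache-dir ".[CORE,OPTIONAL,DEV]"

# Remaining project directories.
COPY --chown=fast_llm ./Megatron-LM Megatron-LM
COPY --chown=fast_llm ./examples examples
COPY --chown=fast_llm ./tests tests
COPY --chown=fast_llm ./tools tools

# Source code last. csrc is excluded so the directory compiled earlier
# is not copied over again (assumed intent, based on the commit title).
COPY --exclude=./fast_llm/csrc/ --chown=fast_llm ./fast_llm/ fast_llm/
COPY --chown=fast_llm pyproject.toml ./
RUN PIP_NO_INPUT=1 pip3 install --no-deps -e .
```

With this ordering, an edit under fast_llm/ (outside csrc) only invalidates the last three steps, so the 35.7s dependency install in step 8 comes from cache.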

jlamypoirier · 2024-10-16T14:31:14Z

+
+# Copy the main source code for Fast-LLM and install in editable mode
+COPY --exclude=./fast_llm/csrc/ --chown=fast_llm ./fast_llm/ fast_llm/
+RUN PIP_NO_INPUT=1 pip3 install --no-deps -e .


Why the need for another install here? What was wrong with the previous version?

This is the most important change: in the previous version, dependencies and code were installed at the same time. Now we install the dependencies first (see above) and do the editable install only at the very end. This also ensures that pip can find and link all code in the fast-llm folder.

Sure, but what's the difference in practice? All setuptools really does is add a symlink to the fast_llm directory, so it shouldn't make any real difference.

OK, looks like you are right.

jlamypoirier · 2024-10-16T14:31:55Z

-COPY --chown=fast_llm fast_llm/csrc/ ./fast_llm/csrc/
-RUN make -C ./fast_llm/csrc/
+# Copy the dependency files and install dependencies
+COPY --chown=fast_llm setup.py setup.cfg pyproject.toml ./


Why add the toml? It's not used for the installation.

It appeared to me that the setup.* files and pyproject.toml fulfil similar purposes and could be grouped together. It is not uncommon to specify dependencies in the pyproject.toml file, because this is how setuptools usually works.

We don't do that in Fast-LLM though; the toml file is just there because black needs it.

OK, I'll split it up then and copy pyproject.toml somewhere else.

I don't really think it's worth copying at all... But if we do keep it, let's keep it here so we don't add another line.

jlamypoirier · 2024-10-16T14:33:58Z


+# Add a user for Fast-LLM with sudo privileges for runtime adjustments
 ARG FAST_LLM_USER_ID=1000
+RUN useradd -m -u $FAST_LLM_USER_ID -s /bin/bash fast_llm \


jlamypoirier · 2024-10-16T14:34:55Z

With this change the dependencies don't need to be reinstalled when the Fast-LLM source code changes. That can reduce rebuild times significantly since code changes land in different Docker image layers than dependencies.

Not sure I'm following here, was it not the case already?

tscholak · 2024-10-16T15:02:31Z

Not sure I'm following here, was it not the case already?

People were telling me it was not. I never checked those claims; I just reworked the Dockerfile so that it was clear we wouldn't rebuild everything on small code changes. Looks like it wasn't truly necessary. I removed those changes.

jlamypoirier

I double-checked the build times: it was ~100-500 ms before this PR, even with code changes. I suspect the complaints about rebuilding arose from the multiple recent changes to dependencies, etc., which do force a re-install.

The rest of this PR looks useful though, I have one minor comment and then we can merge.

jlamypoirier · 2024-10-17T12:55:23Z


+# Copy the dependency files and install dependencies
+COPY --chown=fast_llm setup.py setup.cfg pyproject.toml ./
+RUN PIP_NO_INPUT=1 pip3 install --no-cache-dir -e ".[CORE,OPTIONAL,DEV]"


Let's move this back above the csrc compile, since this installation takes much longer.
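
A sketch of the ordering this comment asks for, pieced together from the diff lines quoted in this review. The FROM and WORKDIR lines are taken from the build log earlier in the thread. It assumes the packaging files can be installed before the sources are present, which I have not verified, and it is not necessarily the merged Dockerfile.

```dockerfile
FROM nvcr.io/nvidia/pytorch:24.07-py3
WORKDIR /app

# Slow step first: its layer stays cached as long as the packaging
# files are unchanged, even when the C++ sources below are edited.
COPY --chown=fast_llm setup.py setup.cfg pyproject.toml ./
RUN PIP_NO_INPUT=1 pip3 install --no-cache-dir -e ".[CORE,OPTIONAL,DEV]"

# Fast step second: a change under csrc only reruns the make.
COPY --chown=fast_llm fast_llm/csrc/ ./fast_llm/csrc/
RUN make -C ./fast_llm/csrc/
```

The user creation step is omitted here for brevity; --chown=fast_llm needs that user to exist in the image, so this fragment only builds once that step is placed before the first COPY.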

- Add `_sdp_dim`/`_sdp_active` to `LanguageModelLoss.__init__` so GSPO's SDP branch doesn't AttributeError on the first non-test call.
- Replace `document_index.max().item()` (and the SDP MAX all-reduce) with `len(kwargs[BlockKwargs.lengths])`: CPU-side, identical across SDP ranks, removes two GPU→CPU syncs per microbatch.
- Decorate `fused_gspo_loss_forward_backward` with `@torch.compile` for parity with GRPO. The `num_segments == 1` test case skips on CPU since torch._inductor's CPU codegen mishandles `index_add_` into a size-1 buffer (atomic_add scatter).
- Make `divisor` a required arg on `fused_gspo_loss_forward_backward`: the wrapper always overrides it with the global document count, and the previous local-rank default would silently mis-normalize under SDP.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Add `sp_group` arg to fused_gspo_loss_forward_backward and all-reduce the three segment buffers over it when sequence-parallel shards the sequence across the TP group; otherwise per-segment ratios use partial sums and produce silent corruption under SP. Wrapper passes `self._parallel_dim.group` when `_sequence_parallel` is active.
- Wire `num_labels_in_seq` through the GSPO test and assert `new_logprobs_fused` against the reference. Required aligning the reference to use scaled logits for new_logprobs (reusing `target_log_probabilities`), matching the kernel's behavior of reporting the loss-path log-probs.
- Drop the unreachable `max(num_segments, 1)` guard in the GSPO reference and the matching `divisor=max(num_segments, 1)` at the test call site.

SDP all-reduce branch coverage (review item 3) deferred to a follow-up adding a `gspo_loss` flag to `tests/layers/test_lm_head.py` alongside the existing GRPO config, with an SDP distributed variant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

tscholak added 3 commits October 10, 2024 14:25

refactor Dockerfile to separate dependency install and source code, p…

f824139

…reserve compiled C++ artifacts, and add sudo for runtime adjustments

refactor Dockerfile to separate dependency install and source code, p…

50de912

…reserve compiled C++ artifacts, and add sudo for runtime adjustments

add util-linux

9ae57e0

tscholak requested a review from jlamypoirier October 16, 2024 14:12

Merge branch 'main' into tscholak/tune_dockerfile

e868a6d

jlamypoirier reviewed Oct 16, 2024

tscholak added 3 commits October 16, 2024 10:51

address review comments

ade4b8e

Merge branch 'tscholak/tune_dockerfile' of github.com:ServiceNow/Fast…

f24cef2

…-LLM into tscholak/tune_dockerfile

address review comments

3c42057

tscholak closed this Oct 16, 2024

tscholak reopened this Oct 16, 2024

jlamypoirier reviewed Oct 17, 2024

address review comments

8219f7e

jlamypoirier approved these changes Oct 21, 2024

jlamypoirier merged commit a21f9b5 into main Oct 21, 2024

jlamypoirier deleted the tscholak/tune_dockerfile branch October 21, 2024 12:31

oleksost mentioned this pull request May 14, 2025

[bug] test_checkpoint test not passing when any lr scale is set to 0 #265

Closed
