StarDoc model training by akshaykalkunte · Pull Request #5 · ServiceNow/Fast-LLM

akshaykalkunte · 2024-10-17T01:31:46Z

WIP: StarDoc model integration into Fast-LLM

tscholak · 2024-11-11T15:01:11Z

Hi @jlamypoirier! @akshaykalkunte and I talked and we want to push this PR over the finish line. There's a lot going on here, and we should review the approach top-down to decide how it needs to be refactored to go into main. Off the top of my head, these are the separate concerns:

1. Model architecture: Are VLMs GPTs from the point of view of Fast-LLM? I think they aren't, because too much is different. We should add a new model architecture (e.g. "vlm") to Fast-LLM.
2. Data preprocessing: Related to Add prepare command #38, we should factor out data preprocessing and introduce an offline preprocessing step, `fast-llm prepare_data vlm --config stardoc.yaml`, that makes VLMMemmapDatasets and stores them on disk.
3. Vision encoder implementation: Right now it's a monolithic wrapper layer that uses a HF auto model. We should discuss if and when we reimplement this in Fast-LLM. This can be a separate effort and (as a side effect) result in yet another model class, vision_encoder, that we could also train from scratch if we wanted to.
4. Cross-attention instead of adapter layer: StarDoc is moving towards a special form of cross-attention between the vision encoder and the LM decoder. This likely has implications for parallelization.
5. Llama 3 support: StarDoc will use pre-trained Llama 3.2 (text-only?) models, so we need to be able to load them. See also [feat] Llama 3.x rope scaling support #39.
6. YAML configs: This PR currently doesn't support Fast-LLM's new YAML-based configs.

I think we can divide and conquer here.

tscholak · 2025-02-07T14:11:45Z

> 4. Cross-attention instead of adapter layer: StarDoc is moving towards a special form of cross-attention between the vision encoder and the LM decoder. This likely has implications for parallelization.

As @akshaykalkunte pointed out recently, AlignVLM will be the best path forward for this first implementation. I read the paper and I don't see any obstacles. The method is refreshingly simple.

tscholak · 2025-05-09T15:02:22Z

I think it's time to close this one since we have #227

Splits the policy-gradient loss config and class hierarchy:

- LanguageModelPolicyGradientLossConfig (abstract base): shared fields (epsilon_low/high, metrics, normalize_by_documents, temperature).
- LanguageModelGRPOLossConfig: registers `type: grpo` (keeps GRPO-only use_triton).
- LanguageModelGSPOLossConfig: registers `type: gspo`.
- LanguageModelPolicyGradientLoss (abstract base): shared __init__/_forward_backward/_register_extra_metrics/get_loss_definitions/get_preprocessing_config plumbing; abstract `_call_kernel`.
- LanguageModelGRPOLoss / LanguageModelGSPOLoss: each implements `_call_kernel` against its kernel; GSPO overrides `get_preprocessing_config` to add `return_document_index`.

Drops the stringly-typed `policy_loss: str` switch and the in-method if/else dispatch, addressing review items #1 and #5 plus Note 2.

YAML migration: `type: grpo` + `policy_loss: gspo` → `type: gspo`. No checked-in YAML configs use the old form.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
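The split described in this commit message (an abstract base per layer, one subclass per loss type, selection by the config's `type` key instead of a string switch) can be illustrated with a minimal sketch. This is not the Fast-LLM code: the shortened class names, the `register` decorator, `LOSS_CONFIG_REGISTRY`, `loss_from_dict`, the default field values, and the string "kernels" are all hypothetical stand-ins chosen to keep the example self-contained.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

# Hypothetical registry: maps the YAML `type` string to a config class.
LOSS_CONFIG_REGISTRY = {}


def register(type_name):
    """Class decorator that files a config class under its `type` string."""
    def decorator(cls):
        LOSS_CONFIG_REGISTRY[type_name] = cls
        return cls
    return decorator


@dataclass
class PolicyGradientLossConfig(ABC):
    # Fields shared by every policy-gradient loss (default values are made up).
    epsilon_low: float = 0.2
    epsilon_high: float = 0.2
    temperature: float = 1.0
    normalize_by_documents: bool = False

    @abstractmethod
    def build(self):
        """Return the loss object that matches this config."""


class PolicyGradientLoss(ABC):
    def __init__(self, config):
        self.config = config

    def forward_backward(self, logits, labels):
        # Shared plumbing would live here; only the kernel call differs.
        return self._call_kernel(logits, labels)

    def get_preprocessing_config(self):
        return {}

    @abstractmethod
    def _call_kernel(self, logits, labels):
        ...


class GRPOLoss(PolicyGradientLoss):
    def _call_kernel(self, logits, labels):
        return "grpo_kernel"  # placeholder for the real fused kernel


class GSPOLoss(PolicyGradientLoss):
    def get_preprocessing_config(self):
        # GSPO additionally asks preprocessing for a per-token document index.
        return {**super().get_preprocessing_config(), "return_document_index": True}

    def _call_kernel(self, logits, labels):
        return "gspo_kernel"  # placeholder for the real fused kernel


@register("grpo")
@dataclass
class GRPOLossConfig(PolicyGradientLossConfig):
    use_triton: bool = True  # GRPO-only option

    def build(self):
        return GRPOLoss(self)


@register("gspo")
@dataclass
class GSPOLossConfig(PolicyGradientLossConfig):
    def build(self):
        return GSPOLoss(self)


def loss_from_dict(raw):
    """Pick the config class from `type` instead of branching on a string."""
    fields = dict(raw)
    config_cls = LOSS_CONFIG_REGISTRY[fields.pop("type")]
    return config_cls(**fields).build()
```

In this sketch, `loss_from_dict({"type": "gspo"})` yields a `GSPOLoss` directly, which mirrors the YAML migration from `type: grpo` + `policy_loss: gspo` to `type: gspo`. An unknown `type` fails at lookup time rather than falling through an if/else chain.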

- Add `_sdp_dim`/`_sdp_active` to `LanguageModelLoss.__init__` so GSPO's SDP branch doesn't AttributeError on the first non-test call.
- Replace `document_index.max().item()` (and the SDP MAX all-reduce) with `len(kwargs[BlockKwargs.lengths])`: CPU-side, identical across SDP ranks, removes two GPU→CPU syncs per microbatch.
- Decorate `fused_gspo_loss_forward_backward` with `@torch.compile` for parity with GRPO. The `num_segments == 1` test case skips on CPU since torch._inductor's CPU codegen mishandles `index_add_` into a size-1 buffer (atomic_add scatter).
- Make `divisor` a required arg on `fused_gspo_loss_forward_backward`: the wrapper always overrides it with the global document count, and the previous local-rank default would silently mis-normalize under SDP.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
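The document-count points above can be made concrete with a small pure-Python sketch. It assumes a 0-based, per-token document index and an even split of the token sequence across sequence-parallel ranks; the helper names (`build_document_index`, `shard`, `local_document_counts`) are hypothetical and no part of Fast-LLM. It shows that the number of documents is already available as the length of the host-side lengths list, and that a count derived from one rank's shard can differ from the global count.

```python
def build_document_index(lengths):
    """Expand per-document lengths into a per-token document id (0-based)."""
    return [doc for doc, n in enumerate(lengths) for _ in range(n)]


def shard(tokens, rank, world_size):
    """Give each rank an equal, contiguous slice of the token sequence."""
    chunk = len(tokens) // world_size
    return tokens[rank * chunk:(rank + 1) * chunk]


def local_document_counts(document_index, world_size):
    """Number of distinct documents each rank sees in its own shard."""
    return [
        len(set(shard(document_index, rank, world_size)))
        for rank in range(world_size)
    ]


lengths = [3, 2, 3]                       # three documents, eight tokens
document_index = build_document_index(lengths)

# Global count from the lengths list: no tensor reduction, same on every rank.
global_count = len(lengths)

# Equivalent count from the index itself, which on a GPU tensor would need
# a reduction followed by a device-to-host copy.
count_from_index = max(document_index) + 1

# Per-rank counts under a two-way split of the sequence.
per_rank = local_document_counts(document_index, world_size=2)
```

With these lengths, each of the two shards touches only two of the three documents, so dividing by a per-rank count would normalize by 2 instead of 3. Requiring the caller to pass the divisor explicitly avoids relying on such a local default.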

akshaykalkunte and others added 5 commits October 9, 2024 23:22

stardoc_init

f9dc5d6

Code cleanup

162438f

Merge remote-tracking branch 'origin/main' into akshay/stardoc

2960957

cleanup 2

b2d0c6e

Merge branch 'main' into akshay/stardoc

14f82ed

jlamypoirier mentioned this pull request Oct 25, 2024

[feat] Integrate dataset re-weighting and preprocessing into Fast-LLM for streamlined data loading #25

Closed

tscholak mentioned this pull request Nov 11, 2024

Add prepare command #38

Merged

24 tasks

akshaykalkunte added 3 commits November 14, 2024 00:24

Re-factor to new changes in public repo

00be01f

Merge remote-tracking branch 'origin' into akshay/stardoc

18fded2

Merge remote-tracking branch 'origin/main' into akshay/stardoc

e271ac6

jlamypoirier mentioned this pull request Dec 17, 2024

Fix llama conversion, improve parameter conversion #94

Merged

8 tasks

tscholak closed this May 9, 2025

jlamypoirier deleted the akshay/stardoc branch September 19, 2025 01:08

akshaykalkunte wants to merge 8 commits into main from akshay/stardoc


2 participants
