Add feature request template #13
Merged
jlamypoirier approved these changes Oct 22, 2024
jlamypoirier added a commit that referenced this pull request May 6, 2026
- Drop unused self._preprocessing_config store in Trainer.setup.
- Replace torch.ones + index_add_ with torch.bincount for tok_sum in fused_gspo_loss_forward_backward.
- Drop load-bearing-sounding docs_per_step reference from the normalize_by_documents field description (no cross-config check exists to enforce it).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
This PR adds three features to support systematic evaluation of Apriel2 mixer placements:
1. HuggingFace Revision Support
When A/B testing model changes (e.g., comparing old vs new modeling files), we need to specify exact HuggingFace revisions. This was used to compare GSM8K generative results between different commits of apriel2-0.5b-dev.
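As a minimal sketch of how a pinned revision could be threaded through to the loaders: `BackendConfigSketch`, `revision_kwargs`, and the revision string below are hypothetical names for illustration, not the PR's actual classes. The only library behaviour relied on is that the Hugging Face `from_pretrained` methods accept a `revision` keyword (branch, tag, or commit hash).

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class BackendConfigSketch:
    """Hypothetical stand-in for a backend config carrying a pinned revision."""

    model_name: str
    # Branch name, tag, or commit hash; None means "use the repository default".
    revision: Optional[str] = None


def revision_kwargs(config: BackendConfigSketch) -> dict[str, Any]:
    """Return the extra keyword arguments to pass to every from_pretrained call."""
    if config.revision is None:
        return {}
    return {"revision": config.revision}


# Intended use (not executed here, since it needs network access):
#   AutoTokenizer.from_pretrained(cfg.model_name, **revision_kwargs(cfg))
#   AutoConfig.from_pretrained(cfg.model_name, **revision_kwargs(cfg))
#   AutoModelForCausalLM.from_pretrained(cfg.model_name, **revision_kwargs(cfg))
cfg = BackendConfigSketch("apriel2-0.5b-dev", revision="abc1234")  # placeholder hash
print(revision_kwargs(cfg))
```

Building the keyword arguments in one place is one way to keep the tokenizer, config, and model loads on the same commit, which is what an A/B comparison between revisions needs.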
- Add revision field to TransformersClientConfig and TransformersBackendConfig
- Pass revision to the from_pretrained calls (tokenizer, config, model)
- Add revision: null to transformers.yaml config
2. Truncation Observability
Previously, truncated completions (hit max_tokens without EOS) were silently skipped, making it impossible to understand truncation rates or debug missing samples.
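The change in behaviour can be sketched roughly as follows. `GenerationSketch`, `score`, `keep_for_importance_sampling`, and `truncation_rate` are hypothetical names, and exact string match stands in for whatever answer checking the evaluation really uses. The sketch only illustrates the policy: truncated completions are recorded as incorrect rather than dropped, and are excluded when traces are reused for importance sampling.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GenerationSketch:
    """Hypothetical generation record with a truncation flag."""

    text: str
    truncated: bool  # True when max_tokens was hit without an EOS token
    correct: Optional[bool] = None


def score(gen: GenerationSketch, reference: str) -> GenerationSketch:
    """Score one generation; a truncated completion counts as a failure."""
    if gen.truncated:
        gen.correct = False
    else:
        # Assumption: exact match after stripping whitespace.
        gen.correct = gen.text.strip() == reference
    return gen


def keep_for_importance_sampling(gens: List[GenerationSketch]) -> List[GenerationSketch]:
    """Drop truncated traces, whose reasoning chains are incomplete."""
    return [g for g in gens if not g.truncated]


def truncation_rate(gens: List[GenerationSketch]) -> float:
    """Fraction of generations that were cut off; 0.0 for an empty list."""
    return sum(g.truncated for g in gens) / len(gens) if gens else 0.0
```

Keeping truncated samples in the scored set is what makes the truncation rate measurable at all; filtering happens only at the point where incomplete traces would be invalid.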
- Add truncated: bool field to BaseGeneration
- Mark truncated completions as correct=False instead of skipping them (truncation = failed to produce complete answer)
- TraceDataset.load_from_directory filters out truncated traces for importance sampling (incomplete reasoning chains are invalid for IS)
3. MMLU Experiment Configs
Convenience configs for systematic MMLU likelihood evaluations:
- eval_mmlu_attention.yaml - all-attention baseline
- eval_mmlu_every{2nd,3rd,4th}_{gdn,kda}.yaml - 50%/33%/25% mixer allocations
Usage:
python -m place_layers +experiments=eval_mmlu_every4th_gdn
vLLM vs Transformers Alignment Verification
Comparison of vLLM and Transformers (reference) implementations using test_apriel2.py.
Kernel Configuration
The Transformers implementation can use either upstream FLA/causal-conv1d kernels or vLLM's forked versions. This is controlled by flags in modeling_apriel2.py:
Test Methodology
Model Configurations Tested
- attn-swa
- every2nd-gdn
- every5th-kda
Results: Upstream Kernels (USE_VLLM_*=False)
Prefill (logits test, bfloat16):
Prefill + Decode (compare test, bfloat16):
Results: vLLM Kernels (USE_VLLM_*=True)
Prefill (logits test, bfloat16):
Prefill + Decode (compare test, bfloat16):
Summary Comparison
Key Findings
Sliding window attention decode diverges regardless of kernel config - this is due to different FlashAttention implementations (vLLM paged KV cache vs Transformers contiguous cache)
GDN with upstream FLA kernels matches decode (0.106 diff), but GDN with vLLM kernels mismatches (3.62 diff). The vLLM fork of FLA ops behaves differently when used in Transformers' state management context.
GDN prefill is tighter with vLLM kernels (0.0001 vs 0.087) - the chunk_gated_delta_rule implementations differ between upstream FLA and vLLM's fork
KDA matches with both kernel configs - both prefill and decode are aligned
Full attention (non-sliding-window) matches - the 80% full attention in KDA model and 50% in GDN model work correctly
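The page does not say how the reported diff values (for example 0.106 vs 3.62) are computed. Assuming they are maximum absolute differences between the two backends' logits, a comparison of that kind could be sketched as below. `max_abs_diff`, `is_aligned`, and the tolerance argument are hypothetical and are not taken from test_apriel2.py.

```python
from typing import Sequence


def max_abs_diff(reference: Sequence[float], candidate: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two logit vectors."""
    if len(reference) != len(candidate):
        raise ValueError("logit vectors must have the same length")
    if not reference:
        raise ValueError("logit vectors must not be empty")
    return max(abs(r - c) for r, c in zip(reference, candidate))


def is_aligned(reference: Sequence[float], candidate: Sequence[float], tolerance: float) -> bool:
    """Treat the backends as aligned when the worst-case gap is within tolerance."""
    return max_abs_diff(reference, candidate) <= tolerance


# Toy values only; these are not results from the PR.
reference_logits = [1.0, 2.0, 3.0]
candidate_logits = [1.0, 2.5, 2.0]
print(max_abs_diff(reference_logits, candidate_logits))
```

A worst-case metric like this is sensitive to a single divergent position, which suits decode comparisons where one bad step can change every later token.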
Root Cause Analysis
Sliding Window Attention:
GDN:
fused_recurrent_gated_delta_rule expects vLLM-specific state management that differs from how Transformers manages recurrent state
KDA:
Implications
USE_VLLM_GDN_OPS=False) for backend consistency
Recommended Configuration
For maximum vLLM/Transformers alignment in generative tasks:
Test plan
truncated=True and correct=False
🤖 Generated with Claude Code