Add feature request template by tscholak · Pull Request #13 · ServiceNow/Fast-LLM

tscholak · 2024-10-21T17:54:37Z

Summary

This PR adds three features to support systematic evaluation of Apriel2 mixer placements:

1. HuggingFace Revision Support

When A/B testing model changes (e.g., comparing old vs new modeling files), we need to specify exact HuggingFace revisions. This was used to compare GSM8K generative results between different commits of apriel2-0.5b-dev.

- Added revision field to TransformersClientConfig and TransformersBackendConfig
- Passed revision to all from_pretrained calls (tokenizer, config, model)
- Added revision: null to the transformers.yaml config
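
A minimal sketch of how one revision field could be threaded through every from_pretrained call. The TransformersBackendConfig below is a simplified stand-in, not the class changed in this PR; the model_name field, the pretrained_kwargs helper, and the "org/" repository prefix are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class TransformersBackendConfig:
    """Simplified stand-in for the real config class (assumption)."""

    model_name: str
    revision: Optional[str] = None  # None means "use the repo's default branch"


def pretrained_kwargs(cfg: TransformersBackendConfig) -> Dict[str, Any]:
    """Build the keyword arguments shared by tokenizer, config, and model loading."""
    kwargs: Dict[str, Any] = {"pretrained_model_name_or_path": cfg.model_name}
    if cfg.revision is not None:
        # A branch name, tag, or commit hash pins the exact files that are loaded.
        kwargs["revision"] = cfg.revision
    return kwargs


# "org/" is a placeholder namespace; the hash is the abbreviated one from the test plan.
pinned = TransformersBackendConfig("org/apriel2-0.5b-dev", revision="308c3687")
default = TransformersBackendConfig("org/apriel2-0.5b-dev")
```

With transformers installed, the same dictionary could be unpacked into AutoTokenizer.from_pretrained, AutoConfig.from_pretrained, and AutoModelForCausalLM.from_pretrained, each of which accepts a revision argument. Building the arguments once keeps the three loads pinned to the same commit.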

2. Truncation Observability

Previously, truncated completions (hit max_tokens without EOS) were silently skipped, making it impossible to understand truncation rates or debug missing samples.

- Added truncated: bool field to BaseGeneration
- Truncated samples are now saved with correct=False instead of being skipped (a truncated completion failed to produce a complete answer)
- TraceDataset.load_from_directory filters out truncated traces for importance sampling (incomplete reasoning chains are invalid for IS)
- Changed the skip reason from "all samples truncated" to "no samples returned" (an empty record now indicates a backend issue, not truncation)
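
The behavior described above can be sketched roughly as follows. Generation is a simplified stand-in for BaseGeneration, and record, usable_for_importance_sampling, and truncation_rate are hypothetical helper names, not functions from this PR.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Generation:
    """Simplified stand-in for BaseGeneration (assumption)."""

    text: str
    correct: bool
    truncated: bool = False


def record(text: str, hit_max_tokens: bool, saw_eos: bool, is_correct: bool) -> Generation:
    """Keep every sample; a truncated one is stored as incorrect rather than dropped."""
    truncated = hit_max_tokens and not saw_eos
    return Generation(text=text, correct=is_correct and not truncated, truncated=truncated)


def usable_for_importance_sampling(samples: List[Generation]) -> List[Generation]:
    """Drop truncated traces, since an incomplete reasoning chain is invalid for IS."""
    return [s for s in samples if not s.truncated]


def truncation_rate(samples: List[Generation]) -> float:
    """Fraction of saved samples that were cut off before EOS."""
    return sum(s.truncated for s in samples) / len(samples) if samples else 0.0


samples = [
    record("... #### 42", hit_max_tokens=False, saw_eos=True, is_correct=True),
    record("... so we add", hit_max_tokens=True, saw_eos=False, is_correct=False),
]
```

Because truncated samples stay in the saved records, the truncation rate can be computed directly, while the importance-sampling path still sees only complete traces.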

3. MMLU Experiment Configs

Convenience configs for systematic MMLU likelihood evaluations:

- eval_mmlu_attention.yaml - all-attention baseline
- eval_mmlu_every{2nd,3rd,4th}_{gdn,kda}.yaml - 50%/33%/25% mixer allocations

Usage: python -m place_layers +experiments=eval_mmlu_every4th_gdn
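
For reference, the brace pattern above expands to six mixer configs plus the baseline. The sketch below enumerates them and prints the matching command line for each; the experiment_names helper is hypothetical, and the mixer share is derived from the "every Nth" naming, not read from the YAML files.

```python
from fractions import Fraction
from typing import Dict

ORDINALS = {2: "2nd", 3: "3rd", 4: "4th"}
MIXERS = ("gdn", "kda")


def experiment_names() -> Dict[str, Fraction]:
    """Map each experiment name to the fraction of layers using the alternative mixer."""
    names = {"eval_mmlu_attention": Fraction(0)}  # all-attention baseline
    for period, suffix in ORDINALS.items():
        for mixer in MIXERS:
            # "every Nth" layer uses the mixer, so its share is 1/N (assumption).
            names[f"eval_mmlu_every{suffix}_{mixer}"] = Fraction(1, period)
    return names


for name, share in experiment_names().items():
    print(f"python -m place_layers +experiments={name}  # mixer share {float(share):.0%}")
```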

vLLM vs Transformers Alignment Verification

Comparison of vLLM and Transformers (reference) implementations using test_apriel2.py.

Kernel Configuration

The Transformers implementation can use either upstream FLA/causal-conv1d kernels or vLLM's forked versions. This is controlled by flags in modeling_apriel2.py:

USE_VLLM_CONV: bool        # True: vLLM's causal_conv1d; False: upstream causal-conv1d
USE_VLLM_GDN_OPS: bool     # True: vLLM's chunk_gated_delta_rule / fused_recurrent_gated_delta_rule; False: upstream FLA
USE_VLLM_GATED_NORM: bool  # True: vLLM's FusedRMSNormGated; False: upstream

Test Methodology

Test	Description	Parameters
logits	Prefill only - single forward pass, compare output logprobs	prompt=50 tokens, no generation
compare	Prefill + decode - generate tokens and compare at each position	prompt=50 tokens, decode=10 tokens, batch=1
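
The kind of per-position comparison these tests report can be sketched as below. This is not the code in test_apriel2.py: the compare_runs name, the rule that "match" means identical tokens at every position, and the 0-based position index are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def compare_runs(
    ref_tokens: List[int],
    test_tokens: List[int],
    ref_logprobs: List[float],
    test_logprobs: List[float],
) -> Tuple[bool, float, float, Optional[int]]:
    """Return (match, avg_abs_diff, max_abs_diff, first_mismatch_position)."""
    lengths = {len(ref_tokens), len(test_tokens), len(ref_logprobs), len(test_logprobs)}
    if len(lengths) != 1 or not ref_tokens:
        raise ValueError("all sequences must be non-empty and of equal length")
    # Absolute logprob difference at each position.
    diffs = [abs(a - b) for a, b in zip(ref_logprobs, test_logprobs)]
    # First position where the two backends chose different tokens, if any.
    first_mismatch = next(
        (i for i, (a, b) in enumerate(zip(ref_tokens, test_tokens)) if a != b), None
    )
    return first_mismatch is None, sum(diffs) / len(diffs), max(diffs), first_mismatch


result = compare_runs(
    ref_tokens=[5, 7, 9],
    test_tokens=[5, 8, 9],
    ref_logprobs=[-0.5, -1.0, -2.0],
    test_logprobs=[-0.5, -1.5, -2.25],
)
```

Under these assumptions a run can show a sizeable maximum logprob difference and still count as a match, as long as the chosen tokens agree.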

Model Configurations Tested

Model	Attention Layers	Other Mixer	Layer Composition
attn-swa	50% full + 50% sliding window (4096)	-	Alternating full/SWA attention
every2nd-gdn	50% full attention	50% GDN	Alternating attn/GDN
every5th-kda	80% full attention	20% KDA	4 attn + 1 KDA pattern
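
The layer compositions in the table can be illustrated with a small sketch. The layer_pattern helper, the layer counts (8 and 10), and the choice to put the mixer on the last layer of each period are assumptions; the table only fixes the ratios.

```python
from typing import List


def layer_pattern(num_layers: int, period: int, mixer: str) -> List[str]:
    """Place `mixer` on every `period`-th layer and full attention elsewhere."""
    if period < 2:
        raise ValueError("period must be at least 2")
    return [
        mixer if (index + 1) % period == 0 else "attention"
        for index in range(num_layers)
    ]


every2nd_gdn = layer_pattern(num_layers=8, period=2, mixer="gdn")
every5th_kda = layer_pattern(num_layers=10, period=5, mixer="kda")
```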

Results: Upstream Kernels (USE_VLLM_*=False)

Prefill (logits test, bfloat16):

Model	Match	Avg Logprob Diff	Max Logprob Diff
attn-swa	✅ YES	0.0001	0.0001
every2nd-gdn	✅ YES	0.087	0.183
every5th-kda	✅ YES	0.015	0.059

Prefill + Decode (compare test, bfloat16):

Model	Match	Avg Diff	Max Diff	First Mismatch
attn-swa	❌ NO	6.96	21.0	Position 1
every2nd-gdn	✅ YES	0.106	0.742	-
every5th-kda	✅ YES	0.101	0.865	-

Results: vLLM Kernels (USE_VLLM_*=True)

Prefill (logits test, bfloat16):

Model	Match	Avg Logprob Diff	Max Logprob Diff
attn-swa	✅ YES	0.0001	0.0001
every2nd-gdn	✅ YES	0.0001	0.0001
every5th-kda	✅ YES	0.030	0.037

Prefill + Decode (compare test, bfloat16):

Model	Match	Avg Diff	Max Diff	First Mismatch
attn-swa	❌ NO	6.96	21.0	Position 1
every2nd-gdn	❌ NO	3.62	16.6	Position 3
every5th-kda	✅ YES	0.11	0.58	-

Summary Comparison

Model	Kernel Config	Prefill Diff	Decode Match	Decode Diff
attn-swa	upstream	0.0001	❌ NO	6.96
attn-swa	vLLM	0.0001	❌ NO	6.96
every2nd-gdn	upstream	0.087	✅ YES	0.106
every2nd-gdn	vLLM	0.0001	❌ NO	3.62
every5th-kda	upstream	0.015	✅ YES	0.101
every5th-kda	vLLM	0.030	✅ YES	0.11

Key Findings

- Sliding-window attention decode diverges regardless of kernel config. This is attributed to different FlashAttention implementations (vLLM's paged KV cache vs Transformers' contiguous cache).
- GDN with upstream FLA kernels matches on decode (0.106 diff), but GDN with vLLM kernels mismatches (3.62 diff). The vLLM fork of the FLA ops behaves differently when used within Transformers' state management.
- GDN prefill is tighter with vLLM kernels (0.0001 vs 0.087): the chunk_gated_delta_rule implementations differ between upstream FLA and vLLM's fork.
- KDA matches with both kernel configs: both prefill and decode are aligned.
- Full (non-sliding-window) attention matches: the full-attention layers in the KDA model (80%) and the GDN model (50%) do not cause divergence.

Root Cause Analysis

Sliding Window Attention:

- Diverges at decode position 1 in bfloat16, regardless of kernel flags
- Root cause: different FlashAttention KV cache implementations
  - vLLM: paged KV cache with block tables
  - Transformers: contiguous KV cache
- Matches in float32 (eager attention)
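
To make the two cache layouts concrete, here is a toy sketch of how the same logical position is addressed in each. It only illustrates the addressing and does not reproduce vLLM's or Transformers' actual data structures or kernels; BLOCK_SIZE, read_contiguous, and read_paged are hypothetical.

```python
from typing import List

BLOCK_SIZE = 4  # tokens per block; an arbitrary value for illustration


def read_contiguous(cache: List[str], position: int) -> str:
    """Contiguous layout: the token position is the index."""
    return cache[position]


def read_paged(blocks: List[List[str]], block_table: List[int], position: int) -> str:
    """Paged layout: the block table maps a logical block to a physical block."""
    physical_block = block_table[position // BLOCK_SIZE]
    return blocks[physical_block][position % BLOCK_SIZE]


# Six cached entries stored contiguously.
contiguous = [f"kv{i}" for i in range(6)]
# The same entries scattered over physical blocks 2 and 0 (block 1 is unused).
blocks = [["kv4", "kv5", "", ""], ["", "", "", ""], ["kv0", "kv1", "kv2", "kv3"]]
block_table = [2, 0]
```

Both layouts return the same logical entries, which is consistent with the float32 results matching. The bfloat16 divergence is attributed above to the FlashAttention implementations built around each layout, not to the stored values.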

GDN:

- With upstream FLA: prefill has a higher diff (0.087) but decode matches (0.106)
- With vLLM kernels: prefill is tighter (0.0001) but decode diverges (3.62)
- Root cause: vLLM's fused_recurrent_gated_delta_rule expects vLLM-specific state management that differs from how Transformers manages recurrent state

KDA:

- Matches with both kernel configs
- Both chunk and recurrent kernels are aligned between implementations

Implications

- For likelihood-based evaluation (MMLU): all models are reliable (prefill only)
- For generative evaluation (GSM8K):
  - KDA models: reliable across backends with either kernel config
  - GDN models: use upstream FLA kernels (USE_VLLM_GDN_OPS=False) for backend consistency
  - Sliding-window attention: will produce different outputs between vLLM and Transformers in bf16

Recommended Configuration

For maximum vLLM/Transformers alignment in generative tasks:

USE_VLLM_CONV = False      # Use upstream causal_conv1d
USE_VLLM_GDN_OPS = False   # Use upstream FLA (critical for GDN decode!)
USE_VLLM_GATED_NORM = False

Test plan

- Tested revision parameter with different apriel2-0.5b-dev commits (308c3687 vs 98ed07b3)
- Verified truncated completions are saved with truncated=True and correct=False
- Ran full GSM8K evaluations comparing old/new model revisions with GDN/KDA at 25% allocation
- Ran MMLU likelihood evaluations with the new experiment configs
- Verified vLLM vs Transformers alignment with both upstream and vLLM kernel configurations

🤖 Generated with Claude Code

- Drop unused self._preprocessing_config store in Trainer.setup.
- Replace torch.ones + index_add_ with torch.bincount for tok_sum in fused_gspo_loss_forward_backward.
- Drop load-bearing-sounding docs_per_step reference from the normalize_by_documents field description (no cross-config check exists to enforce it).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add feature request template

9c46932

tscholak requested a review from jlamypoirier October 21, 2024 17:54

jlamypoirier approved these changes Oct 22, 2024


tscholak merged commit 0b52339 into main Oct 22, 2024

tscholak deleted the tscholak/add-feature-request-template branch October 22, 2024 13:45

tscholak modified the milestone: 0.2.0 Oct 25, 2024
