Add feature request template #13

Merged: tscholak merged 1 commit into main from tscholak/add-feature-request-template on Oct 22, 2024

tscholak (Collaborator) commented on Oct 21, 2024

Summary

This PR adds three features to support systematic evaluation of Apriel2 mixer placements:

1. HuggingFace Revision Support

When A/B testing model changes (e.g., comparing old vs new modeling files), we need to specify exact HuggingFace revisions. This was used to compare GSM8K generative results between different commits of apriel2-0.5b-dev.

  • Added revision field to TransformersClientConfig and TransformersBackendConfig
  • Passed to all from_pretrained calls (tokenizer, config, model)
  • Added revision: null to transformers.yaml config
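
A minimal sketch of how a single `revision` value might be threaded through every `from_pretrained` call. The class body, the `model_name` field, the helper names, and the `org/apriel2-0.5b-dev` repository id are hypothetical; only the `revision` field and the three load targets (tokenizer, config, model) come from the description above.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class TransformersBackendConfig:
    """Hypothetical stand-in for the PR's config class."""

    model_name: str
    # None lets the Hub resolve its default revision (matches `revision: null`).
    revision: Optional[str] = None


def pretrained_kwargs(config: TransformersBackendConfig) -> dict[str, Any]:
    """Keyword arguments shared by the tokenizer, config, and model loads."""
    return {"revision": config.revision}


def load_all(config: TransformersBackendConfig):
    """Load tokenizer, config, and model pinned to the same revision."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

    kwargs = pretrained_kwargs(config)
    tokenizer = AutoTokenizer.from_pretrained(config.model_name, **kwargs)
    model_config = AutoConfig.from_pretrained(config.model_name, **kwargs)
    model = AutoModelForCausalLM.from_pretrained(
        config.model_name, config=model_config, **kwargs
    )
    return tokenizer, model_config, model
```

For an A/B comparison, one would build two configs that differ only in `revision`, for example `TransformersBackendConfig("org/apriel2-0.5b-dev", revision="308c3687")` and the same with `revision="98ed07b3"`, and run the same evaluation on each.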

2. Truncation Observability

Previously, truncated completions (those that hit max_tokens without emitting EOS) were silently skipped, which made it impossible to measure truncation rates or debug missing samples.

  • Added truncated: bool field to BaseGeneration
  • Save truncated samples with correct=False instead of skipping them (truncation = failed to produce complete answer)
  • TraceDataset.load_from_directory filters out truncated traces for importance sampling (incomplete reasoning chains are invalid for IS)
  • Updated skip reasons from "all samples truncated" → "no samples returned" (empty records now indicate backend issues, not truncation)
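
The behaviour described above can be sketched roughly as follows. Only the `truncated` and `correct` field names come from this PR; the `text` field, the helper functions, and the assumption that the backend reports a `"length"` finish reason on truncation are illustrative.

```python
from dataclasses import dataclass


@dataclass
class BaseGeneration:
    """Hypothetical minimal record; the real class has more fields."""

    text: str
    correct: bool
    truncated: bool = False


def record_generation(text: str, finish_reason: str, is_correct: bool) -> BaseGeneration:
    # Assumption: the backend reports "length" when max_tokens is hit without EOS.
    truncated = finish_reason == "length"
    # A truncated completion is saved, but it never counts as correct.
    return BaseGeneration(text=text, correct=is_correct and not truncated, truncated=truncated)


def filter_for_importance_sampling(traces: list[BaseGeneration]) -> list[BaseGeneration]:
    """Drop truncated traces: incomplete reasoning chains are invalid for IS."""
    return [t for t in traces if not t.truncated]


def truncation_rate(traces: list[BaseGeneration]) -> float:
    """Fraction of saved samples that were truncated (0.0 for an empty list)."""
    return sum(t.truncated for t in traces) / len(traces) if traces else 0.0
```

Because truncated samples are now stored instead of dropped, the truncation rate can be computed directly from the saved records.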

3. MMLU Experiment Configs

Convenience configs for systematic MMLU likelihood evaluations:

  • eval_mmlu_attention.yaml - all-attention baseline
  • eval_mmlu_every{2nd,3rd,4th}_{gdn,kda}.yaml - 50%/33%/25% mixer allocations

Usage: python -m place_layers +experiments=eval_mmlu_every4th_gdn
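
To run the whole sweep, the config names can be enumerated from the brace pattern above. This is a small sketch: the helper names are hypothetical, and it only builds command strings in the form shown in the usage line without executing anything.

```python
from itertools import product

# Stride label -> stride; every Nth layer uses the alternative mixer.
STRIDES = {"2nd": 2, "3rd": 3, "4th": 4}
MIXERS = ("gdn", "kda")


def experiment_names() -> list[str]:
    """All-attention baseline plus every stride/mixer combination."""
    names = ["eval_mmlu_attention"]
    names += [f"eval_mmlu_every{s}_{m}" for s, m in product(STRIDES, MIXERS)]
    return names


def command(name: str) -> str:
    """Command line in the form given in the usage line."""
    return f"python -m place_layers +experiments={name}"


def mixer_fraction(stride_label: str) -> float:
    """Share of layers given to the mixer, e.g. every 4th layer -> 0.25."""
    return 1 / STRIDES[stride_label]


if __name__ == "__main__":
    for name in experiment_names():
        print(command(name))
```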

vLLM vs Transformers Alignment Verification

Comparison of vLLM and Transformers (reference) implementations using test_apriel2.py.

Kernel Configuration

The Transformers implementation can use either upstream FLA/causal-conv1d kernels or vLLM's forked versions. This is controlled by flags in modeling_apriel2.py:

USE_VLLM_CONV = False        # bool; True selects vLLM's causal_conv1d, False selects upstream
USE_VLLM_GDN_OPS = False     # bool; True selects vLLM's chunk_gated_delta_rule / fused_recurrent_gated_delta_rule, False selects upstream FLA
USE_VLLM_GATED_NORM = False  # bool; True selects vLLM's FusedRMSNormGated, False selects upstream

Test Methodology

| Test | Description | Parameters |
|---|---|---|
| logits | Prefill only - single forward pass, compare output logprobs | prompt=50 tokens, no generation |
| compare | Prefill + decode - generate tokens and compare at each position | prompt=50 tokens, decode=10 tokens, batch=1 |
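
The metrics reported in the result tables below can be computed along these lines. This is a sketch and not the code in test_apriel2.py: the function names are hypothetical, and treating "First Mismatch" as the first position where the generated token ids differ is an assumption.

```python
from typing import Optional, Sequence


def logprob_diff(reference: Sequence[float], candidate: Sequence[float]) -> tuple[float, float]:
    """Return (average, maximum) absolute logprob difference per position."""
    if len(reference) != len(candidate):
        raise ValueError("sequences must have the same length")
    if not reference:
        raise ValueError("sequences must be non-empty")
    diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    return sum(diffs) / len(diffs), max(diffs)


def first_mismatch(ref_tokens: Sequence[int], cand_tokens: Sequence[int]) -> Optional[int]:
    """Index of the first differing token id, or None if all compared positions agree."""
    for position, (r, c) in enumerate(zip(ref_tokens, cand_tokens)):
        if r != c:
            return position
    return None
```

Here `reference` would hold the Transformers logprobs and `candidate` the vLLM logprobs for the same prompt.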

Model Configurations Tested

| Model | Attention Layers | Other Mixer | Layer Composition |
|---|---|---|---|
| attn-swa | 50% full + 50% sliding window (4096) | - | Alternating full/SWA attention |
| every2nd-gdn | 50% full attention | 50% GDN | Alternating attn/GDN |
| every5th-kda | 80% full attention | 20% KDA | 4 attn + 1 KDA pattern |
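
As an illustration of these compositions, the sketch below builds a per-layer mixer list from a stride. The function names and layer counts are hypothetical, and placing the alternative mixer in the last slot of each group is an assumption; the table only fixes the proportions.

```python
def layer_pattern(num_layers: int, mixer: str, every: int) -> list[str]:
    """Use `mixer` for every `every`-th layer and full attention elsewhere."""
    if every < 1:
        raise ValueError("every must be >= 1")
    # Assumption: the mixer takes the last slot in each group of `every` layers.
    return [mixer if (i + 1) % every == 0 else "attention" for i in range(num_layers)]


def mixer_share(pattern: list[str], mixer: str) -> float:
    """Fraction of layers in `pattern` that use `mixer`."""
    return pattern.count(mixer) / len(pattern)


every2nd_gdn = layer_pattern(num_layers=4, mixer="gdn", every=2)
every5th_kda = layer_pattern(num_layers=10, mixer="kda", every=5)
```

With these inputs, `every2nd_gdn` alternates attention and GDN, and `every5th_kda` repeats four attention layers followed by one KDA layer.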

Results: Upstream Kernels (USE_VLLM_*=False)

Prefill (logits test, bfloat16):

| Model | Match | Avg Logprob Diff | Max Logprob Diff |
|---|---|---|---|
| attn-swa | ✅ YES | 0.0001 | 0.0001 |
| every2nd-gdn | ✅ YES | 0.087 | 0.183 |
| every5th-kda | ✅ YES | 0.015 | 0.059 |

Prefill + Decode (compare test, bfloat16):

| Model | Match | Avg Diff | Max Diff | First Mismatch |
|---|---|---|---|---|
| attn-swa | ❌ NO | 6.96 | 21.0 | Position 1 |
| every2nd-gdn | ✅ YES | 0.106 | 0.742 | - |
| every5th-kda | ✅ YES | 0.101 | 0.865 | - |

Results: vLLM Kernels (USE_VLLM_*=True)

Prefill (logits test, bfloat16):

| Model | Match | Avg Logprob Diff | Max Logprob Diff |
|---|---|---|---|
| attn-swa | ✅ YES | 0.0001 | 0.0001 |
| every2nd-gdn | ✅ YES | 0.0001 | 0.0001 |
| every5th-kda | ✅ YES | 0.030 | 0.037 |

Prefill + Decode (compare test, bfloat16):

| Model | Match | Avg Diff | Max Diff | First Mismatch |
|---|---|---|---|---|
| attn-swa | ❌ NO | 6.96 | 21.0 | Position 1 |
| every2nd-gdn | ❌ NO | 3.62 | 16.6 | Position 3 |
| every5th-kda | ✅ YES | 0.11 | 0.58 | - |

Summary Comparison

| Model | Kernel Config | Prefill Diff | Decode Match | Decode Diff |
|---|---|---|---|---|
| attn-swa | upstream | 0.0001 | ❌ NO | 6.96 |
| attn-swa | vLLM | 0.0001 | ❌ NO | 6.96 |
| every2nd-gdn | upstream | 0.087 | ✅ YES | 0.106 |
| every2nd-gdn | vLLM | 0.0001 | ❌ NO | 3.62 |
| every5th-kda | upstream | 0.015 | ✅ YES | 0.101 |
| every5th-kda | vLLM | 0.030 | ✅ YES | 0.11 |
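
For readers who want to query these numbers programmatically, the summary rows can be encoded as plain tuples. The values are copied from the table above; the tuple layout and helper names are illustrative only.

```python
# (model, kernel config, prefill diff, decode match, decode diff)
RESULTS = [
    ("attn-swa", "upstream", 0.0001, False, 6.96),
    ("attn-swa", "vLLM", 0.0001, False, 6.96),
    ("every2nd-gdn", "upstream", 0.087, True, 0.106),
    ("every2nd-gdn", "vLLM", 0.0001, False, 3.62),
    ("every5th-kda", "upstream", 0.015, True, 0.101),
    ("every5th-kda", "vLLM", 0.030, True, 0.11),
]


def decode_mismatches(results=RESULTS) -> list[tuple[str, str]]:
    """(model, kernel config) pairs whose decode output diverged."""
    return [(model, kernel) for model, kernel, _, match, _ in results if not match]


def kernels_with_decode_match(model: str, results=RESULTS) -> list[str]:
    """Kernel configs under which `model` matched during decode."""
    return [kernel for m, kernel, _, match, _ in results if m == model and match]
```

Calling `kernels_with_decode_match("every2nd-gdn")` returns only the upstream config, while the same call for `attn-swa` returns an empty list.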

Key Findings

  1. Sliding window attention decode diverges regardless of kernel config - this is due to different FlashAttention implementations (vLLM paged KV cache vs Transformers contiguous cache)

  2. GDN with upstream FLA kernels matches decode (0.106 diff), but GDN with vLLM kernels mismatches (3.62 diff). The vLLM fork of FLA ops behaves differently when used in Transformers' state management context.

  3. GDN prefill is tighter with vLLM kernels (0.0001 vs 0.087) - the chunk_gated_delta_rule implementations differ between upstream FLA and vLLM's fork

  4. KDA matches with both kernel configs - both prefill and decode are aligned

  5. Full attention (non-sliding-window) matches - the 80% full-attention layers in the KDA model and the 50% in the GDN model work correctly

Root Cause Analysis

Sliding Window Attention:

  • Diverges at decode position 1 in bfloat16, regardless of kernel flags
  • Root cause: Different FlashAttention KV cache implementations
    • vLLM: Paged KV cache with block tables
    • Transformers: Contiguous KV cache
  • Matches in float32 (eager attention)

GDN:

  • With upstream FLA: Prefill has higher diff (0.087) but decode matches (0.106)
  • With vLLM kernels: Prefill is tighter (0.0001) but decode diverges (3.62)
  • Root cause: vLLM's fused_recurrent_gated_delta_rule expects vLLM-specific state management that differs from how Transformers manages recurrent state

KDA:

  • Matches with both kernel configs
  • Both chunk and recurrent kernels are aligned between implementations

Implications

  • For likelihood-based evaluation (MMLU): All models are reliable, because these evaluations use prefill only
  • For generative evaluation (GSM8K):
    • KDA models: Reliable across backends with either kernel config
    • GDN models: Use upstream FLA kernels (USE_VLLM_GDN_OPS=False) for backend consistency
    • Sliding-window attention: Will produce different outputs between vLLM and Transformers in bf16

Recommended Configuration

For maximum vLLM/Transformers alignment in generative tasks:

USE_VLLM_CONV = False      # Use upstream causal_conv1d
USE_VLLM_GDN_OPS = False   # Use upstream FLA (critical for GDN decode!)
USE_VLLM_GATED_NORM = False

Test plan

  • Tested revision parameter with different apriel2-0.5b-dev commits (308c3687 vs 98ed07b3)
  • Verified truncated completions are saved with truncated=True and correct=False
  • Ran full GSM8K evaluations comparing old/new model revisions with GDN/KDA at 25% allocation
  • Ran MMLU likelihood evaluations with the new experiment configs
  • Verified vLLM vs Transformers alignment with both upstream and vLLM kernel configurations

🤖 Generated with Claude Code

tscholak requested a review from jlamypoirier on October 21, 2024 17:54
tscholak merged commit 0b52339 into main on Oct 22, 2024
tscholak deleted the tscholak/add-feature-request-template branch on October 22, 2024 13:45
tscholak modified the milestone: 0.2.0 on Oct 25, 2024
jlamypoirier added a commit that referenced this pull request May 6, 2026
- Drop unused self._preprocessing_config store in Trainer.setup.
- Replace torch.ones + index_add_ with torch.bincount for tok_sum
  in fused_gspo_loss_forward_backward.
- Drop load-bearing-sounding docs_per_step reference from the
  normalize_by_documents field description (no cross-config check
  exists to enforce it).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>