Activation/feature-level distillation #385

@RaymondLi0

🎯 Goal (What & Why)

Add activation-level (feature-level) distillation, which usually yields better student performance than logit-level distillation alone.

🚀 Execution Plan

Step 1: What is the smallest working version?

  • Distill based on all mixer-layer outputs
  • Support the case where the student has the same number of layers as the teacher.
  • Use MSE loss.
  • Add a single coefficient that balances feature-level distillation vs logit-level.

As a first version:
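$$\mathcal{L} = \mathcal{L}_{\text{logit}} + \lambda \mathcal{L}_{\text{activation}}$$
with:
$$\mathcal{L}_{\text{activation}} = \frac{1}{N} \sum_{i=1}^{N} \| T_i(\mathbf{x}) - S_i(\mathbf{x}) \|_2^2$$
where $N$ is the number of distilled layers, and $T_i(\mathbf{x})$ and $S_i(\mathbf{x})$ denote the outputs of the $i$-th layer's mixer in the teacher and student models, respectively. The squared norm matches the MSE loss up to normalization by the number of elements.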
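
As a rough illustration of the first version, the sketch below computes a per-layer MSE, averages it over layers, and adds it to the logit loss with a single coefficient. It uses plain Python lists in place of tensors to stay dependency-free. The function names (`layer_mse`, `activation_loss`, `total_loss`) and the numbers are hypothetical and not part of Fast-LLM.

```python
from typing import Sequence


def layer_mse(teacher: Sequence[float], student: Sequence[float]) -> float:
    """Mean squared error between one teacher and one student activation vector."""
    if len(teacher) != len(student):
        raise ValueError("teacher and student activations must have the same size")
    return sum((t - s) ** 2 for t, s in zip(teacher, student)) / len(teacher)


def activation_loss(
    teacher_acts: Sequence[Sequence[float]],
    student_acts: Sequence[Sequence[float]],
) -> float:
    """Average the per-layer MSE over all layers (same depth assumed)."""
    if len(teacher_acts) != len(student_acts):
        raise ValueError("this sketch assumes the same number of layers")
    n_layers = len(teacher_acts)
    return sum(layer_mse(t, s) for t, s in zip(teacher_acts, student_acts)) / n_layers


def total_loss(logit_loss: float, act_loss: float, lam: float) -> float:
    """Combine both terms: logit loss + lam * activation loss."""
    return logit_loss + lam * act_loss


# Two layers with two activations each (made-up numbers).
teacher_acts = [[1.0, 2.0], [0.0, 4.0]]
student_acts = [[1.0, 0.0], [0.0, 2.0]]
act = activation_loss(teacher_acts, student_acts)
loss = total_loss(logit_loss=0.5, act_loss=act, lam=0.1)
```

Setting `lam` to zero recovers pure logit-level distillation, which makes the coefficient a convenient switch when comparing the two setups.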

Details:

  • The teacher stores its intermediate activations in `kwargs`.
  • The student reads those activations from `kwargs`, computes the activation-distillation losses, and stores them in `losses`.
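
The data flow described in these two items might look roughly like the sketch below. It is only an assumption about the shape of the mechanism: the key format `teacher_mixer_{i}`, the function names, and the use of precomputed layer outputs instead of real forward passes are all hypothetical.

```python
from typing import Any, Dict, List, Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally sized vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def teacher_forward(
    layer_outputs: Sequence[Sequence[float]], kwargs: Dict[str, Any]
) -> None:
    """Teacher side: record each mixer output under a per-layer key (hypothetical naming)."""
    for i, out in enumerate(layer_outputs):
        kwargs[f"teacher_mixer_{i}"] = out


def student_forward(
    layer_outputs: Sequence[Sequence[float]],
    kwargs: Dict[str, Any],
    losses: List[float],
) -> None:
    """Student side: compare each mixer output to the stored teacher activation."""
    for i, out in enumerate(layer_outputs):
        target = kwargs[f"teacher_mixer_{i}"]
        losses.append(mse(target, out))


# Shared containers passed through both forward passes.
kwargs: Dict[str, Any] = {}
losses: List[float] = []

teacher_forward([[1.0, 3.0], [2.0, 2.0]], kwargs)
student_forward([[1.0, 1.0], [2.0, 2.0]], kwargs, losses)
```

Running the teacher first is essential here: the student looks up keys that only exist once the teacher has populated `kwargs`.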

Step 2: What additional optimizations are possible (but optional)?

  • Support tensor parallelism (TP) with sequence parallelism. This is required, not optional, but can be done in a second step.
  • Make the distilled layers configurable. For example, distill only from mixer-layer outputs, or also from MLP outputs. Pass a {student -> teacher} mapping of layer names to use for distillation.
  • Make the loss configurable: MSE, cosine, others?
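
One possible shape for the configurable variant is sketched below: a {student -> teacher} mapping of layer names selects which activations are compared, and a string selects the loss. The layer names (`layers.0.mixer`, etc.), the `LOSSES` registry, and the function names are assumptions made for illustration, not existing Fast-LLM configuration.

```python
import math
from typing import Callable, Dict, Sequence

Vector = Sequence[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equally sized vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cosine_loss(a: Vector, b: Vector) -> float:
    """One minus cosine similarity; zero when the vectors point the same way."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine loss is undefined for zero vectors")
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical registry of selectable losses.
LOSSES: Dict[str, Callable[[Vector, Vector], float]] = {
    "mse": mse,
    "cosine": cosine_loss,
}


def distillation_loss(
    student_acts: Dict[str, Vector],
    teacher_acts: Dict[str, Vector],
    layer_map: Dict[str, str],
    loss_name: str = "mse",
) -> float:
    """Average the chosen loss over all (student layer, teacher layer) pairs."""
    loss_fn = LOSSES[loss_name]
    terms = [
        loss_fn(teacher_acts[t_name], student_acts[s_name])
        for s_name, t_name in layer_map.items()
    ]
    return sum(terms) / len(terms)


# A 2-layer student distilled from layers 1 and 3 of a deeper teacher.
teacher_acts = {"layers.1.mixer": [3.0, 4.0], "layers.3.mixer": [1.0, 0.0]}
student_acts = {"layers.0.mixer": [3.0, 4.0], "layers.1.mixer": [0.0, 1.0]}
layer_map = {"layers.0.mixer": "layers.1.mixer", "layers.1.mixer": "layers.3.mixer"}

mse_value = distillation_loss(student_acts, teacher_acts, layer_map, "mse")
cos_value = distillation_loss(student_acts, teacher_acts, layer_map, "cosine")
```

An explicit name mapping also covers students with fewer layers than the teacher, since each student layer simply points at whichever teacher layer it should imitate.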

📌 Acceptance Criteria (Must-Haves for Completion)

  • The feature must be functional and tested.
  • The implementation must be documented in practical terms.
  • The PR must include a performance/impact summary.
  • No refactors unless directly necessary for feature completion.

🛠️ Project Management

  • Assign the project to the Fast-LLM project.
  • Set the Estimate field (in days) in the GitHub project.
  • Use the Size field to categorize the PR size (Small/Medium/Large).
  • Assign an owner when opening the issue.

Labels

enhancement (New feature or request)
