| name | reduce-model |
|---|---|
| description | Create a reduced (fewer-layer) version of a large HuggingFace model for fast testing and debugging. Use this skill whenever the user wants to reduce a model's layer count, create a small test model, download only part of a model's weights, debug quantization on a smaller model, speed up the testing-verification loop for large LLMs, or create a lightweight version of any HF model for rapid iteration. Trigger on phrases like "reduce model", "small test model", "fewer layers", "download partial model", "debug with smaller model", "shrink model for testing", even if the user doesn't use the exact term "reduce". |
Create a reduced (fewer-layer) version of any HuggingFace model for fast end-to-end testing. The reduced model produces meaningless text but proves the full pipeline (download → load → generate) works with no runtime errors.
This is invaluable when working with large LLMs (hundreds of GB) where the testing-verification loop is painfully slow. Instead of downloading and loading the full model, you work with a 4-layer version that's a fraction of the size.
## When to use

- User wants to test quantization workflows without waiting for the full model
- User needs a quick sanity check that a model loads and generates correctly
- User wants to reduce download size for development/debugging
- User mentions models like DeepSeek-R1, Qwen3, Llama, Mistral and needs a smaller version
The process has 6 phases. Each phase builds on the previous one — don't skip ahead.
## Phase 1: Gather information

Before writing any code, you need to understand the model and the environment.
From the user, confirm:
- The HuggingFace model ID (e.g., `Qwen/Qwen3-30B-A3B`)
- How many layers to keep (default: 4, which is enough for a smoke test)
- Where to save the reduced model (ask about available disk — models can be several GB even reduced)
- Whether to test on CPU or GPU (CPU is simpler and usually sufficient)
From HuggingFace, fetch:
- The model card — to understand the architecture, any special loading requirements (`trust_remote_code`), and the minimum `transformers` version
- `config.json` — to find `num_hidden_layers`, `model_type`, and architecture-specific fields. See `references/model-patterns.md` for common patterns across model families
- `model.safetensors.index.json` — to map which layers live in which shard files. This is how you figure out what to download
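To eyeball these two files before scripting anything, here is a minimal sketch using `huggingface_hub` (the model ID is just an example; substitute the one the user gave you):

```python
import json
from huggingface_hub import hf_hub_download

model_id = "Qwen/Qwen3-30B-A3B"  # example; use the user's model ID

# Both files are tiny compared to the weights.
config_path = hf_hub_download(model_id, "config.json")
index_path = hf_hub_download(model_id, "model.safetensors.index.json")

with open(config_path) as f:
    config = json.load(f)
print(config["model_type"], config["num_hidden_layers"])

with open(index_path) as f:
    index = json.load(f)
shards = sorted(set(index["weight_map"].values()))
print(f"{len(index['weight_map'])} tensors across {len(shards)} shard files")
```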
From the local environment, check:
- Disk space at the target directory (`df -h`)
- Whether the `hf` CLI is available (`which hf`)
- Whether a Python venv exists or needs to be created
- GPU availability if the user wants GPU testing (`nvidia-smi`)
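If you'd rather script these checks than run them one by one, a quick sketch (the target path is a placeholder):

```python
import shutil
import subprocess

target = "/path/to/output"  # placeholder for the output directory

# Disk space (same information as `df -h` for that mount)
free_gb = shutil.disk_usage(target).free / 1e9
print(f"free disk at {target}: {free_gb:.1f} GB")

# Is the hf CLI on PATH? (same as `which hf`)
print("hf CLI:", shutil.which("hf") or "NOT FOUND")

# GPU visibility (same as `nvidia-smi -L`)
try:
    out = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    print(out.stdout.strip() or "no NVIDIA GPU visible")
except FileNotFoundError:
    print("nvidia-smi not installed")
```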
## Phase 2: Set up the environment

The reduced model needs a working Python environment with the right packages. Use `uv` if available (faster), otherwise `python -m venv`.
```bash
uv venv .venv
source .venv/bin/activate
uv pip install "transformers>=<version-from-model-card>" torch safetensors accelerate
```

The `accelerate` package is easy to forget but required whenever you use `device_map` in `from_pretrained`. Install it upfront to avoid a confusing `ValueError` later.
The minimum `transformers` version matters — newer model architectures (like `qwen3_moe`) need a recent `transformers` or you'll get a `KeyError` on the model type. Check the model card.
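A one-off check worth running inside the venv (the minimum version here is a placeholder; read the real one off the model card):

```python
# Placeholder minimum; substitute the version the model card states.
from packaging.version import Version  # packaging is already a transformers dependency
import transformers

required = Version("4.51.0")
installed = Version(transformers.__version__)
assert installed >= required, (
    f"transformers {installed} is older than the model's minimum {required}"
)
```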
## Phase 3: Download metadata

Download just the config, tokenizer, and index files — not the multi-GB weight files yet. This is the two-phase approach: get the metadata first, patch it, then download only the shards you actually need.
```bash
hf download <MODEL_ID> \
  --local-dir <OUTPUT_DIR> \
  --include "*.json" --include "*.txt" --include "*.model" --include "*.tiktoken"
```

For models that use `trust_remote_code=True` (like DeepSeek-R1), also download `*.py` files since they contain custom model code:
```bash
hf download <MODEL_ID> \
  --local-dir <OUTPUT_DIR> \
  --include "*.json" --include "*.txt" --include "*.model" --include "*.tiktoken" --include "*.py"
```

**Important gotcha:** each glob pattern needs its own `--include` flag. Writing `--include "*.json" "*.txt"` silently downloads the wrong files. Always use `--include "*.json" --include "*.txt" --include ...`.
This typically downloads ~10-20MB: `config.json`, `generation_config.json`, `model.safetensors.index.json`, tokenizer files, and vocabulary files.
## Phase 4: Patch the config and index

Write a Python script (`patch_model.py`) that modifies two files in place:
In `config.json`:

- Set `num_hidden_layers` to the target layer count (e.g., 48 → 4).
- If `max_window_layers` is present, cap its value to be no more than the new `num_hidden_layers`.
- Leave everything else unchanged — MoE config, attention heads, vocabulary size, etc. describe individual layers or the model as a whole and don't depend on the total layer count.
In `model.safetensors.index.json`:

- Parse the `weight_map` dictionary
- Keep entries matching `model.embed_tokens.*`, `model.layers.{0..N-1}.*`, `model.norm.*`, and `lm_head.*`
- Remove all entries for layers ≥ N
- Leave `metadata.total_size` as-is (it's informational only; transformers ignores it)
- Don't rename shard files — transformers resolves files by the `weight_map`, not by filename numbering
The script should print the list of unique .safetensors filenames still referenced to stdout. The shell script captures this to know what to download next.
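A minimal sketch of what `patch_model.py` could look like (the CLI flags match the orchestration script shown later; the regex assumes the standard `model.layers.<i>.` naming — see `references/model-patterns.md` for exceptions):

```python
import argparse
import json
import re
from pathlib import Path

parser = argparse.ArgumentParser()
parser.add_argument("--model-dir", required=True)
parser.add_argument("--num-layers", type=int, default=4)
args = parser.parse_args()
model_dir = Path(args.model_dir)
n = args.num_layers

# Patch config.json: reduce the layer count, cap max_window_layers if present.
config_path = model_dir / "config.json"
config = json.loads(config_path.read_text())
config["num_hidden_layers"] = n
if "max_window_layers" in config:
    config["max_window_layers"] = min(config["max_window_layers"], n)
config_path.write_text(json.dumps(config, indent=2))

# Patch the index: drop weight_map entries for layers >= n.
index_path = model_dir / "model.safetensors.index.json"
index = json.loads(index_path.read_text())
layer_re = re.compile(r"^model\.layers\.(\d+)\.")

def keep(name: str) -> bool:
    m = layer_re.match(name)
    # embed_tokens, norm, and lm_head have no layer index, so they're always kept
    return m is None or int(m.group(1)) < n

index["weight_map"] = {k: v for k, v in index["weight_map"].items() if keep(k)}
index_path.write_text(json.dumps(index, indent=2))

# Print the unique shard files still referenced (captured by the shell script).
for shard in sorted(set(index["weight_map"].values())):
    print(shard)
```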
Example output for Qwen3-30B-A3B with 4 layers:

```
model-00001-of-00016.safetensors   # layers 0-1 + embeddings
model-00002-of-00016.safetensors   # layers 2-3
model-00016-of-00016.safetensors   # lm_head + model.norm
```
## Phase 5: Download the needed shards

Using the shard list from Phase 4, download each file individually:
```bash
for shard in $SHARDS; do
  hf download <MODEL_ID> --local-dir <OUTPUT_DIR> --include "$shard"
done
```

This is where the big savings happen. For Qwen3-30B-A3B, this downloads ~8.7GB instead of ~60GB. The ratio depends on the model — models with more experts per layer have larger per-shard files.
## Phase 6: Smoke test

Write a Python script (`test_generation.py`, sketched below) that:
- Loads the tokenizer from the reduced model directory
- Loads the model with `torch_dtype="auto"`, `device_map="cpu"` (or `"auto"` for GPU), and `trust_remote_code=True` if the model card says so
- Prepares a simple prompt like `"The capital of France is"` using `apply_chat_template`
- Generates a small number of tokens (20 is enough)
- Prints the output text and a `SUCCESS` marker
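A minimal sketch, assuming a chat-style model (the flag name mirrors `patch_model.py` above):

```python
import argparse
from transformers import AutoModelForCausalLM, AutoTokenizer

parser = argparse.ArgumentParser()
parser.add_argument("--model-dir", required=True)
args = parser.parse_args()

tokenizer = AutoTokenizer.from_pretrained(args.model_dir)
model = AutoModelForCausalLM.from_pretrained(
    args.model_dir,
    torch_dtype="auto",
    device_map="cpu",        # or "auto" for GPU
    trust_remote_code=True,  # only if the model card requires it
)

messages = [{"role": "user", "content": "The capital of France is"}]
# For Qwen3-style models you can also pass enable_thinking=False here.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=20)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
print("SUCCESS")
```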
The output text will be meaningless from a 4-layer model — that's expected. The point is proving there are no runtime errors: no missing weights, no shape mismatches, no import failures.
Model-specific considerations for the smoke test:
- Some models support `enable_thinking` (like Qwen3) — set it to `False` for simpler output
- Some models need `trust_remote_code=True` — always check the model card
- "UNEXPECTED key" warnings during loading are normal and harmless — they come from shard files that contain weights for layers beyond your reduced set
## Orchestration

Tie it all together with a shell script (`reduce.sh`) that runs phases 3-6 in sequence:
```bash
#!/bin/bash
set -euo pipefail

MODEL_ID="<model-id>"
OUTPUT_DIR="<output-dir>"
NUM_LAYERS=4
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"

# Activate venv
source "$SCRIPT_DIR/.venv/bin/activate"

# Phase 3: Download metadata
hf download "$MODEL_ID" --local-dir "$OUTPUT_DIR" \
  --include "*.json" --include "*.txt" --include "*.model" --include "*.tiktoken"

# Phase 4: Patch
SHARDS=$(python "$SCRIPT_DIR/patch_model.py" --model-dir "$OUTPUT_DIR" --num-layers "$NUM_LAYERS")

# Phase 5: Download shards
INCLUDE_ARGS=()
for shard in $SHARDS; do
  INCLUDE_ARGS+=(--include "$shard")
done
hf download "$MODEL_ID" --local-dir "$OUTPUT_DIR" "${INCLUDE_ARGS[@]}"

# Phase 6: Smoke test
python "$SCRIPT_DIR/test_generation.py" --model-dir "$OUTPUT_DIR"
```

After the workflow completes, verify:
- Only the expected shard files are in the output directory (typically 2-4 files)
- `config.json` shows the reduced `num_hidden_layers` value
- `model.safetensors.index.json` has no entries for layers ≥ N
- The smoke test prints generated text and `SUCCESS`
- "UNEXPECTED key" warnings (if any) are for layers beyond the reduced set — these are harmless
## Troubleshooting

| Problem | Cause | Fix |
|---|---|---|
| `hf download` only fetches 1 file | Glob patterns space-separated after a single `--include` | Use a separate `--include` flag for each pattern |
| `ValueError: device_map requires accelerate` | `accelerate` package missing | `pip install accelerate` |
| `KeyError: '<model_type>'` | `transformers` version too old for this model | Upgrade to the version specified on the model card |
| `ImportError: cannot import name 'is_flash_attn_...'` | Model's custom code (e.g., DeepSeek-R1) incompatible with transformers 5.x | Use transformers 4.x (`pip install "transformers>=4.46,<5.0"`). Also download `*.py` files in Phase 3 for models with `trust_remote_code=True` |
| Shape mismatch errors during loading | Kept wrong layers or missed embedding/norm layers | Check `weight_map` filtering — must keep `embed_tokens`, `norm`, `lm_head` |
| Model generates nothing / hangs | Too few layers for the generation config | Set `max_new_tokens=20` explicitly; don't rely on model defaults |
## References

- `references/model-patterns.md` — Architecture patterns for common model families (dense vs MoE, layer naming conventions, which config fields to check). Read this when working with an unfamiliar model architecture.