
📄 简体中文 | ✨ New Project: AI-Enhancement-Filter (powered by onnx-tool)


Python 3.6+ PyPI Version License

onnx-tool

A comprehensive toolkit for analyzing, optimizing, and transforming ONNX models with advanced capabilities for LLMs, diffusion models, and computer vision architectures.

  • LLM Optimization: Build and profile large language models with KV cache analysis (example)
  • Graph Transformation:
    • Constant folding (docs)
    • Operator fusion (docs)
  • Advanced Profiling:
    • Rapid shape inference
    • MACs/parameter statistics with sparsity awareness
  • Compute Graph Engine: Runtime shape computation with minimal overhead (details)
  • Memory Compression:
    • Activation memory optimization (up to 95% reduction)
    • Weight quantization (FP16, INT8/INT4 with per-tensor/channel/block schemes)
  • Quantization & Sparsity: Full support for quantized and sparse model analysis

🤖 Supported Model Architectures

| Domain | Models |
|---|---|
| NLP | BERT, T5, GPT, LLaMa, MPT (TransformerModel) |
| Diffusion | Stable Diffusion (TextEncoder, VAE, UNet) |
| CV | Detic, BEVFormer, SSD300_VGG16, ConvNeXt, Mask R-CNN, Silero VAD |
| Audio | Sovits, LPCNet |

⚡ Build & Profile LLMs in Seconds

Profile 10 Hugging Face models in under one second. Export ONNX models with llama.cpp-like simplicity (code).

Model Statistics (1k token input)

| Model (1k token input) | MACs(G) | Parameters(G) | KV Cache(G) |
|---|---|---|---|
| gpt-j-6b | 6277 | 6.05049 | 0.234881 |
| yi-1.5-34B | 35862 | 34.3889 | 0.125829 |
| microsoft/phi-2 | 2948 | 2.77944 | 0.167772 |
| Phi-3-mini-4k | 4083 | 3.82108 | 0.201327 |
| Phi-3-small-8k-instruct | 7912 | 7.80167 | 0.0671089 |
| Phi-3-medium-4k-instruct | 14665 | 13.9602 | 0.104858 |
| Llama3-8B | 8029 | 8.03026 | 0.0671089 |
| Llama-3.1-70B-Japanese-Instruct-2407 | 72888 | 70.5537 | 0.167772 |
| QWen-7B | 7509 | 7.61562 | 0.0293601 |
| Qwen2_72B_Instruct | 74895 | 72.7062 | 0.167772 |
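The KV Cache(G) column counts elements (in billions), not bytes, and can be sanity-checked by hand. A minimal sketch, assuming the published Llama3-8B configuration (32 layers, 8 KV heads via grouped-query attention, head dimension 128):

```python
# Back-of-envelope check of the KV Cache(G) column above.
def kv_cache_elements(n_layers, n_kv_heads, head_dim, seq_len):
    # K and V each store (n_kv_heads * head_dim) values per token per layer
    return 2 * n_layers * n_kv_heads * head_dim * seq_len

elems = kv_cache_elements(32, 8, 128, 1024)
print(elems / 1e9)  # 0.067108864 G elements, matching the Llama3-8B row
```

Multiply by the element width (e.g. 2 bytes for FP16) to convert to a memory footprint.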

Latency Estimation (4-bit weights, 16-bit KV cache)

| Model | Memory Size(GB) | Ultra-155H TTFT(s) | Ultra-155H TPOT(s) | Arc-A770 TTFT(s) | Arc-A770 TPOT(s) | H100-PCIe TTFT(s) | H100-PCIe TPOT(s) |
|---|---|---|---|---|---|---|---|
| gpt-j-6b | 3.75678 | 1.0947 | 0.041742 | 0.0916882 | 0.00670853 | 0.0164015 | 0.00187839 |
| yi-1.5-34B | 19.3369 | 5.77095 | 0.214854 | 0.45344 | 0.0345302 | 0.0747854 | 0.00966844 |
| microsoft/phi-2 | 1.82485 | 0.58361 | 0.0202761 | 0.0529628 | 0.00325866 | 0.010338 | 0.000912425 |
| Phi-3-mini-4k | 2.49649 | 0.811173 | 0.0277388 | 0.0745356 | 0.00445802 | 0.0147274 | 0.00124825 |
| Phi-3-small-8k-instruct | 4.2913 | 1.38985 | 0.0476811 | 0.117512 | 0.00766303 | 0.0212535 | 0.00214565 |
| Phi-3-medium-4k-instruct | 7.96977 | 2.4463 | 0.088553 | 0.198249 | 0.0142317 | 0.0340576 | 0.00398489 |
| Llama3-8B | 4.35559 | 1.4354 | 0.0483954 | 0.123333 | 0.00777784 | 0.0227182 | 0.00217779 |
| Llama-3.1-70B-Japanese-Instruct-2407 | 39.4303 | 11.3541 | 0.438114 | 0.868475 | 0.0704112 | 0.137901 | 0.0197151 |
| QWen-7B | 4.03576 | 1.34983 | 0.0448417 | 0.11722 | 0.00720671 | 0.0218461 | 0.00201788 |
| Qwen2_72B_Instruct | 40.5309 | 11.6534 | 0.450343 | 0.890816 | 0.0723766 | 0.14132 | 0.0202654 |

💡 Latencies computed from hardware specs – no actual inference required
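The TPOT figures are consistent with a simple roofline-style estimate: token generation is memory-bandwidth bound, so time per output token is roughly model memory size divided by peak memory bandwidth. A sketch under that assumption (the bandwidth figures are approximate published specs; the tool's exact method may differ):

```python
# Roofline-style decode latency estimate: no inference required.
BANDWIDTH_GBPS = {          # approximate peak memory bandwidth per device
    'Ultra-155H': 90,
    'Arc-A770': 560,
    'H100-PCIe': 2000,
}

def estimate_tpot(memory_size_gb, device):
    # time per output token ~= bytes read per token / bandwidth
    return memory_size_gb / BANDWIDTH_GBPS[device]

# gpt-j-6b at 4-bit weights occupies ~3.75678 GB:
print(estimate_tpot(3.75678, 'H100-PCIe'))  # ~0.00187839 s, as in the table
```

TTFT is compute-bound instead, scaling with prefill MACs divided by peak throughput.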


🔧 Basic Parsing & Editing

Intuitive API for model manipulation:

```python
from onnx_tool import Model

model = Model('model.onnx')          # Load any ONNX file
graph = model.graph                  # Access computation graph
node = graph.nodemap['Conv_0']       # Modify operator attributes
tensor = graph.tensormap['weight']   # Edit tensor data/types
model.save_model('modified.onnx')    # Persist changes
```

See comprehensive examples in benchmark/examples.py.


📊 Shape Inference & Profiling

All profiling relies on precise shape inference:

*Figure: shape inference visualization*

Profiling Capabilities

  • Standard profiling: MACs, parameters, memory footprint
  • Sparse-aware profiling: Quantify sparsity impact on compute

*Figures: MACs profiling table; sparse model profiling table*
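As a conceptual illustration of sparse-aware profiling (not onnx-tool's actual API): with block-structured weight sparsity, multiplications against zero blocks can be skipped, so effective MACs scale with the non-zero fraction:

```python
# Conceptual sketch: sparsity-aware MAC counting.
def effective_macs(dense_macs, zero_ratio):
    """Effective compute after skipping zero-valued weight blocks."""
    return dense_macs * (1.0 - zero_ratio)

# A layer with 1e9 dense MACs and 50% zero blocks:
print(effective_macs(1e9, 0.5))  # 5e8 effective MACs
```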



⚙️ Compute Graph & Shape Engine

Transform exported ONNX graphs into efficient Compute Graphs by removing shape-calculation overhead:

*Figure: compute graph transformation*

  • Compute Graph: Minimal graph containing only compute operations
  • Shape Engine: Runtime shape resolver for dynamic models

Use Cases:

  • Integration with custom inference engines (guide)
  • Shape regression testing (example)
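Conceptually, a shape engine records each tensor's shape symbolically and resolves a handful of free variables (e.g. batch size, sequence length) at runtime, with no ONNX shape-computation ops executed. A minimal sketch with hypothetical tensor names (this is an illustration of the idea, not onnx-tool's API):

```python
# Pre-recorded symbolic shapes, as a shape engine might store them.
symbolic_shapes = {
    'input_ids': ('batch', 'seq'),
    'hidden':    ('batch', 'seq', 4096),
    'logits':    ('batch', 'seq', 32000),
}

def resolve(shapes, **variables):
    # Substitute runtime values for symbolic dims; fixed dims pass through.
    return {name: tuple(variables.get(d, d) for d in dims)
            for name, dims in shapes.items()}

print(resolve(symbolic_shapes, batch=1, seq=1024)['hidden'])  # (1, 1024, 4096)
```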

💾 Memory Compression

Activation Memory Compression

Reuses temporary buffers to minimize peak memory usage – critical for LLMs and high-res CV models.

| Model | Native Memory Size(MB) | Compressed Memory Size(MB) | Compressed / Native (%) |
|---|---|---|---|
| StableDiffusion (VAE_encoder) | 14,245 | 540 | 3.7 |
| StableDiffusion (VAE_decoder) | 25,417 | 1,140 | 4.48 |
| StableDiffusion (Text_encoder) | 215 | 5 | 2.5 |
| StableDiffusion (UNet) | 36,135 | 2,232 | 6.2 |
| GPT2 | 40 | 2 | 6.9 |
| BERT | 2,170 | 27 | 1.25 |

✅ Typical models achieve >90% activation memory reduction
📌 Implementation: benchmark/compression.py
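The idea behind activation compression can be sketched as follows (a simplified model, not the tool's actual allocator): a tensor's buffer can be recycled once no later node reads it, so peak memory is driven by simultaneously live tensors, not by their total sum:

```python
# Lifetime-based peak-memory estimate for activation buffers.
def peak_memory(tensors):
    """tensors: list of (size, first_use_step, last_use_step)."""
    peak = 0
    steps = {s for _, a, b in tensors for s in (a, b)}
    for step in steps:
        # Sum the sizes of all tensors live at this step.
        live = sum(size for size, a, b in tensors if a <= step <= b)
        peak = max(peak, live)
    return peak

# Four 100 MB activations used strictly one after another:
acts = [(100, 0, 1), (100, 1, 2), (100, 2, 3), (100, 3, 4)]
print(sum(s for s, _, _ in acts), peak_memory(acts))  # 400 vs 200
```

With reuse, only adjacent producer/consumer pairs are live together, so 200 MB of buffers suffice where naive allocation would need 400 MB.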

Weight Compression

Essential for deploying large models on memory-constrained devices:

| Quantization Scheme | Size vs FP32 | Example (7B model) |
|---|---|---|
| FP32 (baseline) | 1.00× | 28 GB |
| FP16 | 0.50× | 14 GB |
| INT8 (per-channel) | 0.25× | 7 GB |
| INT4 (block=32, symmetric), llama.cpp-style | 0.156× | 4.4 GB |

Supported schemes:

  • ✅ FP16
  • ✅ INT8: symmetric/asymmetric × per-tensor/channel/block
  • ✅ INT4: symmetric/asymmetric × per-tensor/channel/block

📌 See benchmark/examples.py for implementation examples.
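To illustrate how block-wise symmetric quantization works (a simplified sketch, not onnx-tool's implementation): each block of 32 weights shares one floating-point scale, and weights are rounded to small integers:

```python
# Illustrative symmetric INT4 block quantization (block size 32).
def quantize_block(block):
    """Map a block of floats to int values in [-7, 7] plus one scale."""
    scale = max(abs(v) for v in block) / 7 or 1.0  # avoid zero scale
    q = [round(v / scale) for v in block]
    return q, scale

def dequantize_block(q, scale):
    return [v * scale for v in q]

block = [0.1 * i for i in range(-16, 16)]      # 32 example weights
q, s = quantize_block(block)
restored = dequantize_block(q, s)
err = max(abs(a - b) for a, b in zip(block, restored))
print(all(-7 <= v <= 7 for v in q))  # True; worst-case error is ~scale/2
```

Storage per weight is then 4 bits plus the amortized scale, which is where the ~0.156× size in the table above comes from.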


🚀 Installation

```shell
# PyPI (recommended)
pip install onnx-tool

# Latest development version
pip install --upgrade git+https://github.com/ThanatosShinji/onnx-tool.git
```

Requirements: Python ≥ 3.6

⚠️ Troubleshooting: If ONNX installation fails, try:

```shell
pip install onnx==1.8.1 && pip install onnx-tool
```

Known Issues

  • The Loop operator is not supported
  • Sequence tensor types are not supported

📈 Model Zoo Results

Comprehensive profiling of ONNX Model Zoo and SOTA models. Input shapes defined in data/public/config.py.

📥 Download pre-profiled models (with full tensor shapes):

| Model | Params(M) | MACs(M) |
|---|---|---|
| GPT-J 1 layer | 464 | 173,398 |
| MPT 1 layer | 261 | 79,894 |
| LLaMa 1 layer | 618 | 211,801 |
| text_encoder | 123.13 | 6,782 |
| UNet2DCondition | 859.52 | 888,870 |
| VAE_encoder | 34.16 | 566,371 |
| VAE_decoder | 49.49 | 1,271,959 |
| SqueezeNet 1.0 | 1.23 | 351 |
| AlexNet | 60.96 | 665 |
| GoogleNet | 6.99 | 1,606 |
| googlenet_age | 5.98 | 1,605 |
| LResNet100E-IR | 65.22 | 12,102 |
| BERT-Squad | 113.61 | 22,767 |
| BiDAF | 18.08 | 9.87 |
| EfficientNet-Lite4 | 12.96 | 1,361 |
| Emotion | 12.95 | 877 |
| Mask R-CNN | 46.77 | 92,077 |
| BEVFormer Tiny | 33.7 | 210,838 |
| rvm_mobilenetv3 | 3.73 | 4,289 |
| yolov4 | 64.33 | 3,319 |
| ConvNeXt-L | 229.79 | 34,872 |
| edgenext_small | 5.58 | 1,357 |
| SSD | 19.98 | 216,598 |
| RealESRGAN | 16.69 | 73,551 |
| ShuffleNet | 2.29 | 146 |
| GPT-2 | 137.02 | 1,103 |
| T5-encoder | 109.62 | 686 |
| T5-decoder | 162.62 | 1,113 |
| RoBERTa-BASE | 124.64 | 688 |
| Faster R-CNN | 44.10 | 46,018 |
| FCN ResNet-50 | 35.29 | 37,056 |
| ResNet50 | 25 | 3,868 |

🤝 Contributing

Contributions are welcome! Please open an issue or PR for:

  • Bug reports
  • Feature requests
  • Documentation improvements
  • New model support