📄 Simplified Chinese | ✨ New Project: AI-Enhancement-Filter (powered by onnx-tool)
A comprehensive toolkit for analyzing, optimizing, and transforming ONNX models with advanced capabilities for LLMs, diffusion models, and computer vision architectures.
- LLM Optimization: Build and profile large language models with KV cache analysis (example)
- Graph Transformation:
- Advanced Profiling:
  - Rapid shape inference
  - MACs/parameter statistics with sparsity awareness
- Compute Graph Engine: Runtime shape computation with minimal overhead (details)
- Memory Compression:
  - Activation memory optimization (up to 95% reduction)
  - Weight quantization (FP16, INT8/INT4 with per-tensor/channel/block schemes)
- Quantization & Sparsity: Full support for quantized and sparse model analysis
| Domain | Models |
|---|---|
| NLP | BERT, T5, GPT, LLaMa, MPT (TransformerModel) |
| Diffusion | Stable Diffusion (TextEncoder, VAE, UNet) |
| CV | Detic, BEVFormer, SSD300_VGG16, ConvNeXt, Mask R-CNN, Silero VAD |
| Audio | Sovits, LPCNet |
Profile 10 Hugging Face models in under one second. Export ONNX models with llama.cpp-like simplicity (code).
| model name(1k input) | MACs(G) | Parameters(G) | KV Cache(G) |
|---|---|---|---|
| gpt-j-6b | 6277 | 6.05049 | 0.234881 |
| yi-1.5-34B | 35862 | 34.3889 | 0.125829 |
| microsoft/phi-2 | 2948 | 2.77944 | 0.167772 |
| Phi-3-mini-4k | 4083 | 3.82108 | 0.201327 |
| Phi-3-small-8k-instruct | 7912 | 7.80167 | 0.0671089 |
| Phi-3-medium-4k-instruct | 14665 | 13.9602 | 0.104858 |
| Llama3-8B | 8029 | 8.03026 | 0.0671089 |
| Llama-3.1-70B-Japanese-Instruct-2407 | 72888 | 70.5537 | 0.167772 |
| QWen-7B | 7509 | 7.61562 | 0.0293601 |
| Qwen2_72B_Instruct | 74895 | 72.7062 | 0.167772 |
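The KV Cache column appears to count cached key and value elements (in billions) for a 1024-token input. A minimal sketch of that count follows. The function name and the layer/head values are assumptions taken from the public model configurations, not from onnx-tool itself.

```python
def kv_cache_elements(num_layers, num_kv_heads, head_dim, num_tokens, batch_size=1):
    """Count cached key/value elements for a decoder-only transformer."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * batch_size * num_layers * num_kv_heads * head_dim * num_tokens


# Assumed configs (from public model cards, not from this README).
gptj_6b = kv_cache_elements(num_layers=28, num_kv_heads=16, head_dim=256, num_tokens=1024)
llama3_8b = kv_cache_elements(num_layers=32, num_kv_heads=8, head_dim=128, num_tokens=1024)

print(f"gpt-j-6b:  {gptj_6b / 1e9:.6f} G elements")
print(f"Llama3-8B: {llama3_8b / 1e9:.6f} G elements")
```

Under these assumptions the results agree with the gpt-j-6b and Llama3-8B rows above. Models using grouped-query attention (fewer KV heads than query heads) get a proportionally smaller cache, which is why Llama3-8B needs less cache than gpt-j-6b despite having more parameters.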
| model_type_4bit_kv16bit | memory_size(GB) | Ultra-155H_TTFT | Ultra-155H_TPOT | Arc-A770_TTFT | Arc-A770_TPOT | H100-PCIe_TTFT | H100-PCIe_TPOT |
|---|---|---|---|---|---|---|---|
| gpt-j-6b | 3.75678 | 1.0947 | 0.041742 | 0.0916882 | 0.00670853 | 0.0164015 | 0.00187839 |
| yi-1.5-34B | 19.3369 | 5.77095 | 0.214854 | 0.45344 | 0.0345302 | 0.0747854 | 0.00966844 |
| microsoft/phi-2 | 1.82485 | 0.58361 | 0.0202761 | 0.0529628 | 0.00325866 | 0.010338 | 0.000912425 |
| Phi-3-mini-4k | 2.49649 | 0.811173 | 0.0277388 | 0.0745356 | 0.00445802 | 0.0147274 | 0.00124825 |
| Phi-3-small-8k-instruct | 4.2913 | 1.38985 | 0.0476811 | 0.117512 | 0.00766303 | 0.0212535 | 0.00214565 |
| Phi-3-medium-4k-instruct | 7.96977 | 2.4463 | 0.088553 | 0.198249 | 0.0142317 | 0.0340576 | 0.00398489 |
| Llama3-8B | 4.35559 | 1.4354 | 0.0483954 | 0.123333 | 0.00777784 | 0.0227182 | 0.00217779 |
| Llama-3.1-70B-Japanese-Instruct-2407 | 39.4303 | 11.3541 | 0.438114 | 0.868475 | 0.0704112 | 0.137901 | 0.0197151 |
| QWen-7B | 4.03576 | 1.34983 | 0.0448417 | 0.11722 | 0.00720671 | 0.0218461 | 0.00201788 |
| Qwen2_72B_Instruct | 40.5309 | 11.6534 | 0.450343 | 0.890816 | 0.0723766 | 0.14132 | 0.0202654 |
💡 Latencies computed from hardware specs – no actual inference required
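As a rough illustration of how a latency can be derived from hardware specs alone, the TPOT columns are consistent with a memory-bound estimate: the model's memory size divided by the device's memory bandwidth. The sketch below is an assumption, not the tool's confirmed formula, and the bandwidth figures are inferred from the ratios in the table rather than stated in this README.

```python
# Assumed memory bandwidths in GB/s, inferred from the table (memory_size / TPOT).
ASSUMED_BANDWIDTH_GB_PER_S = {
    "Ultra-155H": 90.0,
    "Arc-A770": 560.0,
    "H100-PCIe": 2000.0,
}


def estimate_tpot_seconds(model_size_gb, bandwidth_gb_per_s):
    """Memory-bound decode estimate: every weight is read once per generated token."""
    return model_size_gb / bandwidth_gb_per_s


gptj_size_gb = 3.75678  # memory_size(GB) for gpt-j-6b from the table above
for device, bandwidth in ASSUMED_BANDWIDTH_GB_PER_S.items():
    print(f"{device}: {estimate_tpot_seconds(gptj_size_gb, bandwidth):.6f} s/token")
```

TTFT is dominated by compute rather than memory traffic, so a similar estimate for it would need the device's peak MAC throughput, which is not given here.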
Intuitive API for model manipulation:
from onnx_tool import Model
model = Model('model.onnx') # Load any ONNX file
graph = model.graph # Access computation graph
node = graph.nodemap['Conv_0'] # Modify operator attributes
tensor = graph.tensormap['weight'] # Edit tensor data/types
model.save_model('modified.onnx') # Persist changes
See comprehensive examples in benchmark/examples.py.
All profiling relies on precise shape inference:
- Standard profiling: MACs, parameters, memory footprint
- Sparse-aware profiling: Quantify sparsity impact on compute
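To make the difference between the two profiling modes concrete, here is a minimal sketch for a single fully connected layer. It is not onnx-tool's implementation: the function names are hypothetical, and it assumes that a multiply-accumulate with a zero weight can be skipped entirely.

```python
def dense_macs(weight, batch=1):
    """MACs for y = W @ x when every weight is multiplied."""
    out_features = len(weight)
    in_features = len(weight[0])
    return batch * out_features * in_features


def sparse_macs(weight, batch=1):
    """MACs when multiplications by zero weights are skipped."""
    nonzero = sum(1 for row in weight for w in row if w != 0)
    return batch * nonzero


# Toy 2x4 weight matrix with half of its entries pruned to zero.
weight = [
    [0.5, 0.0, -1.2, 0.0],
    [0.0, 0.3, 0.0, 2.0],
]
print(dense_macs(weight), sparse_macs(weight))
```

The ratio of the two counts is the layer's weight density, which is the quantity a sparsity-aware profile would report alongside the dense MAC total.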
📚 Learn more:
Transform exported ONNX graphs into efficient Compute Graphs by removing shape-calculation overhead:
- Compute Graph: Minimal graph containing only compute operations
- Shape Engine: Runtime shape resolver for dynamic models
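The idea behind a runtime shape resolver can be sketched as follows: shape formulas are extracted once, then evaluated cheaply for each new input size instead of re-running shape-calculation operators. Everything in this sketch (class name, method names, the formula format) is hypothetical and does not reflect onnx-tool's actual API.

```python
class ShapeResolver:
    """Hypothetical resolver: maps tensor names to shape formulas over input variables."""

    def __init__(self, formulas):
        # formulas: tensor name -> function taking a dict of variables, returning a shape tuple
        self.formulas = formulas

    def resolve(self, **variables):
        """Evaluate every stored formula for the given input dimensions."""
        return {name: formula(variables) for name, formula in self.formulas.items()}


def conv_out_dim(size, kernel, stride, pad):
    """Standard convolution output size along one spatial axis."""
    return (size + 2 * pad - kernel) // stride + 1


# Example: a 7x7, stride-2, pad-3 convolution with 64 output channels on NCHW input.
resolver = ShapeResolver({
    "input": lambda v: (v["n"], 3, v["h"], v["w"]),
    "conv_out": lambda v: (
        v["n"],
        64,
        conv_out_dim(v["h"], kernel=7, stride=2, pad=3),
        conv_out_dim(v["w"], kernel=7, stride=2, pad=3),
    ),
})
print(resolver.resolve(n=1, h=224, w=224))
```

With the formulas precomputed, changing the input resolution only costs a few integer operations per tensor.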
Use Cases:
Reuses temporary buffers to minimize peak memory usage – critical for LLMs and high-res CV models.
| model | Native Memory Size(MB) | Compressed Memory Size(MB) | Compression Ratio(%) |
|---|---|---|---|
| StableDiffusion(VAE_encoder) | 14,245 | 540 | 3.7 |
| StableDiffusion(VAE_decoder) | 25,417 | 1,140 | 4.48 |
| StableDiffusion(Text_encoder) | 215 | 5 | 2.5 |
| StableDiffusion(UNet) | 36,135 | 2,232 | 6.2 |
| GPT2 | 40 | 2 | 6.9 |
| BERT | 2,170 | 27 | 1.25 |
✅ Typical models achieve >90% activation memory reduction
📌 Implementation: benchmark/compression.py
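Why buffer reuse helps can be shown with a small liveness calculation. The sketch below is an illustration under simplifying assumptions, not the algorithm in benchmark/compression.py: it compares allocating every activation separately against the peak number of bytes that are live at any one step, which is a lower bound on what a reuse scheme can reach.

```python
def native_bytes(tensors):
    """Total size when every activation gets its own buffer."""
    return sum(size for size, _, _ in tensors)


def peak_live_bytes(tensors):
    """Largest sum of sizes of activations that are alive at the same step."""
    last_step = max(last_use for _, _, last_use in tensors)
    peak = 0
    for step in range(last_step + 1):
        live = sum(size for size, produced, last_use in tensors
                   if produced <= step <= last_use)
        peak = max(peak, live)
    return peak


# (size in bytes, step that produces the tensor, last step that reads it)
# A simple 4-layer chain: each activation is consumed only by the next layer.
chain = [(100, 0, 1), (100, 1, 2), (100, 2, 3), (100, 3, 3)]

print(native_bytes(chain), peak_live_bytes(chain))
```

In a plain chain only two activations are ever alive at once, so the gap between the two numbers grows with network depth. That is consistent with the large reductions reported for deep models in the table above.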
Essential for deploying large models on memory-constrained devices:
| Quantization Scheme | Size vs FP32 | Example (7B model) |
|---|---|---|
| FP32 (baseline) | 1.00× | 28 GB |
| FP16 | 0.50× | 14 GB |
| INT8 (per-channel) | 0.25× | 7 GB |
| INT4 (block=32, symmetric) – llama.cpp | 0.156× | 4.4 GB |
Supported schemes:
- ✅ FP16
- ✅ INT8: symmetric/asymmetric × per-tensor/channel/block
- ✅ INT4: symmetric/asymmetric × per-tensor/channel/block
📌 See benchmark/examples.py for implementation examples.
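The size ratios in the table can be reproduced with a short calculation of effective bits per weight. This is a sketch under stated assumptions: the helper names are hypothetical, and the 32-bit per-block scale is an assumption chosen because it reproduces the 0.156× figure, not a confirmed detail of the storage layout.

```python
def bits_per_weight(weight_bits, block_size=None, scale_bits=0, zero_point_bits=0):
    """Effective storage cost per weight, including per-block metadata."""
    if block_size is None:
        # Per-tensor or per-channel metadata is negligible for large layers.
        return float(weight_bits)
    return weight_bits + (scale_bits + zero_point_bits) / block_size


def model_size_gb(num_params, bpw):
    """Model size in GB (10^9 bytes) for a given bits-per-weight cost."""
    return num_params * bpw / 8 / 1e9


params_7b = 7e9
schemes = {
    "FP32": bits_per_weight(32),
    "FP16": bits_per_weight(16),
    "INT8 (per-channel)": bits_per_weight(8),
    # Assumption: one 32-bit scale per block of 32 weights, symmetric (no zero point).
    "INT4 (block=32, symmetric)": bits_per_weight(4, block_size=32, scale_bits=32),
}
for name, bpw in schemes.items():
    print(f"{name}: {bpw / 32:.3f}x, {model_size_gb(params_7b, bpw):.2f} GB")
```

Asymmetric schemes add a zero point per tensor, channel, or block, which can be modelled here through `zero_point_bits`.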
# PyPI (recommended)
pip install onnx-tool
# Latest development version
pip install --upgrade git+https://github.com/ThanatosShinji/onnx-tool.git
Requirements: Python ≥ 3.6
⚠️ Troubleshooting: If ONNX installation fails, try: pip install onnx==1.8.1 && pip install onnx-tool
- Loop op is not supported
- Sequence type is not supported
Comprehensive profiling of ONNX Model Zoo and SOTA models. Input shapes defined in data/public/config.py.
📥 Download pre-profiled models (with full tensor shapes):
- Baidu Drive (code: p91k)
- Google Drive
Contributions are welcome! Please open an issue or PR for:
- Bug reports
- Feature requests
- Documentation improvements
- New model support



