
📄 简体中文 | ✨ New Project: AI-Enhancement-Filter (powered by onnx-tool)


Python 3.6+ PyPI Version License

onnx-tool

A comprehensive toolkit for analyzing, optimizing, and transforming ONNX models with advanced capabilities for LLMs, diffusion models, and computer vision architectures.

  • LLM Optimization: Build and profile large language models with KV cache analysis (example)
  • Graph Transformation:
    • Constant folding (docs)
    • Operator fusion (docs)
  • Advanced Profiling:
    • Rapid shape inference
    • MACs/parameter statistics with sparsity awareness
  • Compute Graph Engine: Runtime shape computation with minimal overhead (details)
  • Memory Compression:
    • Activation memory optimization (up to 95% reduction)
    • Weight quantization (FP16, INT8/INT4 with per-tensor/channel/block schemes)
  • Quantization & Sparsity: Full support for quantized and sparse model analysis

🤖 Supported Model Architectures

| Domain | Models |
|---|---|
| NLP | BERT, T5, GPT, LLaMa, MPT (TransformerModel) |
| Diffusion | Stable Diffusion (TextEncoder, VAE, UNet) |
| CV | Detic, BEVFormer, SSD300_VGG16, ConvNeXt, Mask R-CNN, Silero VAD |
| Audio | Sovits, LPCNet |

⚡ Build & Profile LLMs in Seconds

Profile 10 Hugging Face models in under one second. Export ONNX models with llama.cpp-like simplicity (code).

Model Statistics (1k token input)

| Model (1k token input) | MACs(G) | Parameters(G) | KV Cache(G) |
|---|---|---|---|
| gpt-j-6b | 6277 | 6.05049 | 0.234881 |
| yi-1.5-34B | 35862 | 34.3889 | 0.125829 |
| microsoft/phi-2 | 2948 | 2.77944 | 0.167772 |
| Phi-3-mini-4k | 4083 | 3.82108 | 0.201327 |
| Phi-3-small-8k-instruct | 7912 | 7.80167 | 0.0671089 |
| Phi-3-medium-4k-instruct | 14665 | 13.9602 | 0.104858 |
| Llama3-8B | 8029 | 8.03026 | 0.0671089 |
| Llama-3.1-70B-Japanese-Instruct-2407 | 72888 | 70.5537 | 0.167772 |
| QWen-7B | 7509 | 7.61562 | 0.0293601 |
| Qwen2_72B_Instruct | 74895 | 72.7062 | 0.167772 |
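The KV Cache(G) column counts elements (in billions), not bytes, and can be sanity-checked by hand. A minimal sketch, assuming the published Llama3-8B configuration (32 layers, 8 KV heads via grouped-query attention, head dimension 128):

```python
# Back-of-envelope check of the KV Cache(G) column above.
def kv_cache_elements(n_layers, n_kv_heads, head_dim, seq_len):
    # K and V each store (n_kv_heads * head_dim) values per token per layer
    return 2 * n_layers * n_kv_heads * head_dim * seq_len

elems = kv_cache_elements(32, 8, 128, 1024)
print(elems / 1e9)  # 0.067108864 G elements, matching the Llama3-8B row
```

Multiply by the element width (e.g. 2 bytes for FP16) to convert to a memory footprint.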

Latency Estimation (4-bit weights, 16-bit KV cache)

| Model | Memory Size(GB) | Ultra-155H TTFT(s) | Ultra-155H TPOT(s) | Arc-A770 TTFT(s) | Arc-A770 TPOT(s) | H100-PCIe TTFT(s) | H100-PCIe TPOT(s) |
|---|---|---|---|---|---|---|---|
| gpt-j-6b | 3.75678 | 1.0947 | 0.041742 | 0.0916882 | 0.00670853 | 0.0164015 | 0.00187839 |
| yi-1.5-34B | 19.3369 | 5.77095 | 0.214854 | 0.45344 | 0.0345302 | 0.0747854 | 0.00966844 |
| microsoft/phi-2 | 1.82485 | 0.58361 | 0.0202761 | 0.0529628 | 0.00325866 | 0.010338 | 0.000912425 |
| Phi-3-mini-4k | 2.49649 | 0.811173 | 0.0277388 | 0.0745356 | 0.00445802 | 0.0147274 | 0.00124825 |
| Phi-3-small-8k-instruct | 4.2913 | 1.38985 | 0.0476811 | 0.117512 | 0.00766303 | 0.0212535 | 0.00214565 |
| Phi-3-medium-4k-instruct | 7.96977 | 2.4463 | 0.088553 | 0.198249 | 0.0142317 | 0.0340576 | 0.00398489 |
| Llama3-8B | 4.35559 | 1.4354 | 0.0483954 | 0.123333 | 0.00777784 | 0.0227182 | 0.00217779 |
| Llama-3.1-70B-Japanese-Instruct-2407 | 39.4303 | 11.3541 | 0.438114 | 0.868475 | 0.0704112 | 0.137901 | 0.0197151 |
| QWen-7B | 4.03576 | 1.34983 | 0.0448417 | 0.11722 | 0.00720671 | 0.0218461 | 0.00201788 |
| Qwen2_72B_Instruct | 40.5309 | 11.6534 | 0.450343 | 0.890816 | 0.0723766 | 0.14132 | 0.0202654 |

💡 Latencies computed from hardware specs – no actual inference required
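The TPOT figures are consistent with a simple roofline-style estimate: token generation is memory-bandwidth bound, so time per output token is roughly model memory size divided by peak memory bandwidth. A sketch under that assumption (the bandwidth figures are approximate published specs; the tool's exact method may differ):

```python
# Roofline-style decode latency estimate: no inference required.
BANDWIDTH_GBPS = {          # approximate peak memory bandwidth per device
    'Ultra-155H': 90,
    'Arc-A770': 560,
    'H100-PCIe': 2000,
}

def estimate_tpot(memory_size_gb, device):
    # time per output token ~= bytes read per token / bandwidth
    return memory_size_gb / BANDWIDTH_GBPS[device]

# gpt-j-6b at 4-bit weights occupies ~3.75678 GB:
print(estimate_tpot(3.75678, 'H100-PCIe'))  # ~0.00187839 s, as in the table
```

TTFT is compute-bound instead, scaling with prefill MACs divided by peak throughput.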


🔧 Basic Parsing & Editing

Intuitive API for model manipulation:

```python
from onnx_tool import Model

model = Model('model.onnx')          # Load any ONNX file
graph = model.graph                  # Access computation graph
node = graph.nodemap['Conv_0']       # Modify operator attributes
tensor = graph.tensormap['weight']   # Edit tensor data/types
model.save_model('modified.onnx')    # Persist changes
```

See comprehensive examples in benchmark/examples.py.


📊 Shape Inference & Profiling

All profiling relies on precise shape inference:

*Figure: shape inference visualization*

Profiling Capabilities

  • Standard profiling: MACs, parameters, memory footprint
  • Sparse-aware profiling: Quantify sparsity impact on compute

*Figures: MACs profiling table; sparse model profiling table*
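As a conceptual illustration of sparse-aware profiling (not onnx-tool's actual API): with block-structured weight sparsity, multiplications against zero blocks can be skipped, so effective MACs scale with the non-zero fraction:

```python
# Conceptual sketch: sparsity-aware MAC counting.
def effective_macs(dense_macs, zero_ratio):
    """Effective compute after skipping zero-valued weight blocks."""
    return dense_macs * (1.0 - zero_ratio)

# A layer with 1e9 dense MACs and 50% zero blocks:
print(effective_macs(1e9, 0.5))  # 5e8 effective MACs
```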



⚙️ Compute Graph & Shape Engine

Transform exported ONNX graphs into efficient Compute Graphs by removing shape-calculation overhead:

*Figure: compute graph transformation*

  • Compute Graph: Minimal graph containing only compute operations
  • Shape Engine: Runtime shape resolver for dynamic models

Use Cases:

  • Integration with custom inference engines (guide)
  • Shape regression testing (example)
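Conceptually, a shape engine records each tensor's shape symbolically and resolves a handful of free variables (e.g. batch size, sequence length) at runtime, with no ONNX shape-computation ops executed. A minimal sketch with hypothetical tensor names (this is an illustration of the idea, not onnx-tool's API):

```python
# Pre-recorded symbolic shapes, as a shape engine might store them.
symbolic_shapes = {
    'input_ids': ('batch', 'seq'),
    'hidden':    ('batch', 'seq', 4096),
    'logits':    ('batch', 'seq', 32000),
}

def resolve(shapes, **variables):
    # Substitute runtime values for symbolic dims; fixed dims pass through.
    return {name: tuple(variables.get(d, d) for d in dims)
            for name, dims in shapes.items()}

print(resolve(symbolic_shapes, batch=1, seq=1024)['hidden'])  # (1, 1024, 4096)
```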

💾 Memory Compression

Activation Memory Compression

Reuses temporary buffers to minimize peak memory usage – critical for LLMs and high-res CV models.

| Model | Native Memory Size(MB) | Compressed Memory Size(MB) | Compressed / Native (%) |
|---|---|---|---|
| StableDiffusion (VAE_encoder) | 14,245 | 540 | 3.7 |
| StableDiffusion (VAE_decoder) | 25,417 | 1,140 | 4.48 |
| StableDiffusion (Text_encoder) | 215 | 5 | 2.5 |
| StableDiffusion (UNet) | 36,135 | 2,232 | 6.2 |
| GPT2 | 40 | 2 | 6.9 |
| BERT | 2,170 | 27 | 1.25 |

✅ Typical models achieve >90% activation memory reduction
📌 Implementation: benchmark/compression.py
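The idea behind activation compression can be sketched as follows (a simplified model, not the tool's actual allocator): a tensor's buffer can be recycled once no later node reads it, so peak memory is driven by simultaneously live tensors, not by their total sum:

```python
# Lifetime-based peak-memory estimate for activation buffers.
def peak_memory(tensors):
    """tensors: list of (size, first_use_step, last_use_step)."""
    peak = 0
    steps = {s for _, a, b in tensors for s in (a, b)}
    for step in steps:
        # Sum the sizes of all tensors live at this step.
        live = sum(size for size, a, b in tensors if a <= step <= b)
        peak = max(peak, live)
    return peak

# Four 100 MB activations used strictly one after another:
acts = [(100, 0, 1), (100, 1, 2), (100, 2, 3), (100, 3, 4)]
print(sum(s for s, _, _ in acts), peak_memory(acts))  # 400 vs 200
```

With reuse, only adjacent producer/consumer pairs are live together, so 200 MB of buffers suffice where naive allocation would need 400 MB.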

Weight Compression

Essential for deploying large models on memory-constrained devices:

| Quantization Scheme | Size vs FP32 | Example (7B model) |
|---|---|---|
| FP32 (baseline) | 1.00× | 28 GB |
| FP16 | 0.50× | 14 GB |
| INT8 (per-channel) | 0.25× | 7 GB |
| INT4 (block=32, symmetric), llama.cpp-style | 0.156× | 4.4 GB |

Supported schemes:

  • ✅ FP16
  • ✅ INT8: symmetric/asymmetric × per-tensor/channel/block
  • ✅ INT4: symmetric/asymmetric × per-tensor/channel/block

📌 See benchmark/examples.py for implementation examples.
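To illustrate how block-wise symmetric quantization works (a simplified sketch, not onnx-tool's implementation): each block of 32 weights shares one floating-point scale, and weights are rounded to small integers:

```python
# Illustrative symmetric INT4 block quantization (block size 32).
def quantize_block(block):
    """Map a block of floats to int values in [-7, 7] plus one scale."""
    scale = max(abs(v) for v in block) / 7 or 1.0  # avoid zero scale
    q = [round(v / scale) for v in block]
    return q, scale

def dequantize_block(q, scale):
    return [v * scale for v in q]

block = [0.1 * i for i in range(-16, 16)]      # 32 example weights
q, s = quantize_block(block)
restored = dequantize_block(q, s)
err = max(abs(a - b) for a, b in zip(block, restored))
print(all(-7 <= v <= 7 for v in q))  # True; worst-case error is ~scale/2
```

Storage per weight is then 4 bits plus the amortized scale, which is where the ~0.156× size in the table above comes from.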


🚀 Installation

```shell
# PyPI (recommended)
pip install onnx-tool

# Latest development version
pip install --upgrade git+https://github.com/ThanatosShinji/onnx-tool.git
```

Requirements: Python ≥ 3.6

⚠️ Troubleshooting: If ONNX installation fails, try:

```shell
pip install onnx==1.8.1 && pip install onnx-tool
```

Known Issues

  • The Loop operator is not supported
  • Sequence tensor types are not supported

📈 Model Zoo Results

Comprehensive profiling of ONNX Model Zoo and SOTA models. Input shapes defined in data/public/config.py.

📥 Download pre-profiled models (with full tensor shapes):

| Model | Params(M) | MACs(M) |
|---|---|---|
| GPT-J 1 layer | 464 | 173,398 |
| MPT 1 layer | 261 | 79,894 |
| LLaMa 1 layer | 618 | 211,801 |
| text_encoder | 123.13 | 6,782 |
| UNet2DCondition | 859.52 | 888,870 |
| VAE_encoder | 34.16 | 566,371 |
| VAE_decoder | 49.49 | 1,271,959 |
| SqueezeNet 1.0 | 1.23 | 351 |
| AlexNet | 60.96 | 665 |
| GoogleNet | 6.99 | 1,606 |
| googlenet_age | 5.98 | 1,605 |
| LResNet100E-IR | 65.22 | 12,102 |
| BERT-Squad | 113.61 | 22,767 |
| BiDAF | 18.08 | 9.87 |
| EfficientNet-Lite4 | 12.96 | 1,361 |
| Emotion | 12.95 | 877 |
| Mask R-CNN | 46.77 | 92,077 |
| BEVFormer Tiny | 33.7 | 210,838 |
| rvm_mobilenetv3 | 3.73 | 4,289 |
| yolov4 | 64.33 | 3,319 |
| ConvNeXt-L | 229.79 | 34,872 |
| edgenext_small | 5.58 | 1,357 |
| SSD | 19.98 | 216,598 |
| RealESRGAN | 16.69 | 73,551 |
| ShuffleNet | 2.29 | 146 |
| GPT-2 | 137.02 | 1,103 |
| T5-encoder | 109.62 | 686 |
| T5-decoder | 162.62 | 1,113 |
| RoBERTa-BASE | 124.64 | 688 |
| Faster R-CNN | 44.10 | 46,018 |
| FCN ResNet-50 | 35.29 | 37,056 |
| ResNet50 | 25 | 3,868 |

🤝 Contributing

Contributions are welcome! Please open an issue or PR for:

  • Bug reports
  • Feature requests
  • Documentation improvements
  • New model support