This server exposes endpoints to run Gemma-3-270M-IT via ONNX Runtime on CPU. It supports advanced text generation with intelligent stopping, quality validation, prompt engineering, and comprehensive sampling controls for production-ready inference.
- Development: http://localhost:5258
- Swagger UI: /swagger
- Method: POST
- URL: /api/Inference/infer-tokens
- Body:

```json
{
  "inputIds": [2, 1234, 5678]
}
```

- Response:
```json
{
  "output": [/* flattened logits for [batch=1, seq_len, vocab] */]
}
```

Example curl:
```bash
curl -k -X POST http://localhost:5258/api/Inference/infer-tokens \
  -H "Content-Type: application/json" \
  -d '{"inputIds":[2,1234,5678]}' | jq
```
- Method: POST
- URL: /api/Inference/infer
- Body:

```json
{
  "prompt": "The capital of France is",
  "maxTokens": 32,
  "temperature": 0.8,
  "topK": 50,
  "topP": 0.9,
  "repetitionPenalty": 1.2,
  "frequencyPenalty": 0.1,
  "presencePenalty": 0.1,
  "smartStop": true,
  "detectTopicDrift": true,
  "validateResponse": true,
  "systemPrompt": "You are a helpful geography assistant.",
  "useFewShot": true,
  "stop": ["\n\n", "Human:", "User:"]
}
```

- Response:
```json
{
  "completion": "The capital of France is Paris.",
  "duration": "1076ms",
  "tokensGenerated": 20,
  "averageConfidence": 0.9531573,
  "qualityValidated": true,
  "stopReason": "max_tokens",
  "topicDriftDetected": false,
  "parameters": {
    "temperature": 0.8,
    "topK": 50,
    "topP": 0.9,
    "repetitionPenalty": 1.2,
    "frequencyPenalty": 0.1,
    "presencePenalty": 0.1,
    "smartStop": true,
    "detectTopicDrift": true,
    "validateResponse": true,
    "systemPrompt": "You are a helpful geography assistant.",
    "useFewShot": true
  }
}
```

Parameters:
- `prompt` (required): Input text to complete
- `maxTokens` (optional): Maximum tokens to generate (default: 64)
- `temperature` (optional): Sampling temperature (0.0 = greedy, 1.0 = default)
- `topK` (optional): Top-K sampling (limit to the top K tokens)
- `topP` (optional): Top-P (nucleus) sampling (0.0-1.0)
- `repetitionPenalty` (optional): Reduces repetitive token generation (1.0 = no penalty, >1.0 = reduce repetition)
- `frequencyPenalty` (optional): Penalizes frequently used tokens (0.0 = no penalty, >0.0 = reduce frequency)
- `presencePenalty` (optional): Penalizes tokens that have already appeared (0.0 = no penalty, >0.0 = reduce presence)
- `smartStop` (optional): Enable intelligent stopping at natural boundaries (default: true)
- `detectTopicDrift` (optional): Detect when the model goes off-topic and stop (default: true)
- `validateResponse` (optional): Validate response quality and filter poor outputs (default: true)
- `minConfidence` (optional): Minimum confidence threshold for token selection (0.0-1.0)
- `systemPrompt` (optional): System prompt to provide context and instructions
- `useFewShot` (optional): Enable few-shot examples for better context (default: false)
- `stop` (optional): Array of stop strings that halt generation when encountered
`maxTokens`:
- What it does: Controls how many new tokens (roughly word pieces) the model generates
- Effect on output:
- Low values (5-15): Short, concise responses
- Medium values (20-50): Detailed answers with context
- High values (100+): Long-form content, essays, or stories
- Example:
  `maxTokens: 10` might give "Paris." while `maxTokens: 50` gives "Paris, the beautiful capital city of France known for its art, culture, and the Eiffel Tower."
`temperature`:
- What it does: Controls how "creative" or "random" the model's word choices are
- Effect on output:
- 0.0 (Greedy): Always picks the most likely word → Very consistent, predictable responses
- 0.3-0.5: Conservative, factual responses with slight variation
- 0.7-0.9: Balanced creativity and coherence
- 1.0+: More creative, varied, and sometimes unexpected responses
- Example: Same prompt with different temperatures:
  - `temperature: 0.0` → "The capital of France is Paris."
  - `temperature: 0.8` → "The capital of France is Paris, a beautiful city known for its art and culture."
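The server's exact sampling code isn't shown here, but temperature is conventionally applied by dividing the logits before the softmax. A minimal sketch of that convention:

```python
import numpy as np

def apply_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Turn raw logits into a probability distribution.

    temperature = 0 degenerates to greedy (argmax); values above 1
    flatten the distribution so less likely tokens get sampled more often.
    """
    if temperature <= 0.0:
        probs = np.zeros_like(logits)
        probs[np.argmax(logits)] = 1.0   # greedy: all mass on the top token
        return probs
    scaled = logits / temperature
    scaled -= scaled.max()               # subtract max for numerical stability
    exps = np.exp(scaled)
    return exps / exps.sum()
```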
`topK`:
- What it does: Limits the model to only consider the top K most likely words at each step
- Effect on output:
- Low values (10-20): Very focused, uses only the most common words
- Medium values (40-80): Good balance of common and specialized vocabulary
- High values (100+): More diverse vocabulary, can include rare or technical terms
- Example:
  `topK: 20` might use simple words like "big city"; `topK: 80` might use "metropolitan area" or "urban center"
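Top-K is typically implemented by masking everything outside the K highest logits before sampling; a short illustrative sketch (not necessarily the server's exact code):

```python
import numpy as np

def top_k_filter(logits: np.ndarray, k: int) -> np.ndarray:
    """Keep only the k highest logits; set the rest to -inf so they
    receive zero probability after softmax."""
    filtered = np.full_like(logits, -np.inf)
    top_indices = np.argpartition(logits, -k)[-k:]
    filtered[top_indices] = logits[top_indices]
    return filtered
```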
`topP`:
- What it does: Dynamically selects words based on cumulative probability, adapting to context
- Effect on output:
- Low values (0.1-0.3): Very focused, uses only the most probable words
- Medium values (0.5-0.8): Good balance, adapts to context
- High values (0.9-0.95): More diverse, considers more word options
- Example: In technical contexts, lower topP keeps responses precise; in creative contexts, higher topP allows more expressive language
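Nucleus sampling keeps the smallest set of tokens whose cumulative probability reaches `topP`. A minimal sketch of the standard algorithm (illustrative; operates on an already-softmaxed distribution):

```python
import numpy as np

def top_p_filter(probs: np.ndarray, p: float) -> np.ndarray:
    """Zero out tokens outside the nucleus and renormalize."""
    order = np.argsort(probs)[::-1]                   # most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1  # size of the nucleus
    nucleus = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[nucleus] = probs[nucleus]
    return filtered / filtered.sum()
```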
`repetitionPenalty`:
- What it does: Reduces the likelihood of repeating tokens that have already been generated
- Effect on output:
- 1.0: No penalty, normal repetition behavior
- 1.1-1.3: Light penalty, reduces obvious repetitions
- 1.5-2.0: Strong penalty, prevents repetitive patterns
- Example:
  `repetitionPenalty: 1.2` prevents "The capital of France is Paris. Paris is the capital. Paris is..." type loops
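The README doesn't spell out the penalty formula; a common convention (the CTRL-style rule used by many implementations) divides positive logits and multiplies negative ones, so the adjustment always lowers a repeated token's score:

```python
import numpy as np

def apply_repetition_penalty(logits: np.ndarray, generated: list[int],
                             penalty: float) -> np.ndarray:
    """Discourage tokens that already appear in the generated sequence."""
    adjusted = logits.copy()
    for token_id in set(generated):
        if adjusted[token_id] > 0:
            adjusted[token_id] /= penalty   # shrink positive scores
        else:
            adjusted[token_id] *= penalty   # push negative scores further down
    return adjusted
```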
`frequencyPenalty`:
- What it does: Penalizes tokens based on how frequently they've appeared in the generated text
- Effect on output:
- 0.0: No frequency penalty
- 0.1-0.3: Light penalty, encourages vocabulary diversity
- 0.5+: Strong penalty, forces more varied word choices
- Example: Prevents overuse of common words like "the", "is", "and" in long responses
`presencePenalty`:
- What it does: Penalizes tokens that have appeared at least once in the generated text
- Effect on output:
- 0.0: No presence penalty
- 0.1-0.3: Light penalty, encourages new vocabulary
- 0.5+: Strong penalty, forces completely new word choices
- Example: At high values, strongly discourages reusing any word that has already appeared, creating more varied responses
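Frequency and presence penalties are usually applied together as additive adjustments (the convention popularized by the OpenAI API; the server's exact formula may differ). A sketch covering both parameters above:

```python
from collections import Counter

import numpy as np

def apply_frequency_presence(logits: np.ndarray, generated: list[int],
                             frequency_penalty: float,
                             presence_penalty: float) -> np.ndarray:
    """frequency scales with each token's count; presence is a flat
    penalty for any token that has appeared at least once."""
    adjusted = logits.copy()
    for token_id, count in Counter(generated).items():
        adjusted[token_id] -= frequency_penalty * count + presence_penalty
    return adjusted
```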
`smartStop`:
- What it does: Automatically stops generation at natural sentence boundaries
- Effect on output:
- true: Stops at complete sentences, prevents mid-sentence cuts
- false: Stops exactly at maxTokens, may cut mid-sentence
- Example: Prevents responses like "The capital of France is Paris. The 16th century was a time of" (incomplete)
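The server's boundary detection isn't documented in detail; one simple heuristic for "is this a natural stopping point?" looks like the following sketch:

```python
import re

def at_sentence_boundary(text: str) -> bool:
    # A natural boundary: the text ends with ., !, or ?, optionally
    # followed by a closing quote or bracket and trailing whitespace.
    return re.search(r"[.!?][\"')\]]?\s*$", text) is not None
```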
`detectTopicDrift`:
- What it does: Detects when the model starts going off-topic and stops generation
- Effect on output:
- true: Maintains topic focus, prevents rambling
- false: Allows free-form generation, may go off-topic
- Example: Stops when "What is the capital of France?" leads to "The 1980s were a period of significant change..."
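How the server measures drift isn't specified; a deliberately crude lexical-overlap heuristic illustrates the idea (real implementations might compare embeddings instead):

```python
def has_drifted(prompt: str, recent_text: str, min_overlap: float = 0.1) -> bool:
    """Flag drift when a recent window of generated text shares almost
    no vocabulary with the original prompt."""
    prompt_words = set(prompt.lower().split())
    recent_words = set(recent_text.lower().split())
    if not prompt_words or not recent_words:
        return False
    overlap = len(prompt_words & recent_words) / len(recent_words)
    return overlap < min_overlap
```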
`validateResponse`:
- What it does: Validates response quality based on confidence and diversity metrics
- Effect on output:
- true: Filters out low-quality responses, ensures coherence
- false: Returns all responses regardless of quality
- Example: Rejects responses with very low confidence scores or excessive repetition
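A quality gate built from the two metrics the response already reports (average confidence and repetitiveness) might look like this sketch; the threshold values are hypothetical:

```python
def passes_validation(token_ids: list[int], confidences: list[float],
                      min_confidence: float = 0.3,
                      min_distinct_ratio: float = 0.3) -> bool:
    """Reject outputs with low average per-token confidence or an
    excessively repetitive vocabulary."""
    if not token_ids:
        return False
    avg_confidence = sum(confidences) / len(confidences)
    distinct_ratio = len(set(token_ids)) / len(token_ids)
    return avg_confidence >= min_confidence and distinct_ratio >= min_distinct_ratio
```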
`systemPrompt`:
- What it does: Provides context and instructions to guide the model's behavior
- Effect on output:
- null: No additional context
- Custom prompt: Guides model to specific behavior patterns
- Example: "You are a helpful geography assistant" → More focused, helpful responses about geography
`useFewShot`:
- What it does: Automatically adds relevant examples to improve context understanding
- Effect on output:
- false: No examples added
- true: Adds relevant Q&A examples based on prompt content
- Example: For "capital" questions, adds "Q: What is the capital of France? A: Paris" example
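Putting `systemPrompt` and few-shot examples together, the final prompt is presumably assembled along these lines (the server's actual template is not documented; this sketch only shows the idea):

```python
def build_prompt(user_prompt: str,
                 system_prompt: str | None = None,
                 few_shot: list[tuple[str, str]] | None = None) -> str:
    """Assemble: optional system instructions, optional Q/A examples,
    then the user's text."""
    parts = []
    if system_prompt:
        parts.append(system_prompt)
    for question, answer in few_shot or []:
        parts.append(f"Q: {question}\nA: {answer}")
    parts.append(user_prompt)
    return "\n\n".join(parts)

# e.g. build_prompt("What is the capital of France?",
#                   system_prompt="You are a helpful geography assistant.",
#                   few_shot=[("What is the capital of Italy?", "Rome")])
```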
Recommended presets:

Factual Q&A:

```json
{
  "temperature": 0.1,
  "topK": 40,
  "maxTokens": 30,
  "smartStop": true,
  "detectTopicDrift": true,
  "validateResponse": true,
  "systemPrompt": "You are a helpful assistant. Provide accurate, concise answers to questions."
}
```

Produces: Accurate, concise, factual responses with high confidence
Creative writing:

```json
{
  "temperature": 0.8,
  "topK": 80,
  "topP": 0.9,
  "maxTokens": 100,
  "repetitionPenalty": 1.1,
  "frequencyPenalty": 0.1,
  "smartStop": true,
  "systemPrompt": "You are a creative writer. Write engaging, imaginative content."
}
```

Produces: Varied, creative, engaging content with controlled repetition
Technical writing:

```json
{
  "temperature": 0.3,
  "topK": 50,
  "maxTokens": 200,
  "repetitionPenalty": 1.2,
  "smartStop": true,
  "validateResponse": true,
  "systemPrompt": "You are a technical writing assistant. Provide clear, precise, well-structured technical content."
}
```

Produces: Clear, precise, well-structured technical content
Conversational assistant:

```json
{
  "temperature": 0.7,
  "topK": 60,
  "topP": 0.8,
  "maxTokens": 50,
  "repetitionPenalty": 1.15,
  "frequencyPenalty": 0.05,
  "smartStop": true,
  "detectTopicDrift": true,
  "systemPrompt": "You are a friendly, helpful conversational assistant."
}
```

Produces: Natural, engaging, conversational responses
Code generation:

```json
{
  "temperature": 0.2,
  "topK": 30,
  "maxTokens": 150,
  "repetitionPenalty": 1.3,
  "smartStop": true,
  "validateResponse": true,
  "systemPrompt": "You are a programming assistant. Write clean, efficient, well-commented code.",
  "stop": ["\n\n", "```", "// End of code"]
}
```

Produces: Clean, efficient, well-structured code
Long-form content:

```json
{
  "temperature": 0.6,
  "topK": 70,
  "topP": 0.85,
  "maxTokens": 300,
  "repetitionPenalty": 1.2,
  "frequencyPenalty": 0.1,
  "presencePenalty": 0.05,
  "smartStop": true,
  "detectTopicDrift": true,
  "systemPrompt": "You are a professional writer. Create well-structured, coherent long-form content."
}
```

Produces: Well-structured, coherent long-form content
```bash
# Simple greedy generation
curl -k -X POST http://localhost:5258/api/Inference/infer \
  -H "Content-Type: application/json" \
  -d '{"prompt":"The capital of France is","maxTokens":32}' | jq

# With sampling parameters
curl -k -X POST http://localhost:5258/api/Inference/infer \
  -H "Content-Type: application/json" \
  -d '{"prompt":"The capital of France is","maxTokens":50,"temperature":0.8,"topK":50}' | jq

# High-quality factual response with penalties
curl -k -X POST http://localhost:5258/api/Inference/infer \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "What is the capital of Germany?",
    "maxTokens": 20,
    "temperature": 0.1,
    "repetitionPenalty": 1.2,
    "smartStop": true,
    "detectTopicDrift": true,
    "validateResponse": true
  }' | jq

# Creative writing with system prompt
curl -k -X POST http://localhost:5258/api/Inference/infer \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Write a short story about a robot",
    "maxTokens": 100,
    "temperature": 0.8,
    "topK": 80,
    "topP": 0.9,
    "repetitionPenalty": 1.1,
    "systemPrompt": "You are a creative writer. Write engaging, imaginative stories.",
    "smartStop": true
  }' | jq

# Technical documentation with few-shot examples
curl -k -X POST http://localhost:5258/api/Inference/infer \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain how to implement a binary search algorithm",
    "maxTokens": 200,
    "temperature": 0.3,
    "topK": 50,
    "repetitionPenalty": 1.2,
    "systemPrompt": "You are a programming instructor. Provide clear, well-structured technical explanations.",
    "useFewShot": true,
    "validateResponse": true
  }' | jq
```

appsettings.json:
```json
{
  "ModelOptions": {
    "ModelPath": "AIModels/gemma-3-270m-it-ONNX/onnx/model_q4.onnx",
    "TokenizerPath": "AIModels/gemma-3-270m-it-ONNX/tokenizer.json",
    "ConfigPath": "AIModels/gemma-3-270m-it-ONNX/config.json"
  },
  "OnnxRuntime": {
    "IntraOpNumThreads": 0,
    "InterOpNumThreads": 0,
    "EnableCpuMemArena": true,
    "ExecutionMode": "Sequential"
  }
}
```
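The `OnnxRuntime` block maps directly onto ONNX Runtime's `SessionOptions`. For reference, the equivalent configuration in Python (the server itself is .NET; this sketch only shows the mapping):

```python
import onnxruntime as ort

options = ort.SessionOptions()
options.intra_op_num_threads = 0     # 0 lets ONNX Runtime choose
options.inter_op_num_threads = 0
options.enable_cpu_mem_arena = True
options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

session = ort.InferenceSession(
    "AIModels/gemma-3-270m-it-ONNX/onnx/model_q4.onnx",
    sess_options=options,
    providers=["CPUExecutionProvider"],
)
```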
- Multiple Sampling Methods: Temperature, Top-K, Top-P (nucleus) sampling
- Repetition Control: Repetition, frequency, and presence penalties
- Smart Stopping: Intelligent stopping at natural boundaries and topic drift detection
- Quality Validation: Confidence scoring and response quality filtering
- KV Caching: Efficient single-token generation after initial pass
- Performance Metrics: Comprehensive duration, confidence, and quality tracking
- System Prompts: Context and instruction guidance for better responses
- Few-Shot Learning: Automatic example injection for improved context
- Stop Strings: Custom stopping conditions for precise control
- Response Post-Processing: Clean formatting and quality enhancement
- Confidence Scoring: Real-time confidence tracking for each generated token
- Topic Drift Detection: Automatic detection and stopping when model goes off-topic
- Response Validation: Multi-metric quality validation and filtering
- Natural Language Processing: Intelligent sentence boundary detection
- Gemma Compatible: Proper SentencePiece token handling
- Special Tokens: Automatic BOS/EOS detection from tokenizer.json
- Text Formatting: Clean output without tokenization artifacts
- Advanced Decoding: Sophisticated token-to-text conversion
- CPU Optimized: Efficient ONNX Runtime configuration
- Memory Efficient: KV cache reuse for long sequences
- Fast Inference: ~1 second for 20 tokens on modern CPU
- Production Ready: Enterprise-grade error handling and logging
- Scalable: Designed for high-throughput inference workloads
- KV Cache: Auto-detected at startup (36 layers for Gemma-3-270M); a schematic decode loop is sketched at the end of this document
- Special Tokens: Automatically read from tokenizer.json added_tokens section
- Model Format: Uses quantized ONNX format (model_q4.onnx) for optimal CPU performance
- Threading: Configurable intra/inter-op threads for performance tuning
- Thread Configuration: Set concrete thread counts in appsettings.json for production
- Request Timeouts: Consider adding request timeouts for long-running generations
- Quality Thresholds: Adjust minConfidence and validation parameters based on use case
- Monitoring: Use confidence scores and quality metrics for response monitoring
- Topic Drift Detection: Automatically prevents off-topic rambling in long generations
- Smart Stopping: Ensures responses end at natural sentence boundaries
- Quality Validation: Multi-layered validation prevents low-quality outputs
- Penalty Tuning: Fine-tune repetition, frequency, and presence penalties for optimal results
- System Prompts: Use descriptive system prompts for better context and consistency
- Few-Shot Examples: Enable for complex tasks requiring specific formatting or style
- Parameter Tuning: Start with recommended settings and adjust based on your specific use case
- Response Monitoring: Track confidence scores and quality metrics to optimize parameters
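To make the KV Cache note above concrete: after the initial full-prompt pass, each subsequent step feeds only the newest token, with the cached per-layer key/value tensors supplying the history. A schematic decode loop (hypothetical `model.forward` signature; not the server's actual code):

```python
def generate_greedy(model, prompt_ids: list[int], max_tokens: int) -> list[int]:
    # Prefill: run the whole prompt once and capture the per-layer KV cache.
    logits, kv_cache = model.forward(prompt_ids, past_kv=None)
    next_id = int(logits[-1].argmax())

    output = []
    for _ in range(max_tokens):
        output.append(next_id)
        # Decode: only the newest token goes through the model; the cache
        # makes each step cheap regardless of how long the sequence grows.
        logits, kv_cache = model.forward([next_id], past_kv=kv_cache)
        next_id = int(logits[-1].argmax())
    return output
```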