This server exposes endpoints to run Gemma-3-270M-IT via ONNX Runtime on CPU. It supports advanced text generation with intelligent stopping, quality validation, prompt engineering, and comprehensive sampling controls for production-ready inference.
- Development: http://localhost:5258
- Swagger UI: /swagger
- Method: POST
- URL: /api/Inference/infer-tokens
- Body:

```json
{
  "inputIds": [2, 1234, 5678]
}
```

- Response:
```json
{
  "output": [/* flattened logits for [batch=1, seq_len, vocab] */]
}
```

Example curl:
```bash
curl -k -X POST http://localhost:5258/api/Inference/infer-tokens \
  -H "Content-Type: application/json" \
  -d '{"inputIds":[2,1234,5678]}' | jq
```
- Method: POST
- URL: /api/Inference/infer
- Body:

```json
{
  "prompt": "The capital of France is",
  "maxTokens": 32,
  "temperature": 0.8,
  "topK": 50,
  "topP": 0.9,
  "repetitionPenalty": 1.2,
  "frequencyPenalty": 0.1,
  "presencePenalty": 0.1,
  "smartStop": true,
  "detectTopicDrift": true,
  "validateResponse": true,
  "systemPrompt": "You are a helpful geography assistant.",
  "useFewShot": true,
  "stop": ["\n\n", "Human:", "User:"]
}
```

- Response:
```json
{
  "completion": "The capital of France is Paris.",
  "duration": "1076ms",
  "tokensGenerated": 20,
  "averageConfidence": 0.9531573,
  "qualityValidated": true,
  "stopReason": "max_tokens",
  "topicDriftDetected": false,
  "parameters": {
    "temperature": 0.8,
    "topK": 50,
    "topP": 0.9,
    "repetitionPenalty": 1.2,
    "frequencyPenalty": 0.1,
    "presencePenalty": 0.1,
    "smartStop": true,
    "detectTopicDrift": true,
    "validateResponse": true,
    "systemPrompt": "You are a helpful geography assistant.",
    "useFewShot": true
  }
}
```

Parameters:
- `prompt` (required): Input text to complete
- `maxTokens` (optional): Maximum tokens to generate (default: 64)
- `temperature` (optional): Sampling temperature (0.0 = greedy, 1.0 = default)
- `topK` (optional): Top-K sampling (limit to the top K tokens)
- `topP` (optional): Top-P (nucleus) sampling (0.0-1.0)
- `repetitionPenalty` (optional): Reduces repetitive token generation (1.0 = no penalty, >1.0 = reduce repetition)
- `frequencyPenalty` (optional): Penalizes frequently used tokens (0.0 = no penalty, >0.0 = reduce frequency)
- `presencePenalty` (optional): Penalizes tokens that have already appeared (0.0 = no penalty, >0.0 = reduce presence)
- `smartStop` (optional): Enable intelligent stopping at natural boundaries (default: true)
- `detectTopicDrift` (optional): Detect when the model goes off-topic and stop (default: true)
- `validateResponse` (optional): Validate response quality and filter poor outputs (default: true)
- `minConfidence` (optional): Minimum confidence threshold for token selection (0.0-1.0)
- `systemPrompt` (optional): System prompt to provide context and instructions
- `useFewShot` (optional): Enable few-shot examples for better context (default: false)
- `stop` (optional): Array of stop strings that halt generation when encountered
`maxTokens`:
- What it does: Controls how many new tokens (roughly word pieces) the model generates
- Effect on output:
- Low values (5-15): Short, concise responses
- Medium values (20-50): Detailed answers with context
- High values (100+): Long-form content, essays, or stories
- Example:
  `maxTokens: 10` might give "Paris." while `maxTokens: 50` gives "Paris, the beautiful capital city of France known for its art, culture, and the Eiffel Tower."
`temperature`:
- What it does: Controls how "creative" or "random" the model's word choices are
- Effect on output:
- 0.0 (Greedy): Always picks the most likely word → Very consistent, predictable responses
- 0.3-0.5: Conservative, factual responses with slight variation
- 0.7-0.9: Balanced creativity and coherence
- 1.0+: More creative, varied, and sometimes unexpected responses
- Example: Same prompt with different temperatures:
  - `temperature: 0.0` → "The capital of France is Paris."
  - `temperature: 0.8` → "The capital of France is Paris, a beautiful city known for its art and culture."
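The server's exact sampling code isn't shown here, but temperature is conventionally applied by dividing the logits before the softmax. A minimal sketch of that convention:

```python
import numpy as np

def apply_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Turn raw logits into a probability distribution.

    temperature = 0 degenerates to greedy (argmax); values above 1
    flatten the distribution so less likely tokens get sampled more often.
    """
    if temperature <= 0.0:
        probs = np.zeros_like(logits)
        probs[np.argmax(logits)] = 1.0   # greedy: all mass on the top token
        return probs
    scaled = logits / temperature
    scaled -= scaled.max()               # subtract max for numerical stability
    exps = np.exp(scaled)
    return exps / exps.sum()
```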
`topK`:
- What it does: Limits the model to only consider the top K most likely words at each step
- Effect on output:
- Low values (10-20): Very focused, uses only the most common words
- Medium values (40-80): Good balance of common and specialized vocabulary
- High values (100+): More diverse vocabulary, can include rare or technical terms
- Example:
  `topK: 20` might use simple words like "big city"; `topK: 80` might use "metropolitan area" or "urban center"
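Top-K is typically implemented by masking everything outside the K highest logits before sampling; a short illustrative sketch (not necessarily the server's exact code):

```python
import numpy as np

def top_k_filter(logits: np.ndarray, k: int) -> np.ndarray:
    """Keep only the k highest logits; set the rest to -inf so they
    receive zero probability after softmax."""
    filtered = np.full_like(logits, -np.inf)
    top_indices = np.argpartition(logits, -k)[-k:]
    filtered[top_indices] = logits[top_indices]
    return filtered
```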
`topP`:
- What it does: Dynamically selects words based on cumulative probability, adapting to context
- Effect on output:
- Low values (0.1-0.3): Very focused, uses only the most probable words
- Medium values (0.5-0.8): Good balance, adapts to context
- High values (0.9-0.95): More diverse, considers more word options
- Example: In technical contexts, lower topP keeps responses precise; in creative contexts, higher topP allows more expressive language
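Nucleus sampling keeps the smallest set of tokens whose cumulative probability reaches `topP`. A minimal sketch of the standard algorithm (illustrative; operates on an already-softmaxed distribution):

```python
import numpy as np

def top_p_filter(probs: np.ndarray, p: float) -> np.ndarray:
    """Zero out tokens outside the nucleus and renormalize."""
    order = np.argsort(probs)[::-1]                   # most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1  # size of the nucleus
    nucleus = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[nucleus] = probs[nucleus]
    return filtered / filtered.sum()
```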
`repetitionPenalty`:
- What it does: Reduces the likelihood of repeating tokens that have already been generated
- Effect on output:
- 1.0: No penalty, normal repetition behavior
- 1.1-1.3: Light penalty, reduces obvious repetitions
- 1.5-2.0: Strong penalty, prevents repetitive patterns
- Example:
  `repetitionPenalty: 1.2` prevents "The capital of France is Paris. Paris is the capital. Paris is..." type loops
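The README doesn't spell out the penalty formula; a common convention (the CTRL-style rule used by many implementations) divides positive logits and multiplies negative ones, so the adjustment always lowers a repeated token's score:

```python
import numpy as np

def apply_repetition_penalty(logits: np.ndarray, generated: list[int],
                             penalty: float) -> np.ndarray:
    """Discourage tokens that already appear in the generated sequence."""
    adjusted = logits.copy()
    for token_id in set(generated):
        if adjusted[token_id] > 0:
            adjusted[token_id] /= penalty   # shrink positive scores
        else:
            adjusted[token_id] *= penalty   # push negative scores further down
    return adjusted
```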
`frequencyPenalty`:
- What it does: Penalizes tokens based on how frequently they've appeared in the generated text
- Effect on output:
- 0.0: No frequency penalty
- 0.1-0.3: Light penalty, encourages vocabulary diversity
- 0.5+: Strong penalty, forces more varied word choices
- Example: Prevents overuse of common words like "the", "is", "and" in long responses
`presencePenalty`:
- What it does: Penalizes tokens that have appeared at least once in the generated text
- Effect on output:
- 0.0: No presence penalty
- 0.1-0.3: Light penalty, encourages new vocabulary
- 0.5+: Strong penalty, forces completely new word choices
- Example: At high values, strongly discourages reusing any word that has already appeared, creating more varied responses
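Frequency and presence penalties are usually applied together as additive adjustments (the convention popularized by the OpenAI API; the server's exact formula may differ). A sketch covering both parameters above:

```python
from collections import Counter

import numpy as np

def apply_frequency_presence(logits: np.ndarray, generated: list[int],
                             frequency_penalty: float,
                             presence_penalty: float) -> np.ndarray:
    """frequency scales with each token's count; presence is a flat
    penalty for any token that has appeared at least once."""
    adjusted = logits.copy()
    for token_id, count in Counter(generated).items():
        adjusted[token_id] -= frequency_penalty * count + presence_penalty
    return adjusted
```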
`smartStop`:
- What it does: Automatically stops generation at natural sentence boundaries
- Effect on output:
- true: Stops at complete sentences, prevents mid-sentence cuts
- false: Stops exactly at maxTokens, may cut mid-sentence
- Example: Prevents responses like "The capital of France is Paris. The 16th century was a time of" (incomplete)
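The server's boundary detection isn't documented in detail; one simple heuristic for "is this a natural stopping point?" looks like the following sketch:

```python
import re

def at_sentence_boundary(text: str) -> bool:
    # A natural boundary: the text ends with ., !, or ?, optionally
    # followed by a closing quote or bracket and trailing whitespace.
    return re.search(r"[.!?][\"')\]]?\s*$", text) is not None
```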
`detectTopicDrift`:
- What it does: Detects when the model starts going off-topic and stops generation
- Effect on output:
- true: Maintains topic focus, prevents rambling
- false: Allows free-form generation, may go off-topic
- Example: Stops when "What is the capital of France?" leads to "The 1980s were a period of significant change..."
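How the server measures drift isn't specified; a deliberately crude lexical-overlap heuristic illustrates the idea (real implementations might compare embeddings instead):

```python
def has_drifted(prompt: str, recent_text: str, min_overlap: float = 0.1) -> bool:
    """Flag drift when a recent window of generated text shares almost
    no vocabulary with the original prompt."""
    prompt_words = set(prompt.lower().split())
    recent_words = set(recent_text.lower().split())
    if not prompt_words or not recent_words:
        return False
    overlap = len(prompt_words & recent_words) / len(recent_words)
    return overlap < min_overlap
```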
`validateResponse`:
- What it does: Validates response quality based on confidence and diversity metrics
- Effect on output:
- true: Filters out low-quality responses, ensures coherence
- false: Returns all responses regardless of quality
- Example: Rejects responses with very low confidence scores or excessive repetition
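A quality gate built from the two metrics the response already reports (average confidence and repetitiveness) might look like this sketch; the threshold values are hypothetical:

```python
def passes_validation(token_ids: list[int], confidences: list[float],
                      min_confidence: float = 0.3,
                      min_distinct_ratio: float = 0.3) -> bool:
    """Reject outputs with low average per-token confidence or an
    excessively repetitive vocabulary."""
    if not token_ids:
        return False
    avg_confidence = sum(confidences) / len(confidences)
    distinct_ratio = len(set(token_ids)) / len(token_ids)
    return avg_confidence >= min_confidence and distinct_ratio >= min_distinct_ratio
```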
`systemPrompt`:
- What it does: Provides context and instructions to guide the model's behavior
- Effect on output:
- null: No additional context
- Custom prompt: Guides model to specific behavior patterns
- Example: "You are a helpful geography assistant" → More focused, helpful responses about geography
`useFewShot`:
- What it does: Automatically adds relevant examples to improve context understanding
- Effect on output:
- false: No examples added
- true: Adds relevant Q&A examples based on prompt content
- Example: For "capital" questions, adds "Q: What is the capital of France? A: Paris" example
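Putting `systemPrompt` and few-shot examples together, the final prompt is presumably assembled along these lines (the server's actual template is not documented; this sketch only shows the idea):

```python
def build_prompt(user_prompt: str,
                 system_prompt: str | None = None,
                 few_shot: list[tuple[str, str]] | None = None) -> str:
    """Assemble: optional system instructions, optional Q/A examples,
    then the user's text."""
    parts = []
    if system_prompt:
        parts.append(system_prompt)
    for question, answer in few_shot or []:
        parts.append(f"Q: {question}\nA: {answer}")
    parts.append(user_prompt)
    return "\n\n".join(parts)

# e.g. build_prompt("What is the capital of France?",
#                   system_prompt="You are a helpful geography assistant.",
#                   few_shot=[("What is the capital of Italy?", "Rome")])
```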
Recommended presets:

Factual Q&A:

```json
{
  "temperature": 0.1,
  "topK": 40,
  "maxTokens": 30,
  "smartStop": true,
  "detectTopicDrift": true,
  "validateResponse": true,
  "systemPrompt": "You are a helpful assistant. Provide accurate, concise answers to questions."
}
```

Produces: Accurate, concise, factual responses with high confidence
Creative writing:

```json
{
  "temperature": 0.8,
  "topK": 80,
  "topP": 0.9,
  "maxTokens": 100,
  "repetitionPenalty": 1.1,
  "frequencyPenalty": 0.1,
  "smartStop": true,
  "systemPrompt": "You are a creative writer. Write engaging, imaginative content."
}
```

Produces: Varied, creative, engaging content with controlled repetition
Technical writing:

```json
{
  "temperature": 0.3,
  "topK": 50,
  "maxTokens": 200,
  "repetitionPenalty": 1.2,
  "smartStop": true,
  "validateResponse": true,
  "systemPrompt": "You are a technical writing assistant. Provide clear, precise, well-structured technical content."
}
```

Produces: Clear, precise, well-structured technical content
Conversational assistant:

```json
{
  "temperature": 0.7,
  "topK": 60,
  "topP": 0.8,
  "maxTokens": 50,
  "repetitionPenalty": 1.15,
  "frequencyPenalty": 0.05,
  "smartStop": true,
  "detectTopicDrift": true,
  "systemPrompt": "You are a friendly, helpful conversational assistant."
}
```

Produces: Natural, engaging, conversational responses
Code generation:

```json
{
  "temperature": 0.2,
  "topK": 30,
  "maxTokens": 150,
  "repetitionPenalty": 1.3,
  "smartStop": true,
  "validateResponse": true,
  "systemPrompt": "You are a programming assistant. Write clean, efficient, well-commented code.",
  "stop": ["\n\n", "```", "// End of code"]
}
```

Produces: Clean, efficient, well-structured code
Long-form content:

```json
{
  "temperature": 0.6,
  "topK": 70,
  "topP": 0.85,
  "maxTokens": 300,
  "repetitionPenalty": 1.2,
  "frequencyPenalty": 0.1,
  "presencePenalty": 0.05,
  "smartStop": true,
  "detectTopicDrift": true,
  "systemPrompt": "You are a professional writer. Create well-structured, coherent long-form content."
}
```

Produces: Well-structured, coherent long-form content
```bash
# Simple greedy generation
curl -k -X POST http://localhost:5258/api/Inference/infer \
  -H "Content-Type: application/json" \
  -d '{"prompt":"The capital of France is","maxTokens":32}' | jq

# With sampling parameters
curl -k -X POST http://localhost:5258/api/Inference/infer \
  -H "Content-Type: application/json" \
  -d '{"prompt":"The capital of France is","maxTokens":50,"temperature":0.8,"topK":50}' | jq

# High-quality factual response with penalties
curl -k -X POST http://localhost:5258/api/Inference/infer \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "What is the capital of Germany?",
    "maxTokens": 20,
    "temperature": 0.1,
    "repetitionPenalty": 1.2,
    "smartStop": true,
    "detectTopicDrift": true,
    "validateResponse": true
  }' | jq

# Creative writing with system prompt
curl -k -X POST http://localhost:5258/api/Inference/infer \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Write a short story about a robot",
    "maxTokens": 100,
    "temperature": 0.8,
    "topK": 80,
    "topP": 0.9,
    "repetitionPenalty": 1.1,
    "systemPrompt": "You are a creative writer. Write engaging, imaginative stories.",
    "smartStop": true
  }' | jq

# Technical documentation with few-shot examples
curl -k -X POST http://localhost:5258/api/Inference/infer \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain how to implement a binary search algorithm",
    "maxTokens": 200,
    "temperature": 0.3,
    "topK": 50,
    "repetitionPenalty": 1.2,
    "systemPrompt": "You are a programming instructor. Provide clear, well-structured technical explanations.",
    "useFewShot": true,
    "validateResponse": true
  }' | jq
```

appsettings.json:
```json
{
  "ModelOptions": {
    "ModelPath": "AIModels/gemma-3-270m-it-ONNX/onnx/model_q4.onnx",
    "TokenizerPath": "AIModels/gemma-3-270m-it-ONNX/tokenizer.json",
    "ConfigPath": "AIModels/gemma-3-270m-it-ONNX/config.json"
  },
  "OnnxRuntime": {
    "IntraOpNumThreads": 0,
    "InterOpNumThreads": 0,
    "EnableCpuMemArena": true,
    "ExecutionMode": "Sequential"
  }
}
```
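The `OnnxRuntime` block maps directly onto ONNX Runtime's `SessionOptions`. For reference, the equivalent configuration in Python (the server itself is .NET; this sketch only shows the mapping):

```python
import onnxruntime as ort

options = ort.SessionOptions()
options.intra_op_num_threads = 0     # 0 lets ONNX Runtime choose
options.inter_op_num_threads = 0
options.enable_cpu_mem_arena = True
options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

session = ort.InferenceSession(
    "AIModels/gemma-3-270m-it-ONNX/onnx/model_q4.onnx",
    sess_options=options,
    providers=["CPUExecutionProvider"],
)
```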
- Multiple Sampling Methods: Temperature, Top-K, Top-P (nucleus) sampling
- Repetition Control: Repetition, frequency, and presence penalties
- Smart Stopping: Intelligent stopping at natural boundaries and topic drift detection
- Quality Validation: Confidence scoring and response quality filtering
- KV Caching: Efficient single-token generation after initial pass
- Performance Metrics: Comprehensive duration, confidence, and quality tracking
- System Prompts: Context and instruction guidance for better responses
- Few-Shot Learning: Automatic example injection for improved context
- Stop Strings: Custom stopping conditions for precise control
- Response Post-Processing: Clean formatting and quality enhancement
- Confidence Scoring: Real-time confidence tracking for each generated token
- Topic Drift Detection: Automatic detection and stopping when model goes off-topic
- Response Validation: Multi-metric quality validation and filtering
- Natural Language Processing: Intelligent sentence boundary detection
- Gemma Compatible: Proper SentencePiece token handling
- Special Tokens: Automatic BOS/EOS detection from tokenizer.json
- Text Formatting: Clean output without tokenization artifacts
- Advanced Decoding: Sophisticated token-to-text conversion
- CPU Optimized: Efficient ONNX Runtime configuration
- Memory Efficient: KV cache reuse for long sequences
- Fast Inference: ~1 second for 20 tokens on modern CPU
- Production Ready: Enterprise-grade error handling and logging
- Scalable: Designed for high-throughput inference workloads
- KV Cache: Auto-detected at startup (36 layers for Gemma-3-270M); a schematic decode loop is sketched at the end of this document
- Special Tokens: Automatically read from tokenizer.json added_tokens section
- Model Format: Uses quantized ONNX format (model_q4.onnx) for optimal CPU performance
- Threading: Configurable intra/inter-op threads for performance tuning
- Thread Configuration: Set concrete thread counts in appsettings.json for production
- Request Timeouts: Consider adding request timeouts for long-running generations
- Quality Thresholds: Adjust minConfidence and validation parameters based on use case
- Monitoring: Use confidence scores and quality metrics for response monitoring
- Topic Drift Detection: Automatically prevents off-topic rambling in long generations
- Smart Stopping: Ensures responses end at natural sentence boundaries
- Quality Validation: Multi-layered validation prevents low-quality outputs
- Penalty Tuning: Fine-tune repetition, frequency, and presence penalties for optimal results
- System Prompts: Use descriptive system prompts for better context and consistency
- Few-Shot Examples: Enable for complex tasks requiring specific formatting or style
- Parameter Tuning: Start with recommended settings and adjust based on your specific use case
- Response Monitoring: Track confidence scores and quality metrics to optimize parameters
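To make the KV Cache note above concrete: after the initial full-prompt pass, each subsequent step feeds only the newest token, with the cached per-layer key/value tensors supplying the history. A schematic decode loop (hypothetical `model.forward` signature; not the server's actual code):

```python
def generate_greedy(model, prompt_ids: list[int], max_tokens: int) -> list[int]:
    # Prefill: run the whole prompt once and capture the per-layer KV cache.
    logits, kv_cache = model.forward(prompt_ids, past_kv=None)
    next_id = int(logits[-1].argmax())

    output = []
    for _ in range(max_tokens):
        output.append(next_id)
        # Decode: only the newest token goes through the model; the cache
        # makes each step cheap regardless of how long the sequence grows.
        logits, kv_cache = model.forward([next_id], past_kv=kv_cache)
        next_id = int(logits[-1].argmax())
    return output
```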