CPUInferenceServer API

This server exposes endpoints to run Gemma-3-270M-IT via ONNX Runtime on CPU. It supports advanced text generation with intelligent stopping, quality validation, prompt engineering, and comprehensive sampling controls for production-ready inference.

Base URL

All examples below assume the server is listening at http://localhost:5258; adjust the host and port to match your environment.

Endpoints

1) Token logits

  • Method: POST
  • URL: /api/Inference/infer-tokens
  • Body:
{
  "inputIds": [2, 1234, 5678]
}
  • Response:
{
  "output": [/* flattened logits for [batch=1, seq_len, vocab] */]
}

Example curl:

curl -X POST http://localhost:5258/api/Inference/infer-tokens \
  -H "Content-Type: application/json" \
  -d '{"inputIds":[2,1234,5678]}' | jq

2) Advanced Text Generation

  • Method: POST
  • URL: /api/Inference/infer
  • Body:
{
  "prompt": "The capital of France is",
  "maxTokens": 32,
  "temperature": 0.8,
  "topK": 50,
  "topP": 0.9,
  "repetitionPenalty": 1.2,
  "frequencyPenalty": 0.1,
  "presencePenalty": 0.1,
  "smartStop": true,
  "detectTopicDrift": true,
  "validateResponse": true,
  "systemPrompt": "You are a helpful geography assistant.",
  "useFewShot": true,
  "stop": ["\n\n", "Human:", "User:"]
}
  • Response:
{
  "completion": "The capital of France is Paris.",
  "duration": "1076ms",
  "tokensGenerated": 20,
  "averageConfidence": 0.9531573,
  "qualityValidated": true,
  "stopReason": "max_tokens",
  "topicDriftDetected": false,
  "parameters": {
    "temperature": 0.8,
    "topK": 50,
    "topP": 0.9,
    "repetitionPenalty": 1.2,
    "frequencyPenalty": 0.1,
    "presencePenalty": 0.1,
    "smartStop": true,
    "detectTopicDrift": true,
    "validateResponse": true,
    "systemPrompt": "You are a helpful geography assistant.",
    "useFewShot": true
  }
}

Parameters:

Core Parameters

  • prompt (required): Input text to complete
  • maxTokens (optional): Maximum tokens to generate (default: 64)

Sampling Controls

  • temperature (optional): Sampling temperature (0.0 = greedy, 1.0 = default)
  • topK (optional): Top-K sampling (limit to top K tokens)
  • topP (optional): Top-P (nucleus) sampling (0.0-1.0)

Advanced Generation Controls

  • repetitionPenalty (optional): Reduces repetitive token generation (1.0 = no penalty, >1.0 = reduce repetition)
  • frequencyPenalty (optional): Penalizes frequently used tokens (0.0 = no penalty, >0.0 = reduce frequency)
  • presencePenalty (optional): Penalizes tokens that have already appeared (0.0 = no penalty, >0.0 = reduce presence)

Smart Stopping & Quality

  • smartStop (optional): Enable intelligent stopping at natural boundaries (default: true)
  • detectTopicDrift (optional): Detect when model goes off-topic and stop (default: true)
  • validateResponse (optional): Validate response quality and filter poor outputs (default: true)
  • minConfidence (optional): Minimum confidence threshold for token selection (0.0-1.0)

Prompt Engineering

  • systemPrompt (optional): System prompt to provide context and instructions
  • useFewShot (optional): Enable few-shot examples for better context (default: false)
  • stop (optional): Array of stop strings to halt generation when encountered

Parameter Effects on Generated Text

maxTokens - Response Length Control

  • What it does: Controls how many new tokens the model may generate (tokens are word pieces, not exact words or characters)
  • Effect on output:
    • Low values (5-15): Short, concise responses
    • Medium values (20-50): Detailed answers with context
    • High values (100+): Long-form content, essays, or stories
  • Example: maxTokens: 10 might give "Paris." while maxTokens: 50 gives "Paris, the beautiful capital city of France known for its art, culture, and the Eiffel Tower."

temperature - Creativity vs Consistency

  • What it does: Controls how "creative" or "random" the model's word choices are
  • Effect on output:
    • 0.0 (Greedy): Always picks the most likely word → Very consistent, predictable responses
    • 0.3-0.5: Conservative, factual responses with slight variation
    • 0.7-0.9: Balanced creativity and coherence
    • 1.0+: More creative, varied, and sometimes unexpected responses
  • Example: Same prompt with different temperatures:
    • temperature: 0.0 → "The capital of France is Paris."
    • temperature: 0.8 → "The capital of France is Paris, a beautiful city known for its art and culture."
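
As a rough sketch of how temperature scaling works in any sampler (illustrative; not necessarily this server's exact implementation), the logits are divided by the temperature before the softmax:

import math, random

def sample_with_temperature(logits, temperature):
    # Temperature 0 degenerates to greedy argmax decoding.
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]  # T < 1 sharpens, T > 1 flattens
    m = max(scaled)                             # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(probs)), weights=probs)[0]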

topK - Vocabulary Diversity

  • What it does: Limits the model to only consider the top K most likely words at each step
  • Effect on output:
    • Low values (10-20): Very focused, uses only the most common words
    • Medium values (40-80): Good balance of common and specialized vocabulary
    • High values (100+): More diverse vocabulary, can include rare or technical terms
  • Example:
    • topK: 20 might use simple words like "big city"
    • topK: 80 might use "metropolitan area" or "urban center"
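
A minimal sketch of the standard top-K filter (again illustrative; the server's internals may differ): all but the K highest logits are masked out before sampling.

def top_k_filter(logits, k):
    # Keep the k highest logits; mask the rest to -inf so softmax assigns
    # them zero probability. Ties at the threshold may keep more than k.
    threshold = sorted(logits, reverse=True)[k - 1]
    return [l if l >= threshold else float("-inf") for l in logits]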

topP (Nucleus Sampling) - Dynamic Word Selection

  • What it does: Dynamically selects words based on cumulative probability, adapting to context
  • Effect on output:
    • Low values (0.1-0.3): Very focused, uses only the most probable words
    • Medium values (0.5-0.8): Good balance, adapts to context
    • High values (0.9-0.95): More diverse, considers more word options
  • Example: In technical contexts, lower topP keeps responses precise; in creative contexts, higher topP allows more expressive language
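
Nucleus sampling can be sketched over the softmax probabilities (an illustrative formulation, not confirmed against this server's code):

def top_p_filter(probs, p):
    # Keep the smallest set of tokens whose cumulative probability reaches p,
    # then renormalize; everything outside the nucleus gets zero probability.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = set(), 0.0
    for i in order:
        keep.add(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]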

repetitionPenalty - Repetition Control

  • What it does: Reduces the likelihood of repeating tokens that have already been generated
  • Effect on output:
    • 1.0: No penalty, normal repetition behavior
    • 1.1-1.3: Light penalty, reduces obvious repetitions
    • 1.5-2.0: Strong penalty, prevents repetitive patterns
  • Example: repetitionPenalty: 1.2 prevents "The capital of France is Paris. Paris is the capital. Paris is..." type loops
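
The common CTRL-style formulation, sketched below as an assumption about how this parameter behaves (the README does not specify the exact math), divides positive logits of already-seen tokens by the penalty and multiplies negative ones:

def apply_repetition_penalty(logits, generated_ids, penalty):
    # CTRL-style penalty (assumed formulation): make every token that has
    # already been generated less likely to be picked again.
    out = list(logits)
    for t in set(generated_ids):
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out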

frequencyPenalty - Token Frequency Control

  • What it does: Penalizes tokens based on how frequently they've appeared in the generated text
  • Effect on output:
    • 0.0: No frequency penalty
    • 0.1-0.3: Light penalty, encourages vocabulary diversity
    • 0.5+: Strong penalty, forces more varied word choices
  • Example: Prevents overuse of common words like "the", "is", "and" in long responses (a combined sketch with presencePenalty follows the next section)

presencePenalty - Token Presence Control

  • What it does: Penalizes tokens that have appeared at least once in the generated text
  • Effect on output:
    • 0.0: No presence penalty
    • 0.1-0.3: Light penalty, encourages new vocabulary
    • 0.5+: Strong penalty, forces completely new word choices
  • Example: Strongly discourages reusing words, nudging responses toward more varied vocabulary (it raises the cost of reuse rather than guaranteeing each word appears only once)
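
Both penalties are commonly implemented as additive adjustments over token counts, in the style popularized by the OpenAI API; a combined sketch (assumed formulation, not confirmed for this server):

from collections import Counter

def apply_count_penalties(logits, generated_ids,
                          frequency_penalty, presence_penalty):
    # Assumed additive formulation:
    #   logit[t] -= frequency_penalty * count(t)       (scales with reuse)
    #   logit[t] -= presence_penalty  if count(t) > 0  (flat, once per token)
    counts = Counter(generated_ids)
    out = list(logits)
    for t, c in counts.items():
        out[t] -= frequency_penalty * c + presence_penalty
    return out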

smartStop - Intelligent Stopping

  • What it does: Automatically stops generation at natural sentence boundaries
  • Effect on output:
    • true: Stops at complete sentences, prevents mid-sentence cuts
    • false: Stops exactly at maxTokens, may cut mid-sentence
  • Example: Prevents responses like "The capital of France is Paris. The 16th century was a time of" (incomplete)
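
A deliberately simple boundary heuristic, shown only for intuition (the README describes "intelligent sentence boundary detection" but not its exact rules):

def at_sentence_boundary(text):
    # Treat trailing terminal punctuation as the end of a complete sentence.
    return text.rstrip().endswith((".", "!", "?"))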

detectTopicDrift - Topic Coherence

  • What it does: Detects when the model starts going off-topic and stops generation
  • Effect on output:
    • true: Maintains topic focus, prevents rambling
    • false: Allows free-form generation, may go off-topic
  • Example: Stops when "What is the capital of France?" leads to "The 1980s were a period of significant change..."
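
One way such a check could work, sketched purely for intuition (the server's actual method is not documented here), is a vocabulary-overlap heuristic between the prompt and the most recent output:

def topic_drift(prompt, generated, window=30, min_overlap=0.05):
    # Flag drift when the last `window` words share almost no vocabulary
    # with the prompt. The window size and threshold are invented defaults.
    prompt_words = set(prompt.lower().split())
    recent = generated.lower().split()[-window:]
    if not recent:
        return False
    overlap = sum(1 for w in recent if w in prompt_words) / len(recent)
    return overlap < min_overlap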

validateResponse - Quality Assurance

  • What it does: Validates response quality based on confidence and diversity metrics
  • Effect on output:
    • true: Filters out low-quality responses, ensures coherence
    • false: Returns all responses regardless of quality
  • Example: Rejects responses with very low confidence scores or excessive repetition
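
For intuition, a validation pass could combine the per-token confidences (which the API surfaces as averageConfidence) with a repetition check; the thresholds below are hypothetical, not the server's actual values:

def validate_response(confidences, token_ids,
                      min_confidence=0.35, min_unique_ratio=0.4):
    # Reject outputs with low average confidence or heavy token repetition.
    avg_conf = sum(confidences) / len(confidences)
    unique_ratio = len(set(token_ids)) / len(token_ids)
    return avg_conf >= min_confidence and unique_ratio >= min_unique_ratio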

systemPrompt - Context Enhancement

  • What it does: Provides context and instructions to guide the model's behavior
  • Effect on output:
    • null: No additional context
    • Custom prompt: Guides model to specific behavior patterns
  • Example: "You are a helpful geography assistant" → More focused, helpful responses about geography

useFewShot - Example Learning

  • What it does: Automatically adds relevant examples to improve context understanding
  • Effect on output:
    • false: No examples added
    • true: Adds relevant Q&A examples based on prompt content
  • Example: For "capital" questions, adds "Q: What is the capital of France? A: Paris" example

Recommended Settings by Use Case

Factual Q&A (Information retrieval)

{
  "temperature": 0.1,
  "topK": 40,
  "maxTokens": 30,
  "smartStop": true,
  "detectTopicDrift": true,
  "validateResponse": true,
  "systemPrompt": "You are a helpful assistant. Provide accurate, concise answers to questions."
}

Produces: Accurate, concise, factual responses with high confidence

Creative Writing (Stories, poems, creative content)

{
  "temperature": 0.8,
  "topK": 80,
  "topP": 0.9,
  "maxTokens": 100,
  "repetitionPenalty": 1.1,
  "frequencyPenalty": 0.1,
  "smartStop": true,
  "systemPrompt": "You are a creative writer. Write engaging, imaginative content."
}

Produces: Varied, creative, engaging content with controlled repetition

Technical Documentation (Code, instructions, technical writing)

{
  "temperature": 0.3,
  "topK": 50,
  "maxTokens": 200,
  "repetitionPenalty": 1.2,
  "smartStop": true,
  "validateResponse": true,
  "systemPrompt": "You are a technical writing assistant. Provide clear, precise, well-structured technical content."
}

Produces: Clear, precise, well-structured technical content

Conversational (Chat, casual discussion)

{
  "temperature": 0.7,
  "topK": 60,
  "topP": 0.8,
  "maxTokens": 50,
  "repetitionPenalty": 1.15,
  "frequencyPenalty": 0.05,
  "smartStop": true,
  "detectTopicDrift": true,
  "systemPrompt": "You are a friendly, helpful conversational assistant."
}

Produces: Natural, engaging, conversational responses

Code Generation (Programming, scripts)

{
  "temperature": 0.2,
  "topK": 30,
  "maxTokens": 150,
  "repetitionPenalty": 1.3,
  "smartStop": true,
  "validateResponse": true,
  "systemPrompt": "You are a programming assistant. Write clean, efficient, well-commented code.",
  "stop": ["\n\n", "```", "// End of code"]
}

Produces: Clean, efficient, well-structured code

Long-form Content (Essays, articles, reports)

{
  "temperature": 0.6,
  "topK": 70,
  "topP": 0.85,
  "maxTokens": 300,
  "repetitionPenalty": 1.2,
  "frequencyPenalty": 0.1,
  "presencePenalty": 0.05,
  "smartStop": true,
  "detectTopicDrift": true,
  "systemPrompt": "You are a professional writer. Create well-structured, coherent long-form content."
}

Produces: Well-structured, coherent long-form content

Example API Calls

Basic Generation

# Simple greedy generation
curl -X POST http://localhost:5258/api/Inference/infer \
  -H "Content-Type: application/json" \
  -d '{"prompt":"The capital of France is","maxTokens":32}' | jq

# With sampling parameters
curl -X POST http://localhost:5258/api/Inference/infer \
  -H "Content-Type: application/json" \
  -d '{"prompt":"The capital of France is","maxTokens":50,"temperature":0.8,"topK":50}' | jq

Advanced Generation

# High-quality factual response with penalties
curl -X POST http://localhost:5258/api/Inference/infer \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "What is the capital of Germany?",
    "maxTokens": 20,
    "temperature": 0.1,
    "repetitionPenalty": 1.2,
    "smartStop": true,
    "detectTopicDrift": true,
    "validateResponse": true
  }' | jq

# Creative writing with system prompt
curl -X POST http://localhost:5258/api/Inference/infer \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Write a short story about a robot",
    "maxTokens": 100,
    "temperature": 0.8,
    "topK": 80,
    "topP": 0.9,
    "repetitionPenalty": 1.1,
    "systemPrompt": "You are a creative writer. Write engaging, imaginative stories.",
    "smartStop": true
  }' | jq

# Technical documentation with few-shot examples
curl -X POST http://localhost:5258/api/Inference/infer \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain how to implement a binary search algorithm",
    "maxTokens": 200,
    "temperature": 0.3,
    "topK": 50,
    "repetitionPenalty": 1.2,
    "systemPrompt": "You are a programming instructor. Provide clear, well-structured technical explanations.",
    "useFewShot": true,
    "validateResponse": true
  }' | jq
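
The same calls work from any HTTP client; a minimal Python sketch (assuming the requests package):

import requests

resp = requests.post(
    "http://localhost:5258/api/Inference/infer",
    json={
        "prompt": "What is the capital of Germany?",
        "maxTokens": 20,
        "temperature": 0.1,
        "repetitionPenalty": 1.2,
        "smartStop": True,
        "validateResponse": True,
    },
    timeout=60,  # guard against long-running generations
)
resp.raise_for_status()
body = resp.json()
print(body["completion"])
print(body["tokensGenerated"], "tokens in", body["duration"])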

Configuration

appsettings.json

{
  "ModelOptions": {
    "ModelPath": "AIModels/gemma-3-270m-it-ONNX/onnx/model_q4.onnx",
    "TokenizerPath": "AIModels/gemma-3-270m-it-ONNX/tokenizer.json",
    "ConfigPath": "AIModels/gemma-3-270m-it-ONNX/config.json"
  },
  "OnnxRuntime": {
    "IntraOpNumThreads": 0,
    "InterOpNumThreads": 0,
    "EnableCpuMemArena": true,
    "ExecutionMode": "Sequential"
  }
}

Features

Advanced Text Generation

  • Multiple Sampling Methods: Temperature, Top-K, Top-P (nucleus) sampling
  • Repetition Control: Repetition, frequency, and presence penalties
  • Smart Stopping: Intelligent stopping at natural boundaries and topic drift detection
  • Quality Validation: Confidence scoring and response quality filtering
  • KV Caching: Efficient single-token generation after initial pass
  • Performance Metrics: Comprehensive duration, confidence, and quality tracking

Prompt Engineering

  • System Prompts: Context and instruction guidance for better responses
  • Few-Shot Learning: Automatic example injection for improved context
  • Stop Strings: Custom stopping conditions for precise control
  • Response Post-Processing: Clean formatting and quality enhancement

Quality Assurance

  • Confidence Scoring: Real-time confidence tracking for each generated token
  • Topic Drift Detection: Automatic detection and stopping when model goes off-topic
  • Response Validation: Multi-metric quality validation and filtering
  • Natural Language Processing: Intelligent sentence boundary detection

Tokenizer

  • Gemma Compatible: Proper SentencePiece token handling
  • Special Tokens: Automatic BOS/EOS detection from tokenizer.json
  • Text Formatting: Clean output without tokenization artifacts
  • Advanced Decoding: Sophisticated token-to-text conversion

Performance & Reliability

  • CPU Optimized: Efficient ONNX Runtime configuration
  • Memory Efficient: KV cache reuse for long sequences
  • Fast Inference: ~1 second for 20 tokens on a modern CPU
  • Production Ready: Enterprise-grade error handling and logging
  • Scalable: Designed for high-throughput inference workloads

Notes

Technical Details

  • KV Cache: Auto-detected at startup (36 layers for Gemma-3-270M)
  • Special Tokens: Automatically read from tokenizer.json added_tokens section
  • Model Format: Uses quantized ONNX format (model_q4.onnx) for optimal CPU performance
  • Threading: Configurable intra/inter-op threads for performance tuning

Production Considerations

  • Thread Configuration: Set concrete thread counts in appsettings.json for production
  • Request Timeouts: Consider adding request timeouts for long-running generations
  • Quality Thresholds: Adjust minConfidence and validation parameters based on use case
  • Monitoring: Use confidence scores and quality metrics for response monitoring

Advanced Features

  • Topic Drift Detection: Automatically prevents off-topic rambling in long generations
  • Smart Stopping: Ensures responses end at natural sentence boundaries
  • Quality Validation: Multi-layered validation prevents low-quality outputs
  • Penalty Tuning: Fine-tune repetition, frequency, and presence penalties for optimal results

Best Practices

  • System Prompts: Use descriptive system prompts for better context and consistency
  • Few-Shot Examples: Enable for complex tasks requiring specific formatting or style
  • Parameter Tuning: Start with recommended settings and adjust based on your specific use case
  • Response Monitoring: Track confidence scores and quality metrics to optimize parameters
