feat(screen-intelligence): OCR-only mode without vision model #424

senamakel merged 4 commits into tinyhumansai:main
Conversation
…Panel

- Introduced a new checkbox in the Screen Intelligence Panel to toggle the use of a vision model for richer context extraction from screenshots.
- Updated state management to handle the new option and integrated it into the configuration and processing logic.
- Adjusted related tests and configurations to support the new feature, ensuring compatibility across the application.
- Introduced a new command-line option `--no-vision-model` to allow users to skip the vision model and use OCR and the text LLM only.
- Updated the CLI options parsing to handle the new flag and modified the bootstrap logic to respect this setting.
- Enhanced usage documentation to reflect the new option and its alias `--ocr-only` for clarity.
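A rough sketch of how such flags can be wired with clap; the option struct and field names are assumptions for illustration, not the actual code in `screen_intelligence_cli.rs`:

```rust
use clap::Parser;

/// Hypothetical options for the `run`/`start` subcommands; the real CLI
/// may structure its arguments differently.
#[derive(Parser, Debug)]
struct RunOptions {
    /// Skip the vision LLM pass and rely on OCR plus a text-only LLM.
    /// Also accepted under the alias `--ocr-only`.
    #[arg(long = "no-vision-model", alias = "ocr-only")]
    no_vision_model: bool,
}

fn main() {
    let opts = RunOptions::parse();
    // The flag disables the vision model, so the runtime setting is
    // simply its negation.
    let use_vision_model = !opts.no_vision_model;
    println!("use_vision_model={use_vision_model}");
}
```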
…onfig

The processing worker was reading use_vision_model from the persisted config file (Config::load_or_init), so the CLI --no-vision-model flag had no effect. It now reads from the engine's in-memory runtime config, which the CLI correctly overrides via apply_config(). Also moves image compression before the OCR pass.
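A minimal sketch of the fixed read path, assuming hypothetical Engine and config types; only Config::load_or_init, apply_config, and use_vision_model are identifiers taken from the commit message:

```rust
// All types here are stand-ins for illustration only.
struct ScreenIntelligenceConfig {
    use_vision_model: bool,
}

struct RuntimeConfig {
    screen_intelligence: ScreenIntelligenceConfig,
}

struct Engine {
    // Patched in place by the CLI (via something like apply_config()),
    // unlike the persisted file that Config::load_or_init would re-read.
    config: RuntimeConfig,
}

fn use_vision_model(engine: &Engine) -> bool {
    // After the fix: consult the in-memory runtime config, so CLI
    // overrides such as --no-vision-model take effect.
    engine.config.screen_intelligence.use_vision_model
}

fn main() {
    let engine = Engine {
        config: RuntimeConfig {
            screen_intelligence: ScreenIntelligenceConfig { use_vision_model: false },
        },
    };
    assert!(!use_vision_model(&engine));
}
```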
Add the new use_vision_model field to all AccessibilityConfig test fixtures so TypeScript compilation passes. Also includes rustfmt auto-fix for screen_intelligence_cli.rs.
📝 Walkthrough

This pull request introduces a new `use_vision_model` configuration flag for screen intelligence, enabling an OCR-only mode that skips the vision LLM pass.
Sequence Diagram

```mermaid
sequenceDiagram
participant User as Frontend User
participant UI as Settings UI
participant RPC as RPC Layer
participant Config as Config Service
participant Engine as Accessibility Engine
participant Worker as Processing Worker
User->>UI: Toggle "Use Vision Model"
UI->>RPC: Call openhumanUpdateScreenIntelligenceSettings(use_vision_model)
RPC->>Config: apply_screen_intelligence_settings(use_vision_model)
Config->>Config: Update ScreenIntelligenceConfig.use_vision_model
Config->>Engine: Reload config into global engine
Note over Engine: Config updated with new flag
Worker->>Engine: Read use_vision_model from engine.config
alt use_vision_model == true
Worker->>Worker: Run Pass 2: Vision LLM context
Worker->>Worker: confidence = 0.9
else use_vision_model == false
Worker->>Worker: Skip Pass 2 (vision disabled)
Worker->>Worker: confidence = 0.75
end
Worker->>Worker: Return VisionSummary with conditional processing
```
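The worker-side branch can be pictured as in the following sketch; VisionSummary's shape and the helper function are assumptions, while the skipped pass and the 0.9 / 0.75 confidence values mirror the diagram above:

```rust
// Illustrative only; the real processing_worker.rs logic is richer.
struct VisionSummary {
    text: String,
    confidence: f32,
}

fn summarize(ocr_text: &str, use_vision_model: bool) -> VisionSummary {
    if use_vision_model {
        // Pass 2: the vision LLM enriches the OCR output.
        let enriched = format!("[vision context] {ocr_text}");
        VisionSummary { text: enriched, confidence: 0.9 }
    } else {
        // Vision disabled: OCR text feeds the text-only synthesis directly.
        VisionSummary { text: ocr_text.to_string(), confidence: 0.75 }
    }
}

fn main() {
    let summary = summarize("Login screen", false);
    assert_eq!(summary.confidence, 0.75);
}
```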
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 3 passed
🧹 Nitpick comments (2)
src/openhuman/config/schemas.rs (1)
270-305: Missing `use_vision_model` in controller schema inputs.

The `ScreenIntelligenceSettingsUpdate` struct now includes `use_vision_model`, and the handler correctly forwards it to the patch. However, the controller schema metadata in `schemas("update_screen_intelligence_settings")` does not include this field in its `inputs` vector. While the RPC will still work (serde deserializes based on the struct), the schema metadata used for documentation and introspection will be incomplete.
♻️ Proposed fix to add the missing schema input
optional_bool("vision_enabled", "Enable vision analysis."), optional_bool("autocomplete_enabled", "Enable autocomplete integration."), + optional_bool("use_vision_model", "Use vision LLM for frame analysis (false = OCR-only mode)."), optional_bool("keep_screenshots", "Keep screenshots on disk after vision processing."), FieldSchema { name: "allowlist",🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/config/schemas.rs` around lines 270-305: The controller schema for "update_screen_intelligence_settings" is missing the "use_vision_model" input, causing documentation/schema introspection to be incomplete; add an optional_bool field named "use_vision_model" with an appropriate comment (e.g., "Use the vision model for analysis.") into the inputs vector of the ControllerSchema for "update_screen_intelligence_settings" so the metadata matches the ScreenIntelligenceSettingsUpdate struct and the handler forwarding logic.

src/openhuman/screen_intelligence/processing_worker.rs (1)
333-340: Consider: fallback text produces leading newlines when there is no vision context.

When `use_vision_model=false`, `fallback_text` is `""`. If synthesis fails, line 339 produces `"\n\n{ocr_truncated}"` with leading blank lines.

Suggested tweak
```diff
-let fallback_text = vision_context.as_deref().unwrap_or("");
 let synthesis = service
     .prompt(&config, &synthesis_prompt, Some(700), true)
     .await
     .unwrap_or_else(|e| {
         tracing::debug!("[processing_worker] synthesis failed, using fallback: {e}");
-        format!("{}\n\n{}", fallback_text, ocr_truncated)
+        match &vision_context {
+            Some(vc) => format!("{}\n\n{}", vc, ocr_truncated),
+            None => ocr_truncated.to_string(),
+        }
     });
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/screen_intelligence/processing_worker.rs` around lines 333-340: The fallback construction can produce leading newlines when vision_context is None because fallback_text is "", so adjust the synthesis error handler (the closure for service.prompt(...).unwrap_or_else) to check fallback_text and avoid prepending "\n\n" when it's empty; i.e., use conditional logic around fallback_text (the variable defined above) to either return ocr_truncated alone or format!("{}\n\n{}", fallback_text, ocr_truncated) when fallback_text is non-empty so you don't get leading blank lines.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7c8fbc95-e74f-4066-804f-f7547b921c71
📒 Files selected for processing (14)
- app/src/components/intelligence/__tests__/ScreenIntelligenceDebugPanel.test.tsx
- app/src/components/settings/panels/ScreenIntelligencePanel.tsx
- app/src/components/settings/panels/__tests__/AccessibilityPanel.test.tsx
- app/src/components/settings/panels/__tests__/ScreenIntelligencePanel.test.tsx
- app/src/pages/onboarding/steps/__tests__/ScreenPermissionsStep.test.tsx
- app/src/services/__tests__/coreRpcClient.test.ts
- app/src/store/__tests__/accessibilitySlice.test.ts
- app/src/utils/tauriCommands/accessibility.ts
- app/src/utils/tauriCommands/config.ts
- src/core/screen_intelligence_cli.rs
- src/openhuman/config/ops.rs
- src/openhuman/config/schema/accessibility.rs
- src/openhuman/config/schemas.rs
- src/openhuman/screen_intelligence/processing_worker.rs
Summary
- New `use_vision_model` config flag (default: `true`) for screen intelligence. When `false`, the vision LLM pass is skipped — only Apple Vision OCR feeds into a text-only synthesis LLM. No vision-capable model required, faster processing.
- New `--no-vision-model` / `--ocr-only` CLI flags for the `openhuman screen-intelligence run` and `start` subcommands.
- Bug fix: the worker read `use_vision_model` from the persisted config file instead of the engine's runtime config, so CLI overrides had no effect.

Changes
- `src/openhuman/config/schema/accessibility.rs` — New `use_vision_model` field on `ScreenIntelligenceConfig` (see the sketch after this list)
- `src/openhuman/screen_intelligence/processing_worker.rs` — Reads flag from engine runtime config; skips vision LLM pass when false; uses OCR-only synthesis prompt; compression runs unconditionally before OCR
- `src/openhuman/config/ops.rs` + `schemas.rs` — Wired through settings patch and RPC handler
- `src/core/screen_intelligence_cli.rs` — `--no-vision-model` / `--ocr-only` flags on `run` and `start`; updated help text and doctor output
- `app/src/utils/tauriCommands/` — TS types updated
- `app/src/components/settings/panels/ScreenIntelligencePanel.tsx` — UI toggle
- Added `use_vision_model` to all `AccessibilityConfig` mocks
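A minimal sketch of a default-true serde field, assuming the usual `#[serde(default)]` pattern; the real `ScreenIntelligenceConfig` in `accessibility.rs` has more fields and may be declared differently:

```rust
use serde::{Deserialize, Serialize};

fn default_true() -> bool {
    true
}

// Stand-in for the real struct; only the field name comes from this PR.
#[derive(Debug, Serialize, Deserialize)]
struct ScreenIntelligenceConfig {
    /// When false, skip the vision LLM pass and synthesize from OCR only.
    #[serde(default = "default_true")]
    use_vision_model: bool,
}

fn main() {
    // Older config files without the field deserialize to the default (true).
    let cfg: ScreenIntelligenceConfig = serde_json::from_str("{}").unwrap();
    assert!(cfg.use_vision_model);
}
```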
Test plan

- `cargo check` passes
- `tsc --noEmit` passes
- `cargo run screen-intelligence run --no-vision-model -v` skips vision LLM, logs `use_vision_model=false`
- `cargo run screen-intelligence run -v` (without flag) uses vision LLM as before
- `cargo run screen-intelligence doctor` shows `use_vision_model` in config output

Summary by CodeRabbit
New Features
- `--ocr-only` / `--no-vision-model` command-line options to disable the vision model on startup.