
feat(screen-intelligence): OCR-only mode without vision model#424

Merged
senamakel merged 4 commits into tinyhumansai:main from senamakel:feat/vision-model
Apr 8, 2026

Conversation

@senamakel
Member

senamakel commented Apr 8, 2026

Summary

  • Adds a use_vision_model config flag (default: true) to screen intelligence. When false, the vision LLM pass is skipped — only Apple Vision OCR feeds into a text-only synthesis LLM. No vision-capable model required, faster processing.
  • Adds --no-vision-model / --ocr-only CLI flags for openhuman screen-intelligence run and start subcommands.
  • Adds "Use Vision Model" toggle to the Screen Intelligence settings panel in the UI.
  • Fixes a bug where the processing worker read use_vision_model from the persisted config file instead of the engine's runtime config, so CLI overrides had no effect.
  • Image compression always runs regardless of the flag (before OCR pass).
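The conditional skip described above can be sketched as follows. This is a minimal illustration, not the actual openhuman code: `ScreenIntelligenceConfig` and `analyze_frame` are heavily simplified stand-ins; only the flag name and the 0.9/0.75 confidence values come from the PR.

```rust
// Illustrative sketch of OCR-only mode; types and logic are simplified
// stand-ins for the real processing worker.
struct ScreenIntelligenceConfig {
    use_vision_model: bool,
}

fn analyze_frame(config: &ScreenIntelligenceConfig, ocr_text: &str) -> (String, f32) {
    // OCR (Pass 1) has already run; compression happens before it either way.
    if config.use_vision_model {
        // Pass 2: vision LLM context generation (stubbed here).
        let vision_context = format!("vision summary of: {ocr_text}");
        (format!("{vision_context}\n\n{ocr_text}"), 0.9)
    } else {
        // OCR-only mode: skip Pass 2 and report lower confidence.
        (ocr_text.to_string(), 0.75)
    }
}

fn main() {
    let cfg = ScreenIntelligenceConfig { use_vision_model: false };
    let (summary, confidence) = analyze_frame(&cfg, "Login - Safari");
    assert_eq!(confidence, 0.75);
    assert_eq!(summary, "Login - Safari");
    println!("ocr-only summary: {summary} ({confidence})");
}
```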

Changes

  • src/openhuman/config/schema/accessibility.rs — New use_vision_model field on ScreenIntelligenceConfig
  • src/openhuman/screen_intelligence/processing_worker.rs — Reads flag from engine runtime config; skips vision LLM pass when false; uses OCR-only synthesis prompt; compression runs unconditionally before OCR
  • src/openhuman/config/ops.rs + schemas.rs — Wired through settings patch and RPC handler
  • src/core/screen_intelligence_cli.rs — --no-vision-model / --ocr-only flags on run and start; updated help text and doctor output
  • app/src/utils/tauriCommands/ — TS types updated
  • app/src/components/settings/panels/ScreenIntelligencePanel.tsx — UI toggle
  • Test fixtures — Added use_vision_model to all AccessibilityConfig mocks

Test plan

  • cargo check passes
  • tsc --noEmit passes
  • cargo run screen-intelligence run --no-vision-model -v skips vision LLM, logs use_vision_model=false
  • cargo run screen-intelligence run -v (without flag) uses vision LLM as before
  • Settings panel toggle persists and takes effect on next capture
  • cargo run screen-intelligence doctor shows use_vision_model in config output

Summary by CodeRabbit

New Features

  • Added "Use Vision Model" toggle in Screen Intelligence settings to control whether visual analysis runs.
  • Added --ocr-only / --no-vision-model command-line options to disable vision model on startup.

…Panel

- Introduced a new checkbox in the Screen Intelligence Panel to toggle the use of a vision model for richer context extraction from screenshots.
- Updated state management to handle the new option and integrated it into the configuration and processing logic.
- Adjusted related tests and configurations to support the new feature, ensuring compatibility across the application.
- Introduced a new command-line option `--no-vision-model` to allow users to skip the vision model and use OCR and text LLM only.
- Updated the CLI options parsing to handle the new flag and modified the bootstrap logic to respect this setting.
- Enhanced usage documentation to reflect the new option and its alias `--ocr-only` for clarity.
…onfig

The processing worker was reading use_vision_model from the persisted
config file (Config::load_or_init), so the CLI --no-vision-model flag
had no effect. Now reads from the engine's in-memory runtime config
which the CLI correctly overrides via apply_config(). Also moves image
compression before OCR pass.
Add the new use_vision_model field to all AccessibilityConfig test
fixtures so TypeScript compilation passes. Also includes rustfmt
auto-fix for screen_intelligence_cli.rs.
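The config-read fix above can be illustrated with a minimal sketch. `Engine`, `RuntimeConfig`, and `persisted_default` are hypothetical stand-ins for the real types; only `apply_config` and `Config::load_or_init` are named in the PR.

```rust
// Sketch of the bug: reading the persisted config silently discards
// CLI overrides that were applied to the engine's runtime config.
#[derive(Clone)]
struct RuntimeConfig {
    use_vision_model: bool,
}

struct Engine {
    config: RuntimeConfig,
}

impl Engine {
    // Stand-in for apply_config(): CLI flags mutate the in-memory config.
    fn apply_config(&mut self, no_vision_model: bool) {
        if no_vision_model {
            self.config.use_vision_model = false;
        }
    }
}

// Stand-in for Config::load_or_init(): always yields the on-disk value.
fn persisted_default() -> RuntimeConfig {
    RuntimeConfig { use_vision_model: true }
}

fn main() {
    let mut engine = Engine { config: persisted_default() };
    engine.apply_config(true); // --no-vision-model passed on the CLI

    // Buggy read: re-loads the persisted file, so the override is lost.
    let buggy = persisted_default().use_vision_model;
    // Fixed read: the engine's runtime config reflects the CLI override.
    let fixed = engine.config.use_vision_model;
    assert!(buggy && !fixed);
    println!("persisted={buggy}, runtime={fixed}");
}
```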
@coderabbitai
Contributor

coderabbitai Bot commented Apr 8, 2026

📝 Walkthrough

This pull request introduces a new use_vision_model configuration flag throughout the system. The flag is added to frontend settings UI, backend configuration schemas, CLI options, and the screen intelligence processing pipeline, enabling runtime control over whether vision model processing executes during frame analysis.

Changes

  • Frontend Settings UI (app/src/components/settings/panels/ScreenIntelligencePanel.tsx) — Added useVisionModel state and a new "Use Vision Model" checkbox in the Screen Intelligence Policy settings, wired to read from and persist config.use_vision_model.
  • Frontend Test Fixtures (app/src/components/intelligence/__tests__/ScreenIntelligenceDebugPanel.test.tsx, app/src/components/settings/panels/__tests__/AccessibilityPanel.test.tsx, app/src/components/settings/panels/__tests__/ScreenIntelligencePanel.test.tsx, app/src/pages/onboarding/steps/__tests__/ScreenPermissionsStep.test.tsx, app/src/services/__tests__/coreRpcClient.test.ts, app/src/store/__tests__/accessibilitySlice.test.ts) — Updated test fixtures across multiple test suites to include use_vision_model: true in the mocked AccessibilityStatus configuration, ensuring test baseline data reflects the new config field.
  • TypeScript RPC & Config Types (app/src/utils/tauriCommands/accessibility.ts, app/src/utils/tauriCommands/config.ts) — Extended the AccessibilityConfig interface with a use_vision_model: boolean field and added an optional use_vision_model?: boolean | null parameter to ScreenIntelligenceSettingsUpdate for the RPC payload.
  • Rust Config Schema (src/openhuman/config/schema/accessibility.rs) — Added a use_vision_model: bool field to ScreenIntelligenceConfig with the serde default and Default implementation both returning true; introduced a default_use_vision_model() helper function.
  • Rust Config Operations & Serialization (src/openhuman/config/ops.rs, src/openhuman/config/schemas.rs) — Extended ScreenIntelligenceSettingsPatch to include an optional use_vision_model field and updated apply_screen_intelligence_settings to apply the flag to the engine's configuration when provided.
  • Rust CLI & Server Bootstrap (src/core/screen_intelligence_cli.rs) — Added --no-vision-model / --ocr-only CLI flags to CliOpts, introduced a bootstrap_engine_with_opts(verbose, no_vision_model) helper, refactored bootstrap_engine to delegate to it with false, and propagated no_vision_model into config mutation and startup logging; updated help/usage strings and run_doctor output to include the flag.
  • Vision Processing Pipeline (src/openhuman/screen_intelligence/processing_worker.rs) — Modified analyze_frame to read use_vision_model from the engine config; conditionally skips Pass 2 (vision LLM context generation) when disabled, selects the synthesis prompt based on vision context availability, adjusts VisionSummary confidence from 0.9 to 0.75 when vision is disabled, and relocates image compression before Pass 1.
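The Option-based patch application described for ops.rs can be sketched as follows. The types here are simplified stand-ins; only the field and function names follow the walkthrough, and the real structs carry many more fields.

```rust
// Sketch of applying an optional settings patch: None means the RPC
// payload omitted the field, so the existing value is left untouched
// (mirroring use_vision_model?: boolean | null on the TypeScript side).
struct ScreenIntelligenceConfig {
    use_vision_model: bool,
}

#[derive(Default)]
struct ScreenIntelligenceSettingsPatch {
    use_vision_model: Option<bool>,
}

fn apply_screen_intelligence_settings(
    config: &mut ScreenIntelligenceConfig,
    patch: &ScreenIntelligenceSettingsPatch,
) {
    // Only touch the flag when the patch actually carried it.
    if let Some(v) = patch.use_vision_model {
        config.use_vision_model = v;
    }
}

fn main() {
    let mut config = ScreenIntelligenceConfig { use_vision_model: true };

    // Patch without the field: config is unchanged.
    apply_screen_intelligence_settings(&mut config, &ScreenIntelligenceSettingsPatch::default());
    assert!(config.use_vision_model);

    // Patch carrying false: OCR-only mode from the next capture on.
    let patch = ScreenIntelligenceSettingsPatch { use_vision_model: Some(false) };
    apply_screen_intelligence_settings(&mut config, &patch);
    assert!(!config.use_vision_model);
}
```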

Sequence Diagram

sequenceDiagram
    participant User as Frontend User
    participant UI as Settings UI
    participant RPC as RPC Layer
    participant Config as Config Service
    participant Engine as Accessibility Engine
    participant Worker as Processing Worker

    User->>UI: Toggle "Use Vision Model"
    UI->>RPC: Call openhumanUpdateScreenIntelligenceSettings(use_vision_model)
    RPC->>Config: apply_screen_intelligence_settings(use_vision_model)
    Config->>Config: Update ScreenIntelligenceConfig.use_vision_model
    Config->>Engine: Reload config into global engine
    Note over Engine: Config updated with new flag
    
    Worker->>Engine: Read use_vision_model from engine.config
    alt use_vision_model == true
        Worker->>Worker: Run Pass 2: Vision LLM context
        Worker->>Worker: confidence = 0.9
    else use_vision_model == false
        Worker->>Worker: Skip Pass 2 (vision disabled)
        Worker->>Worker: confidence = 0.75
    end
    Worker->>Worker: Return VisionSummary with conditional processing

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • graycyrus

Poem

🐰 A toggle appears, oh what a sight!
Vision on, or OCR-only might,
The model now bows to config's decree,
Processing pipes flow wild and free,
One flag to rule the sight we see!

🚥 Pre-merge checks — 3 passed

  • Description Check ✅ Passed — Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed — The title clearly and concisely summarizes the main feature: adding an OCR-only mode without a vision model to the screen intelligence system.
  • Docstring Coverage ✅ Passed — Docstring coverage is 100.00%, which meets the required threshold of 80.00%.


Contributor

coderabbitai Bot left a comment


🧹 Nitpick comments (2)
src/openhuman/config/schemas.rs (1)

270-305: Missing use_vision_model in controller schema inputs.

The ScreenIntelligenceSettingsUpdate struct now includes use_vision_model, and the handler correctly forwards it to the patch. However, the controller schema metadata in schemas("update_screen_intelligence_settings") does not include this field in its inputs vector.

While the RPC will still work (serde deserializes based on the struct), the schema metadata used for documentation and introspection will be incomplete.

♻️ Proposed fix to add the missing schema input
             optional_bool("vision_enabled", "Enable vision analysis."),
             optional_bool("autocomplete_enabled", "Enable autocomplete integration."),
+            optional_bool("use_vision_model", "Use vision LLM for frame analysis (false = OCR-only mode)."),
             optional_bool("keep_screenshots", "Keep screenshots on disk after vision processing."),
             FieldSchema {
                 name: "allowlist",
src/openhuman/screen_intelligence/processing_worker.rs (1)

333-340: Consider: Fallback text produces leading newlines when no vision context.

When use_vision_model=false, fallback_text is "". If synthesis fails, line 339 produces "\n\n{ocr_truncated}" with leading blank lines.

Suggested tweak
-    let fallback_text = vision_context.as_deref().unwrap_or("");
     let synthesis = service
         .prompt(&config, &synthesis_prompt, Some(700), true)
         .await
         .unwrap_or_else(|e| {
             tracing::debug!("[processing_worker] synthesis failed, using fallback: {e}");
-            format!("{}\n\n{}", fallback_text, ocr_truncated)
+            match &vision_context {
+                Some(vc) => format!("{}\n\n{}", vc, ocr_truncated),
+                None => ocr_truncated.to_string(),
+            }
         });


📥 Commits

Reviewing files that changed from the base of the PR and between 3034ec1 and 7fb8506.

📒 Files selected for processing (14)
  • app/src/components/intelligence/__tests__/ScreenIntelligenceDebugPanel.test.tsx
  • app/src/components/settings/panels/ScreenIntelligencePanel.tsx
  • app/src/components/settings/panels/__tests__/AccessibilityPanel.test.tsx
  • app/src/components/settings/panels/__tests__/ScreenIntelligencePanel.test.tsx
  • app/src/pages/onboarding/steps/__tests__/ScreenPermissionsStep.test.tsx
  • app/src/services/__tests__/coreRpcClient.test.ts
  • app/src/store/__tests__/accessibilitySlice.test.ts
  • app/src/utils/tauriCommands/accessibility.ts
  • app/src/utils/tauriCommands/config.ts
  • src/core/screen_intelligence_cli.rs
  • src/openhuman/config/ops.rs
  • src/openhuman/config/schema/accessibility.rs
  • src/openhuman/config/schemas.rs
  • src/openhuman/screen_intelligence/processing_worker.rs

@senamakel senamakel merged commit cd2a4a9 into tinyhumansai:main Apr 8, 2026
8 of 9 checks passed
