feat(screen-intelligence): OCR-only mode without vision model #424

senamakel merged 4 commits into tinyhumansai:main
Conversation
…Panel

- Introduced a new checkbox in the Screen Intelligence Panel to toggle the use of a vision model for richer context extraction from screenshots.
- Updated state management to handle the new option and integrated it into the configuration and processing logic.
- Adjusted related tests and configurations to support the new feature, ensuring compatibility across the application.
- Introduced a new command-line option `--no-vision-model` to allow users to skip the vision model and use OCR and the text LLM only.
- Updated the CLI options parsing to handle the new flag and modified the bootstrap logic to respect this setting.
- Enhanced usage documentation to reflect the new option and its alias `--ocr-only` for clarity.
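A rough sketch of how such flags can be wired with clap; the option struct and field names are assumptions for illustration, not the actual code in `screen_intelligence_cli.rs`:

```rust
use clap::Parser;

/// Hypothetical options for the `run`/`start` subcommands; the real CLI
/// may structure its arguments differently.
#[derive(Parser, Debug)]
struct RunOptions {
    /// Skip the vision LLM pass and rely on OCR plus a text-only LLM.
    /// Also accepted under the alias `--ocr-only`.
    #[arg(long = "no-vision-model", alias = "ocr-only")]
    no_vision_model: bool,
}

fn main() {
    let opts = RunOptions::parse();
    // The flag disables the vision model, so the runtime setting is
    // simply its negation.
    let use_vision_model = !opts.no_vision_model;
    println!("use_vision_model={use_vision_model}");
}
```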
…onfig

The processing worker was reading use_vision_model from the persisted config file (Config::load_or_init), so the CLI --no-vision-model flag had no effect. It now reads from the engine's in-memory runtime config, which the CLI correctly overrides via apply_config(). Also moves image compression before the OCR pass.
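A minimal sketch of the fixed read path, assuming hypothetical Engine and config types; only Config::load_or_init, apply_config, and use_vision_model are identifiers taken from the commit message:

```rust
// All types here are stand-ins for illustration only.
struct ScreenIntelligenceConfig {
    use_vision_model: bool,
}

struct RuntimeConfig {
    screen_intelligence: ScreenIntelligenceConfig,
}

struct Engine {
    // Patched in place by the CLI (via something like apply_config()),
    // unlike the persisted file that Config::load_or_init would re-read.
    config: RuntimeConfig,
}

fn use_vision_model(engine: &Engine) -> bool {
    // After the fix: consult the in-memory runtime config, so CLI
    // overrides such as --no-vision-model take effect.
    engine.config.screen_intelligence.use_vision_model
}

fn main() {
    let engine = Engine {
        config: RuntimeConfig {
            screen_intelligence: ScreenIntelligenceConfig { use_vision_model: false },
        },
    };
    assert!(!use_vision_model(&engine));
}
```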
Add the new use_vision_model field to all AccessibilityConfig test fixtures so TypeScript compilation passes. Also includes rustfmt auto-fix for screen_intelligence_cli.rs.
📝 Walkthrough

This pull request introduces a new `use_vision_model` configuration flag for screen intelligence, enabling an OCR-only mode that skips the vision LLM pass.
Sequence Diagram

```mermaid
sequenceDiagram
participant User as Frontend User
participant UI as Settings UI
participant RPC as RPC Layer
participant Config as Config Service
participant Engine as Accessibility Engine
participant Worker as Processing Worker
User->>UI: Toggle "Use Vision Model"
UI->>RPC: Call openhumanUpdateScreenIntelligenceSettings(use_vision_model)
RPC->>Config: apply_screen_intelligence_settings(use_vision_model)
Config->>Config: Update ScreenIntelligenceConfig.use_vision_model
Config->>Engine: Reload config into global engine
Note over Engine: Config updated with new flag
Worker->>Engine: Read use_vision_model from engine.config
alt use_vision_model == true
Worker->>Worker: Run Pass 2: Vision LLM context
Worker->>Worker: confidence = 0.9
else use_vision_model == false
Worker->>Worker: Skip Pass 2 (vision disabled)
Worker->>Worker: confidence = 0.75
end
Worker->>Worker: Return VisionSummary with conditional processing
```
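The worker-side branch can be pictured as in the following sketch; VisionSummary's shape and the helper function are assumptions, while the skipped pass and the 0.9 / 0.75 confidence values mirror the diagram above:

```rust
// Illustrative only; the real processing_worker.rs logic is richer.
struct VisionSummary {
    text: String,
    confidence: f32,
}

fn summarize(ocr_text: &str, use_vision_model: bool) -> VisionSummary {
    if use_vision_model {
        // Pass 2: the vision LLM enriches the OCR output.
        let enriched = format!("[vision context] {ocr_text}");
        VisionSummary { text: enriched, confidence: 0.9 }
    } else {
        // Vision disabled: OCR text feeds the text-only synthesis directly.
        VisionSummary { text: ocr_text.to_string(), confidence: 0.75 }
    }
}

fn main() {
    let summary = summarize("Login screen", false);
    assert_eq!(summary.confidence, 0.75);
}
```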
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 3 passed
🧹 Nitpick comments (2)
src/openhuman/config/schemas.rs (1)
270-305: Missing `use_vision_model` in controller schema inputs.

The `ScreenIntelligenceSettingsUpdate` struct now includes `use_vision_model`, and the handler correctly forwards it to the patch. However, the controller schema metadata in `schemas("update_screen_intelligence_settings")` does not include this field in its `inputs` vector. While the RPC will still work (serde deserializes based on the struct), the schema metadata used for documentation and introspection will be incomplete.
♻️ Proposed fix to add the missing schema input
optional_bool("vision_enabled", "Enable vision analysis."), optional_bool("autocomplete_enabled", "Enable autocomplete integration."), + optional_bool("use_vision_model", "Use vision LLM for frame analysis (false = OCR-only mode)."), optional_bool("keep_screenshots", "Keep screenshots on disk after vision processing."), FieldSchema { name: "allowlist",🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/config/schemas.rs` around lines 270-305: The controller schema for "update_screen_intelligence_settings" is missing the "use_vision_model" input, causing documentation/schema introspection to be incomplete; add an optional_bool field named "use_vision_model" with an appropriate comment (e.g., "Use the vision model for analysis.") into the inputs vector of the ControllerSchema for "update_screen_intelligence_settings" so the metadata matches the ScreenIntelligenceSettingsUpdate struct and the handler forwarding logic.

src/openhuman/screen_intelligence/processing_worker.rs (1)
333-340: Consider: fallback text produces leading newlines when there is no vision context.

When `use_vision_model=false`, `fallback_text` is `""`. If synthesis fails, line 339 produces `"\n\n{ocr_truncated}"` with leading blank lines.

Suggested tweak
```diff
-let fallback_text = vision_context.as_deref().unwrap_or("");
 let synthesis = service
     .prompt(&config, &synthesis_prompt, Some(700), true)
     .await
     .unwrap_or_else(|e| {
         tracing::debug!("[processing_worker] synthesis failed, using fallback: {e}");
-        format!("{}\n\n{}", fallback_text, ocr_truncated)
+        match &vision_context {
+            Some(vc) => format!("{}\n\n{}", vc, ocr_truncated),
+            None => ocr_truncated.to_string(),
+        }
     });
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/openhuman/screen_intelligence/processing_worker.rs` around lines 333-340: The fallback construction can produce leading newlines when vision_context is None because fallback_text is "", so adjust the synthesis error handler (the closure for service.prompt(...).unwrap_or_else) to check fallback_text and avoid prepending "\n\n" when it's empty; i.e., use conditional logic around fallback_text (the variable defined above) to either return ocr_truncated alone or format!("{}\n\n{}", fallback_text, ocr_truncated) when fallback_text is non-empty so you don't get leading blank lines.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7c8fbc95-e74f-4066-804f-f7547b921c71
📒 Files selected for processing (14)
- app/src/components/intelligence/__tests__/ScreenIntelligenceDebugPanel.test.tsx
- app/src/components/settings/panels/ScreenIntelligencePanel.tsx
- app/src/components/settings/panels/__tests__/AccessibilityPanel.test.tsx
- app/src/components/settings/panels/__tests__/ScreenIntelligencePanel.test.tsx
- app/src/pages/onboarding/steps/__tests__/ScreenPermissionsStep.test.tsx
- app/src/services/__tests__/coreRpcClient.test.ts
- app/src/store/__tests__/accessibilitySlice.test.ts
- app/src/utils/tauriCommands/accessibility.ts
- app/src/utils/tauriCommands/config.ts
- src/core/screen_intelligence_cli.rs
- src/openhuman/config/ops.rs
- src/openhuman/config/schema/accessibility.rs
- src/openhuman/config/schemas.rs
- src/openhuman/screen_intelligence/processing_worker.rs
Summary
- New `use_vision_model` config flag (default: `true`) for screen intelligence. When `false`, the vision LLM pass is skipped — only Apple Vision OCR feeds into a text-only synthesis LLM. No vision-capable model required, faster processing.
- New `--no-vision-model` / `--ocr-only` CLI flags for the `openhuman screen-intelligence run` and `start` subcommands.
- Bug fix: the worker read `use_vision_model` from the persisted config file instead of the engine's runtime config, so CLI overrides had no effect.

Changes
- `src/openhuman/config/schema/accessibility.rs` — New `use_vision_model` field on `ScreenIntelligenceConfig` (see the sketch after this list)
- `src/openhuman/screen_intelligence/processing_worker.rs` — Reads flag from engine runtime config; skips vision LLM pass when false; uses OCR-only synthesis prompt; compression runs unconditionally before OCR
- `src/openhuman/config/ops.rs` + `schemas.rs` — Wired through settings patch and RPC handler
- `src/core/screen_intelligence_cli.rs` — `--no-vision-model` / `--ocr-only` flags on `run` and `start`; updated help text and doctor output
- `app/src/utils/tauriCommands/` — TS types updated
- `app/src/components/settings/panels/ScreenIntelligencePanel.tsx` — UI toggle
- Added `use_vision_model` to all `AccessibilityConfig` mocks
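A minimal sketch of a default-true serde field, assuming the usual `#[serde(default)]` pattern; the real `ScreenIntelligenceConfig` in `accessibility.rs` has more fields and may be declared differently:

```rust
use serde::{Deserialize, Serialize};

fn default_true() -> bool {
    true
}

// Stand-in for the real struct; only the field name comes from this PR.
#[derive(Debug, Serialize, Deserialize)]
struct ScreenIntelligenceConfig {
    /// When false, skip the vision LLM pass and synthesize from OCR only.
    #[serde(default = "default_true")]
    use_vision_model: bool,
}

fn main() {
    // Older config files without the field deserialize to the default (true).
    let cfg: ScreenIntelligenceConfig = serde_json::from_str("{}").unwrap();
    assert!(cfg.use_vision_model);
}
```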
Test plan

- `cargo check` passes
- `tsc --noEmit` passes
- `cargo run screen-intelligence run --no-vision-model -v` skips vision LLM, logs `use_vision_model=false`
- `cargo run screen-intelligence run -v` (without flag) uses vision LLM as before
- `cargo run screen-intelligence doctor` shows `use_vision_model` in config output

Summary by CodeRabbit
New Features
- `--ocr-only` / `--no-vision-model` command-line options to disable the vision model on startup.