feat(vlm): add perform_ocr and understand_video actions by AlanAAG · Pull Request #2 · AlanAAG/CraftBot

AlanAAG · 2026-04-21T18:53:30Z

Addresses all reviewer comments on CraftOS-dev/CraftBot#196.

This branch is built directly on top of upstream/dev (CraftOS-dev/CraftBot:dev) and changes exactly 6 files (474 insertions, 19 deletions).

Reviewer issues resolved

1. `generate_multimodal` unified (ahmad-ajmal)

GeminiClient.generate_multimodal() now accepts image_bytes_list: List[bytes] instead of a single image_bytes. Single-image callers pass a one-element list. No separate multi-image variant needed.
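
A minimal sketch of the unified call shape, assuming a simplified stand-in for the client. Only the `image_bytes_list` parameter comes from this PR; the `prompt` parameter and the returned dict are hypothetical and exist only to make the example runnable.

```python
from typing import Dict, List


def generate_multimodal(prompt: str, image_bytes_list: List[bytes]) -> Dict[str, list]:
    """Hypothetical stand-in: builds one request from a prompt and N images."""
    if not image_bytes_list:
        raise ValueError("image_bytes_list must contain at least one image")
    # One code path for any number of images: each image becomes one part.
    parts = [{"kind": "text", "text": prompt}]
    parts.extend({"kind": "image", "size": len(b)} for b in image_bytes_list)
    return {"parts": parts}


# Single-image callers wrap their bytes in a one-element list.
single = generate_multimodal("Describe this image.", [b"image-bytes"])
# Multi-image callers pass the same parameter with more elements.
multi = generate_multimodal("Compare these frames.", [b"frame-1", b"frame-2"])
```

Because both call sites share one parameter, no separate multi-image method is needed.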

2. `describe_image_ocr` deduplication (ahmad-ajmal)

describe_image_ocr() delegates directly to describe_image_bytes() with an OCR system prompt and json_mode=False. No duplicated provider routing, token counting, or cleanup logic.
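
The delegation can be sketched roughly as follows. The function names and `json_mode=False` come from the description above; the prompt wording, the `system_prompt` parameter name, the `json_mode=True` default, and the stubbed body of `describe_image_bytes` are assumptions made for illustration.

```python
from typing import Optional

# Hypothetical wording; the actual OCR system prompt is not shown in this PR description.
OCR_SYSTEM_PROMPT = "Extract all visible text from the image verbatim."


def describe_image_bytes(image_bytes: bytes,
                         system_prompt: Optional[str] = None,
                         json_mode: bool = True) -> str:
    """Stand-in for the shared entry point; the real one routes to a provider."""
    mode = "json" if json_mode else "text"
    return f"[{mode}] {system_prompt or 'default prompt'} ({len(image_bytes)} bytes)"


def describe_image_ocr(image_bytes: bytes) -> str:
    # No provider routing, token counting, or cleanup here:
    # all of that is inherited from describe_image_bytes.
    return describe_image_bytes(image_bytes,
                                system_prompt=OCR_SYSTEM_PROMPT,
                                json_mode=False)
```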

3. Separate `_openai_describe_bytes_plain` removed (ahmad-ajmal)

json_mode flag folded into _openai_describe_bytes(). The plain variant is gone.

4. Model not hardcoded in `understand_video` (ahmad-ajmal)

`understand_video` calls `get_vlm_model()` and uses `"gemini-1.5-pro"` only as a last-resort fallback.
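
A small sketch of that lookup order. It assumes `get_vlm_model()` returns `None` or an empty string when no model is configured, which this PR description does not confirm; `resolve_video_model` is a hypothetical name.

```python
from typing import Callable, Optional

FALLBACK_VIDEO_MODEL = "gemini-1.5-pro"


def resolve_video_model(get_vlm_model: Callable[[], Optional[str]]) -> str:
    """Prefer the configured VLM model; use the fallback only when none is set."""
    return get_vlm_model() or FALLBACK_VIDEO_MODEL
```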

5. `execute =` alias removed

execute = perform_ocr / execute = understand_video lines removed. Not used by any other action in the codebase.

6. Video fallback strategy (zfoong)

understand_video uses Gemini native video API when a Google API key is present; falls back to OpenCV keyframe extraction otherwise. Mirrors the pattern from generate_image.
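
The fallback could look roughly like the sketch below. The environment variable name `GOOGLE_API_KEY`, the even-spacing sampling rule, the `max_frames` default, and all function names are assumptions; the PR only states that keyframes are extracted with OpenCV when no Google API key is present.

```python
import os
from typing import List, Mapping


def pick_video_strategy(env: Mapping[str, str] = os.environ) -> str:
    """Return 'gemini_native' when a Google key is set, else 'opencv_keyframes'."""
    # GOOGLE_API_KEY is an assumed variable name, not confirmed by the PR.
    return "gemini_native" if env.get("GOOGLE_API_KEY") else "opencv_keyframes"


def keyframe_indices(total_frames: int, max_frames: int) -> List[int]:
    """Evenly spaced frame indices across the clip (assumed sampling rule)."""
    if total_frames <= 0 or max_frames <= 0:
        return []
    count = min(total_frames, max_frames)
    step = total_frames / count
    return [int(i * step) for i in range(count)]


def extract_keyframes(video_path: str, max_frames: int = 8) -> List[bytes]:
    """Read the selected frames with OpenCV and return them as JPEG bytes."""
    import cv2  # provided by opencv-python-headless

    cap = cv2.VideoCapture(video_path)
    try:
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        frames: List[bytes] = []
        for idx in keyframe_indices(total, max_frames):
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
            ok, frame = cap.read()
            if not ok:
                continue  # skip frames that fail to decode
            encoded_ok, buf = cv2.imencode(".jpg", frame)
            if encoded_ok:
                frames.append(buf.tobytes())
        return frames
    finally:
        cap.release()
```

The resulting list of JPEG bytes has the same shape as the `image_bytes_list` argument described in item 1, so the extracted frames can be passed through the multi-image path.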

7. DeepSeek silent failure fixed

describe_image_bytes() now raises RuntimeError for DeepSeek (which has no VLM support) instead of returning an empty string.
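
A sketch of the fail-fast check, assuming the provider is identified by a lowercase string such as `"deepseek"`. The helper name, the set, and the error message are hypothetical.

```python
# Hypothetical set; the PR names only DeepSeek as lacking VLM support.
PROVIDERS_WITHOUT_VLM = {"deepseek"}


def ensure_vlm_supported(provider: str) -> None:
    """Raise instead of letting the caller receive an empty description."""
    if provider.lower() in PROVIDERS_WITHOUT_VLM:
        raise RuntimeError(f"Provider '{provider}' does not support image input")
```

Raising here surfaces the misconfiguration to the caller, whereas an empty string is indistinguishable from an image with nothing to describe.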

Files changed (6 only)

| File | Change |
|---|---|
| `agent_core/core/llm/google_gemini_client.py` | `generate_multimodal()` accepts `image_bytes_list` |
| `agent_core/core/impl/vlm/interface.py` | `json_mode` on `describe_image_bytes`/`_openai_describe_bytes`; new `describe_image_ocr()`, `describe_video_frames()`, `_gemini_describe_video_frames()`, `_multi_frame_describe_fallback()` |
| `app/internal_action_interface.py` | `perform_ocr()` and `understand_video()` bridge methods |
| `app/data/action/perform_ocr.py` | New action |
| `app/data/action/understand_video.py` | New action |
| `requirements.txt` | `opencv-python-headless` added |

What is NOT in this PR

- No `settings.json`, `mcp_config.json`, or `skills_config.json` changes
- No `from __future__ import annotations` deletions or additions
- No unrelated features (grep rewrite, skill management, onboarding, etc.)
- No `agent_base.py` or `main.py` changes

How to update PR CraftOS-dev#196

Since this repo (AlanAAG/CraftBot) is a fork of CraftOS-dev/CraftBot, to update PR CraftOS-dev#196 in place:

git push origin fix/ocr-video-clean:feature/ocr-video-actions --force-with-lease

Then on GitHub, PR CraftOS-dev#196's head branch will update automatically.

Two bugs were fixed.

Bug 1: `prompt_cache_key` is an OpenAI-specific routing hint sent in `extra_body`. xAI's API ignores it, so it does not help with cache routing for Grok. It is now skipped when `self.provider == "grok"`.

Bug 2: The wrong field was used to read cached tokens:

* OpenAI → `usage.prompt_tokens_details.cached_tokens`
* Grok (xAI) → `usage.prompt_cache_hit_tokens`

The code always read the OpenAI field, so Grok always reported 0 cached tokens, which made every call look like a full cache miss. The code now branches on `self.provider == "grok"` to read the correct field. The cache metrics log also shows the actual provider name (grok, openai, etc.).

The fix was pushed to the same branch, `feature/CLI`.
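
The two fixes can be sketched as below. The field names and the `"grok"` provider string are taken from the commit message above; the function names and the use of plain dicts for the usage payload are assumptions made to keep the example self-contained.

```python
from typing import Any, Dict, Mapping


def build_extra_body(provider: str, cache_key: str) -> Dict[str, str]:
    """Bug 1: send prompt_cache_key only to providers that honor it."""
    if provider == "grok":
        return {}  # per the commit message, xAI ignores this hint
    return {"prompt_cache_key": cache_key}


def read_cached_tokens(provider: str, usage: Mapping[str, Any]) -> int:
    """Bug 2: read the cached-token count from the provider-specific field."""
    if provider == "grok":
        return int(usage.get("prompt_cache_hit_tokens") or 0)
    details = usage.get("prompt_tokens_details") or {}
    return int(details.get("cached_tokens") or 0)
```

Reading the OpenAI field from a Grok-shaped payload returns 0, which matches the false cache misses described in the commit message.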

Tasks now track the platform they were started on (Task.source_platform), and do_chat/do_chat_with_attachments resolve the outbound platform from that field via session_id, falling back to the user's Preferred Messaging Platform (read from USER.md, defaulting to "CraftBot Interface"). When a running task receives a new message from a different platform, it switches source_platform so subsequent replies follow the user. Also fixes the USER.md template which was missing the Preferred Messaging Platform placeholder, causing onboarding to silently drop the selected value.
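
A minimal sketch of the resolution order described above. `Task.source_platform` and the `"CraftBot Interface"` default come from the commit message; the dataclass layout, the function names, and passing the preferred platform as a plain argument (instead of reading it from USER.md) are simplifications made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_PLATFORM = "CraftBot Interface"


@dataclass
class Task:
    session_id: str
    source_platform: Optional[str] = None


def resolve_outbound_platform(task: Optional[Task], preferred: Optional[str]) -> str:
    """Task platform first, then the user's preference, then the default."""
    if task is not None and task.source_platform:
        return task.source_platform
    return preferred or DEFAULT_PLATFORM


def on_incoming_message(task: Task, platform: str) -> None:
    """Switch the task's platform when the user writes from a different one."""
    if platform and platform != task.source_platform:
        task.source_platform = platform
```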

- `json_mode` flag on `describe_image_bytes` / `_openai_describe_bytes`; removes the need for a separate `_openai_describe_bytes_plain` helper
- `describe_image_ocr` delegates to `describe_image_bytes` with OCR system prompt and `json_mode=False` (no duplicated provider routing)
- `describe_video_frames` with Gemini native multi-image path and universal per-frame fallback synthesiser for other providers
- `_gemini_describe_video_frames` and `_multi_frame_describe_fallback` helpers
- `generate_multimodal` accepts `image_bytes_list` instead of single bytes, unifying single and multi-image code paths
- `_gemini_describe_bytes` updated to pass `[image_bytes]` list
- DeepSeek raises `RuntimeError` for VLM calls instead of silent failure
- `perform_ocr` and `understand_video` action files (no `execute=` alias)
- `perform_ocr` / `understand_video` bridge methods in `InternalActionInterface`
- `opencv-python-headless` added to `requirements.txt`

Resolves review issues from CraftOS-dev#196

Co-authored-by: Alan Ayala <AlanAAG@users.noreply.github.com>

zfoong and others added 30 commits April 13, 2026 13:15

bug:heartbeat to avoid websocket closing issue

2d5dd9c

CLI Anything added

9c99622

bug:fix provider VLM issue

d8a4fe4

Added name character limit to 20

5d121ea

Add CLI-Anything integration to crafbot

f85346b

Invoke skill with command

23abdcf

improvement:update trigger priority (simple > complex task)

cf511d6

improvement:CWD included in environment prompt

786cfc4

Delete craftbot.log

894b732

Delete craftbot.pid

fb898a8

Delete craftbot.pid

26da4b9

Delete craftbot.log

403398f

Delete craftbot.log

aa4fae2

Delete craftbot.pid

a572d74

Delete craftbot.log

eee6371

Delete craftbot.pid

e74ec36

Update anthrophic default model to 4.5

b9efe54

improve generate image action

f5e791c

improvement:improve pretty json output and grep_files action

75c3458

add list_skills and use_skill actions, improved web_fetch action

4d4d3fe

Merge pull request CraftOS-dev#194 from CraftOS-dev/improvement/actio…

767fc1f

…n-upgrade Improvement/action upgrade

bug:reinforce wait for user reply message end with question

1989eb6

Merge branch 'V1.2.3' of https://github.com/craftos-dev/craftbot into…

d28902a

… V1.2.3

CLI SKILLS Improvements

5feaa80

Merge branch 'feature/CLI' of https://github.com/CraftOS-dev/CraftBot …

7df6625

…into feature/CLI

cli anything help guid added

c49fe56

feature:token limit handling with interface update

4fea2d6

bug:update deprecated test model and move them to config

c450866

Merge branch 'main' of https://github.com/craftos-dev/craftbot

1f2ebd5

zfoong and others added 30 commits April 15, 2026 18:57

Update prompt_sanitizer.py

624f3df

undo changes on prompt_sanitizer

improvement:onboarding process update

9d62935

Further UX improvement

8b4c5b4

fix: anchor workspace root to absolute path to prevent CWD-relative

5ca80b8

state resolution

feature:improve design, UX, and fixed bug

d222492

Update types.py

c40141d

Reset max actions per task to normal rate.

Merge pull request CraftOS-dev#199 from CraftOS-dev/feature/task-limi…

af2b972

…t-update Feature/task limit update

Merge branch 'V1.2.3' of https://github.com/craftos-dev/craftbot into…

1b3be6a

… V1.2.3

CLI Skill updated

ec3dc67

feature:added agent avatar feature

478b9c1

Fix openAI API issue

7ebfde6

Fixed OpenAI VLM issue

aab9c0c

Major Issues are fixed

77d4d87

- Install Issues on Mac fixed
- Python compatibility & Syntax Issues fixed
- CLI skills updated
- Local LLM compatibility Issues fixed
- Image action error fixed

fix hard onboarding back method

ae6e82e

model switching no taking effect, issue 193

153e9fb

Merge branch 'V1.2.3' of https://github.com/craftos-dev/craftbot into…

a10fa86

… V1.2.3

Fix llm fail fail to recover issue

1a87f4e

Merge branch 'main' into staging

807f8f8

Merge branch 'staging' into dev

e9bc877

Merge branch 'dev' into V1.2.3

9d888c3

Copy if no USER.md during onboarding

92cd1d8

Merge pull request CraftOS-dev#203 from CraftOS-dev/V1.2.3

53fa35c

V1.2.3

Merge pull request CraftOS-dev#204 from CraftOS-dev/dev

748c44f

Dev

Merge pull request CraftOS-dev#205 from CraftOS-dev/staging

3d987b3

Staging

Merge branch 'main' into feature/CLI

9c5b41a

Merge pull request CraftOS-dev#201 from CraftOS-dev/feature/CLI

74aebad

Feature/cli

name limit on craftbot

3fbe092

Update settings.json

a17980b

Update version from 1.2.2 to 1.2.3 in settings config

AlanAAG wants to merge 74 commits into main from fix/ocr-video-clean

6 participants
