[V1.2.3] Add dedicated OCR and video analysis actions (#155) by AlanAAG · Pull Request #196 · CraftOS-dev/CraftBot

AlanAAG · 2026-04-15T09:32:49Z

What changed

Added dedicated OCR action via perform_ocr
Added video analysis action via understand_video
Extended VLM interface with describe_image_ocr and describe_video_frames
Added _openai_describe_bytes_plain for raw text output (no JSON enforcement)
Extended GeminiClient.generate_multimodal to accept multiple image frames via image_bytes_list
Added bridge methods perform_ocr and understand_video in InternalActionInterface
Added VLM availability guard in describe_image action
Added opencv-python-headless to requirements for keyframe extraction

Closes #155

Added name character limit to 20

There were two bugs fixed: Bug 1: `prompt_cache_key` is an OpenAI-specific routing hint in `extra_body`. xAI’s API ignores it, so it doesn’t help with cache routing for Grok. So I skipped it when `self.provider == "grok"`. Bug 2: Wrong field was used for reading cached tokens: * OpenAI → `usage.prompt_tokens_details.cached_tokens` * Grok (xAI) → `usage.prompt_cache_hit_tokens` ---------------- The code was always reading the OpenAI field, so Grok always returned 0 cached tokens, making it look like every call was a full cache miss. I fixed this by branching on `self.provider == "grok"` to read the correct field. Additionally, I updated the cache metrics log to show the actual provider name (grok, openai, etc.). ---------------- I updated the fixed in the same branch " feature/CLI"

…into feature/CLI

zfoong · 2026-04-16T04:06:14Z

The current approach for video understanding can be improved. Currently, only Google's model/API supports video understanding (Qwen and Moonshot do too, but we haven't set them up as providers yet).

Here is what you can do for now: If the user has Google API key set up, then use the Gemini API to perform video understanding; if a Google API key is not set up, the agent can fall back to using your current approach (which might be expensive and not as effective). You can refer to the generate_image action and see how it is done.

I haven't tested the PR yet. Once this is improved, I will test the PR.

…d_video, OpenCV as fallback

- Install Issues on Mac fixed - Python compatibility & Syntax Issues fixed - CLI skills updated - Local LLM compatibility Issues fixed - Image action error fixed

…exceptions in describe_image_bytes

…r and understand_video

…ability guard

ahmad-ajmal · 2026-04-16T17:51:52Z

generate_multimodal_multi_image looks like it could be folded into generate_multimodal, the only difference is accepting a list of images instead of one. Could we update generate_multimodal to handle both cases? That way we avoid the duplicated payload/token logic.
describe_image_ocr seems to repeat most of describe_image_bytes (provider routing, token counting, cleanup). Would it work to call describe_image_bytes with the OCR system prompt and json_mode=False?
The model is hardcoded as "gemini-1.5-pro" in understand_video.py, should this pull from config (e.g. get_vlm_model()) to stay in sync with the rest of the VLM settings?

…model - Merge generate_multimodal_multi_image into generate_multimodal (image_bytes_list param) - Add json_mode param to describe_image_bytes; describe_image_ocr now a thin wrapper - understand_video pulls model from get_vlm_model() with gemini-1.5-pro fallback - Add test suites: gemini_client_multimodal, vlm_json_mode, ocr_wrapper, understand_video_model

V1.2.3

Dev

Staging

Feature/cli

Update version from 1.2.2 to 1.2.3 in settings config

ahmad-ajmal

Just one small fix and the merge conflicts

…d OCR logic

…inal merge

ahmad-ajmal · 2026-04-21T10:52:27Z

I don't you need to change 89 files in this PR

AlanAAG · 2026-04-21T11:01:36Z

@ahmad-ajmal, hey its so many files because I just deleted the future annotation call I used to test throughout the entire project

ahmad-ajmal · 2026-04-21T14:57:32Z

+    except Exception as e:
+        return {'status': 'error', 'summary': '', 'file_path': '', 'file_saved': False, 'message': str(e)}
+
+execute = perform_ocr


what is this line for? (same in understand_video)

ahmad-ajmal · 2026-04-21T15:41:25Z

        except Exception as e:
            logger.warning(f"[VLM] Failed to report usage: {e}")

+    def _openai_describe_bytes_plain(self, image_bytes: bytes, sys: str | None, usr: str) -> Dict[str, Any]:


almost identical to _openai_describe_bytes

ahmad-ajmal · 2026-04-21T15:43:56Z

@@ -1,4 +1,5 @@
 # -*- coding: utf-8 -*-
+from __future__ import annotations


very confusing change - is there a reason for this?

ahmad-ajmal · 2026-04-21T15:44:19Z

@@ -1,4 +1,5 @@
 # -*- coding: utf-8 -*-
+from __future__ import annotations


Same for here, is there a reason for this?

ahmad-ajmal · 2026-04-21T15:45:27Z

dont push this - dont want to enable MCPs by default

ahmad-ajmal · 2026-04-21T15:45:43Z

dont push this

ahmad-ajmal · 2026-04-21T15:46:07Z

dont push this - dont want skills enabled by default

ahmad-ajmal · 2026-04-21T15:52:04Z

@ahmad-ajmal, hey its so many files because I just deleted the future annotation call I used to test throughout the entire project

revert the from __future__ import annotations deletions. It's unnecessarily bulking up the PR and they have been there for a long time. Don't worry, you didn't add them.

…config, skills_config

…rstand_video

…be_bytes via json_mode param

…nnotations in log_events and profiler

…everted bulk deletion)

…tead of delegating entirely to InternalActionInterface

…IPTIONS

korivi-CraftOS and others added 10 commits April 13, 2026 16:02

CLI Anything added

9c99622

Added name character limit to 20

5d121ea

Added name character limit to 20

Add CLI-Anything integration to crafbot

f85346b

Delete craftbot.pid

26da4b9

Delete craftbot.log

403398f

feat: add OCR and video analysis actions (#155)

25f21c1

CLI SKILLS Improvements

5feaa80

Merge branch 'feature/CLI' of https://github.com/CraftOS-dev/CraftBot …

7df6625

…into feature/CLI

cli anything help guid added

c49fe56

AlanAAG requested review from ahmad-ajmal and zfoong April 15, 2026 09:32

korivi-CraftOS and others added 6 commits April 16, 2026 15:18

CLI Skill updated

ec3dc67

improvement: use Gemini native video API as primary path in understan…

fa0284e

…d_video, OpenCV as fallback

Major Issues are fixed

77d4d87

- Install Issues on Mac fixed - Python compatibility & Syntax Issues fixed - CLI skills updated - Local LLM compatibility Issues fixed - Image action error fixed

fix(vlm): remove response_format json_object from byteplus, re-raise …

6915894

…exceptions in describe_image_bytes

fix(actions): split action_sets string into proper list in perform_oc…

125cff4

…r and understand_video

fix: wire independent VLM provider/model/key resolution and add avail…

247ee92

…ability guard

AlanAAG and others added 8 commits April 17, 2026 16:40

Merge pull request #203 from CraftOS-dev/V1.2.3

53fa35c

V1.2.3

Merge pull request #204 from CraftOS-dev/dev

748c44f

Dev

Merge pull request #205 from CraftOS-dev/staging

3d987b3

Staging

Merge branch 'main' into feature/CLI

9c5b41a

Merge pull request #201 from CraftOS-dev/feature/CLI

74aebad

Feature/cli

name limit on craftbot

3fbe092

Update settings.json

a17980b

Update version from 1.2.2 to 1.2.3 in settings config

ahmad-ajmal reviewed Apr 20, 2026

View reviewed changes

Comment thread agent_core/core/llm/google_gemini_client.py Outdated

merge(vlm): resolve conflicts with dev branch, unify VLM providers an…

866601e

…d OCR logic

AlanAAG force-pushed the feature/ocr-video-actions branch from fb0be05 to 866601e Compare April 21, 2026 10:25

AlanAAG added 2 commits April 21, 2026 16:04

chore: remove test files from PR

6c0b2c2

update test deletion and future annotations for tests, prepared for f…

670e5f0

…inal merge

fix: unify mime-type to image/jpeg in generate_multimodal

fdf9171

ahmad-ajmal reviewed Apr 21, 2026

View reviewed changes

Comment thread app/config/mcp_config.json

Copy link
Copy Markdown

Collaborator

ahmad-ajmal Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dont push this - dont want to enable MCPs by default

ahmad-ajmal reviewed Apr 21, 2026

View reviewed changes

Comment thread app/config/settings.json

Copy link
Copy Markdown

Collaborator

ahmad-ajmal Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dont push this

ahmad-ajmal reviewed Apr 21, 2026

View reviewed changes

Comment thread app/config/skills_config.json

Copy link
Copy Markdown

Collaborator

ahmad-ajmal Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dont push this - dont want skills enabled by default

This was referenced Apr 21, 2026

[SUPERSEDED] use fix/ocr-video-clean instead AlanAAG/CraftBot#1

Closed

feat(vlm): add perform_ocr and understand_video actions AlanAAG/CraftBot#2

Draft

AlanAAG added 9 commits April 22, 2026 00:51

chore(config): revert personal dev config changes from settings, mcp_…

29ee8a5

…config, skills_config

chore(actions): remove unused execute alias from perform_ocr and unde…

9552308

…rstand_video

refactor(vlm): merge _openai_describe_bytes_plain into _openai_descri…

5e0a957

…be_bytes via json_mode param

fix(decorators): restore correct position of from __future__ import a…

c97536e

…nnotations in log_events and profiler

chore: restore from __future__ import annotations across all files (r…

932faad

…everted bulk deletion)

chore: remove temporary restore script

76e8c29

add comment on understand_video.py explaining dual path execution ins…

b71dee2

…tead of delegating entirely to InternalActionInterface

add video to the action sets and define video in the DEFAULT_ST_DESCR…

ecac851

…IPTIONS

modify import from google.generativeai to the new supported google.genai

2448bf6

ahmad-ajmal approved these changes Apr 22, 2026

View reviewed changes

zfoong changed the base branch from dev to V1.3.0 April 22, 2026 11:09

zfoong reviewed Apr 22, 2026

View reviewed changes

Comment thread agent_core/core/llm/google_gemini_client.py

Comment thread agent_core/core/impl/vlm/interface.py

ahmad-ajmal merged commit 76c1f47 into V1.3.0 Apr 22, 2026

CraftOS-dev deleted the feature/ocr-video-actions branch April 26, 2026 02:55

		@@ -1,4 +1,5 @@
		# -- coding: utf-8 --
		from __future__ import annotations

Conversation

AlanAAG commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changed

Uh oh!

zfoong commented Apr 16, 2026

Uh oh!

ahmad-ajmal commented Apr 16, 2026

Uh oh!

ahmad-ajmal left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ahmad-ajmal commented Apr 21, 2026

Uh oh!

AlanAAG commented Apr 21, 2026

Uh oh!

ahmad-ajmal Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

ahmad-ajmal Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

ahmad-ajmal Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

ahmad-ajmal Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

ahmad-ajmal Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

ahmad-ajmal Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

ahmad-ajmal Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

ahmad-ajmal commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

AlanAAG commented Apr 15, 2026 •

edited

Loading

ahmad-ajmal commented Apr 21, 2026 •

edited

Loading