Skip to content

feat(vlm): add perform_ocr and understand_video actions#2

Draft
AlanAAG wants to merge 74 commits into
mainfrom
fix/ocr-video-clean
Draft

feat(vlm): add perform_ocr and understand_video actions#2
AlanAAG wants to merge 74 commits into
mainfrom
fix/ocr-video-clean

Conversation

@AlanAAG
Copy link
Copy Markdown
Owner

@AlanAAG AlanAAG commented Apr 21, 2026

Addresses all reviewer comments on CraftOS-dev/CraftBot#196.

This branch is built directly on top of upstream/dev (CraftOS-dev/CraftBot:dev) and contains exactly 6 files, 474 insertions, 19 deletions.


Reviewer issues resolved

1. generate_multimodal unified (ahmad-ajmal)

GeminiClient.generate_multimodal() now accepts image_bytes_list: List[bytes] instead of a single image_bytes. Single-image callers pass a one-element list. No separate multi-image variant needed.

2. describe_image_ocr deduplication (ahmad-ajmal)

describe_image_ocr() delegates directly to describe_image_bytes() with an OCR system prompt and json_mode=False. No duplicated provider routing, token counting, or cleanup logic.

3. Separate _openai_describe_bytes_plain removed (ahmad-ajmal)

json_mode flag folded into _openai_describe_bytes(). The plain variant is gone.

4. Model not hardcoded in understand_video (ahmad-ajmal)

Calls get_vlm_model() with "gemini-1.5-pro" only as a last-resort fallback.

5. execute = alias removed

execute = perform_ocr / execute = understand_video lines removed. Not used by any other action in the codebase.

6. Video fallback strategy (zfoong)

understand_video uses Gemini native video API when a Google API key is present; falls back to OpenCV keyframe extraction otherwise. Mirrors the pattern from generate_image.

7. DeepSeek silent failure fixed

describe_image_bytes() now raises RuntimeError for DeepSeek (which has no VLM support) instead of returning an empty string.


Files changed (6 only)

File Change
agent_core/core/llm/google_gemini_client.py generate_multimodal() accepts image_bytes_list
agent_core/core/impl/vlm/interface.py json_mode on describe_image_bytes/_openai_describe_bytes; new describe_image_ocr(), describe_video_frames(), _gemini_describe_video_frames(), _multi_frame_describe_fallback()
app/internal_action_interface.py perform_ocr() and understand_video() bridge methods
app/data/action/perform_ocr.py New action
app/data/action/understand_video.py New action
requirements.txt opencv-python-headless added

What is NOT in this PR

  • No settings.json, mcp_config.json, or skills_config.json changes
  • No from __future__ import annotations deletions or additions
  • No unrelated features (grep rewrite, skill management, onboarding, etc.)
  • No agent_base.py or main.py changes

How to update PR CraftOS-dev#196

Since this repo (AlanAAG/CraftBot) is a fork of CraftOS-dev/CraftBot, to update PR CraftOS-dev#196 in place:

git push origin fix/ocr-video-clean:feature/ocr-video-actions --force-with-lease

Then on GitHub, PR CraftOS-dev#196's head branch will update automatically.

Open in Web Open in Cursor 

zfoong and others added 30 commits April 13, 2026 13:15
Added name character limit to 20
There were two bugs fixed:

Bug 1:
`prompt_cache_key` is an OpenAI-specific routing hint in `extra_body`. xAI’s API ignores it, so it doesn’t help with cache routing for Grok.
So I skipped it when `self.provider == "grok"`.

Bug 2:
Wrong field was used for reading cached tokens:
* OpenAI → `usage.prompt_tokens_details.cached_tokens`
* Grok (xAI) → `usage.prompt_cache_hit_tokens`

----------------
The code was always reading the OpenAI field, so Grok always returned 0 cached tokens, making it look like every call was a full cache miss. I fixed this by branching on `self.provider == "grok"` to read the correct field.
Additionally, I updated the cache metrics log to show the actual provider name (grok, openai, etc.).
----------------

I updated the fixed in the same branch " feature/CLI"
zfoong and others added 30 commits April 15, 2026 18:57
undo changes on prompt_sanitizer
Reset max actions per task to normal rate.
- Install Issues on Mac fixed
- Python compatibility &  Syntax Issues fixed
-  CLI skills updated
- Local LLM compatibility Issues fixed
-  Image action error fixed
Tasks now track the platform they were started on (Task.source_platform),
and do_chat/do_chat_with_attachments resolve the outbound platform from
that field via session_id, falling back to the user's Preferred Messaging
Platform (read from USER.md, defaulting to "CraftBot Interface"). When a
running task receives a new message from a different platform, it switches
source_platform so subsequent replies follow the user. Also fixes the
USER.md template which was missing the Preferred Messaging Platform
placeholder, causing onboarding to silently drop the selected value.
Update version from 1.2.2 to 1.2.3 in settings config
- json_mode flag on describe_image_bytes / _openai_describe_bytes;
  removes the need for a separate _openai_describe_bytes_plain helper
- describe_image_ocr delegates to describe_image_bytes with OCR system
  prompt and json_mode=False (no duplicated provider routing)
- describe_video_frames with Gemini native multi-image path and
  universal per-frame fallback synthesiser for other providers
- _gemini_describe_video_frames and _multi_frame_describe_fallback helpers
- generate_multimodal accepts image_bytes_list instead of single bytes,
  unifying single and multi-image code paths
- _gemini_describe_bytes updated to pass [image_bytes] list
- DeepSeek raises RuntimeError for VLM calls instead of silent failure
- perform_ocr and understand_video action files (no execute= alias)
- perform_ocr / understand_video bridge methods in InternalActionInterface
- opencv-python-headless added to requirements.txt

Resolves review issues from CraftOS-dev#196

Co-authored-by: Alan Ayala <AlanAAG@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants