feat(vlm): add perform_ocr and understand_video actions #2
Draft
AlanAAG wants to merge 74 commits into
Added name character limit to 20
Two bugs were fixed.

Bug 1: `prompt_cache_key` is an OpenAI-specific routing hint passed in `extra_body`. xAI's API ignores it, so it does not help with cache routing for Grok. I now skip it when `self.provider == "grok"`.

Bug 2: The wrong field was used to read cached tokens:
* OpenAI → `usage.prompt_tokens_details.cached_tokens`
* Grok (xAI) → `usage.prompt_cache_hit_tokens`

The code always read the OpenAI field, so Grok always reported 0 cached tokens, making every call look like a full cache miss. I fixed this by branching on `self.provider == "grok"` to read the correct field. I also updated the cache metrics log to show the actual provider name (grok, openai, etc.).

Both fixes are in the same branch, "feature/CLI".
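The two fixes above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper names `read_cached_tokens` and `build_extra_body` are hypothetical, the usage payload is modelled as a plain dict, and the field names are taken from the commit message rather than checked against either provider's API reference.

```python
from typing import Any, Dict, Mapping


def read_cached_tokens(provider: str, usage: Mapping[str, Any]) -> int:
    """Return the cached prompt-token count from a usage payload.

    Hypothetical helper; field names follow the commit message above.
    """
    if provider == "grok":
        # Grok (xAI): flat field on the usage object.
        return int(usage.get("prompt_cache_hit_tokens") or 0)
    # OpenAI-style: nested under prompt_tokens_details.
    details = usage.get("prompt_tokens_details") or {}
    return int(details.get("cached_tokens") or 0)


def build_extra_body(provider: str, cache_key: str) -> Dict[str, str]:
    """Attach prompt_cache_key only for providers assumed to honour it."""
    if provider == "grok":
        return {}  # skipped for Grok, per Bug 1
    return {"prompt_cache_key": cache_key}
```

Keeping the provider branch in one small function means the metrics log and the request builder cannot disagree about which field applies.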
…n-upgrade Improvement/action upgrade
…into feature/CLI
undo changes on prompt_sanitizer
Reset max actions per task to normal rate.
…t-update Feature/task limit update
- Install Issues on Mac fixed
- Python compatibility & Syntax Issues fixed
- CLI skills updated
- Local LLM compatibility Issues fixed
- Image action error fixed
Tasks now track the platform they were started on (Task.source_platform), and do_chat/do_chat_with_attachments resolve the outbound platform from that field via session_id, falling back to the user's Preferred Messaging Platform (read from USER.md, defaulting to "CraftBot Interface"). When a running task receives a new message from a different platform, it switches source_platform so subsequent replies follow the user. Also fixes the USER.md template which was missing the Preferred Messaging Platform placeholder, causing onboarding to silently drop the selected value.
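The resolution order described above could look something like the sketch below. Only `Task.source_platform`, `session_id`, and the "CraftBot Interface" default come from the commit message; the `Task` shape, the function names, and the way the preferred platform is passed in are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional

DEFAULT_PLATFORM = "CraftBot Interface"  # default named in the commit message


@dataclass
class Task:
    """Simplified stand-in for the real Task class (assumed shape)."""
    session_id: str
    source_platform: Optional[str] = None


def resolve_outbound_platform(
    tasks: Dict[str, Task], session_id: str, preferred: Optional[str]
) -> str:
    """Prefer the task's source platform, then the user's preference, then the default."""
    task = tasks.get(session_id)
    if task is not None and task.source_platform:
        return task.source_platform
    return preferred or DEFAULT_PLATFORM


def on_incoming_message(task: Task, platform: str) -> None:
    """Switch the task to the platform the latest message arrived on."""
    if platform and platform != task.source_platform:
        task.source_platform = platform
```

In this sketch `preferred` stands for the value read from USER.md; parsing that file is left out.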
Update version from 1.2.2 to 1.2.3 in settings config
- json_mode flag on describe_image_bytes / _openai_describe_bytes; removes the need for a separate _openai_describe_bytes_plain helper
- describe_image_ocr delegates to describe_image_bytes with OCR system prompt and json_mode=False (no duplicated provider routing)
- describe_video_frames with Gemini native multi-image path and universal per-frame fallback synthesiser for other providers
- _gemini_describe_video_frames and _multi_frame_describe_fallback helpers
- generate_multimodal accepts image_bytes_list instead of single bytes, unifying single and multi-image code paths
- _gemini_describe_bytes updated to pass [image_bytes] list
- DeepSeek raises RuntimeError for VLM calls instead of silent failure
- perform_ocr and understand_video action files (no execute= alias)
- perform_ocr / understand_video bridge methods in InternalActionInterface
- opencv-python-headless added to requirements.txt

Resolves review issues from CraftOS-dev#196

Co-authored-by: Alan Ayala <AlanAAG@users.noreply.github.com>
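The delegation pattern in this commit might be sketched as below. The method names follow the commit message, but the class `VLMSketch`, its constructor, the injected `backend` callable, and the OCR prompt text are all hypothetical stand-ins; the real interface's signatures are not shown in this PR text. The per-frame fallback here simply concatenates per-frame descriptions, whereas the real "synthesiser" presumably does more.

```python
from typing import Callable, List

# Hypothetical prompt text; the real OCR system prompt is not shown in the PR.
OCR_SYSTEM_PROMPT = "Extract all visible text from the image verbatim."


class VLMSketch:
    """Illustrative stand-in for the VLM interface (not the real class)."""

    def __init__(self, provider: str, backend: Callable[[bytes, str, bool], str]):
        self.provider = provider
        self._backend = backend  # assumed callable: (image, prompt, json_mode) -> text

    def describe_image_bytes(
        self, image_bytes: bytes, system_prompt: str, json_mode: bool = True
    ) -> str:
        if self.provider == "deepseek":
            # Fail loudly instead of returning an empty string.
            raise RuntimeError("DeepSeek has no VLM support")
        return self._backend(image_bytes, system_prompt, json_mode)

    def describe_image_ocr(self, image_bytes: bytes) -> str:
        # Reuse the single routing path; only the prompt and json_mode differ.
        return self.describe_image_bytes(image_bytes, OCR_SYSTEM_PROMPT, json_mode=False)

    def describe_video_frames(self, frames: List[bytes], prompt: str) -> str:
        # Per-frame fallback: describe each frame, then join the notes.
        notes = [
            f"Frame {i + 1}: {self.describe_image_bytes(frame, prompt, json_mode=False)}"
            for i, frame in enumerate(frames)
        ]
        return "\n".join(notes)
```

Because the OCR and video paths both go through `describe_image_bytes`, provider checks such as the DeepSeek error apply to them without extra code.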
Addresses all reviewer comments on CraftOS-dev/CraftBot#196.

This branch is built directly on top of upstream/dev (CraftOS-dev/CraftBot:dev) and contains exactly 6 files, 474 insertions, 19 deletions.

Reviewer issues resolved
1. `generate_multimodal` unified (ahmad-ajmal)
   `GeminiClient.generate_multimodal()` now accepts `image_bytes_list: List[bytes]` instead of a single `image_bytes`. Single-image callers pass a one-element list. No separate multi-image variant needed.
2. `describe_image_ocr` deduplication (ahmad-ajmal)
   `describe_image_ocr()` delegates directly to `describe_image_bytes()` with an OCR system prompt and `json_mode=False`. No duplicated provider routing, token counting, or cleanup logic.
3. Separate `_openai_describe_bytes_plain` removed (ahmad-ajmal)
   `json_mode` flag folded into `_openai_describe_bytes()`. The `plain` variant is gone.
4. Model not hardcoded in `understand_video` (ahmad-ajmal)
   Calls `get_vlm_model()` with `"gemini-1.5-pro"` only as a last-resort fallback.
5. `execute =` alias removed
   `execute = perform_ocr` / `execute = understand_video` lines removed. Not used by any other action in the codebase.
6. Video fallback strategy (zfoong)
   `understand_video` uses Gemini native video API when a Google API key is present; falls back to OpenCV keyframe extraction otherwise. Mirrors the pattern from `generate_image`.
7. DeepSeek silent failure fixed
   `describe_image_bytes()` now raises `RuntimeError` for DeepSeek (which has no VLM support) instead of returning an empty string.

Files changed (6 only)
- `agent_core/core/llm/google_gemini_client.py`: `generate_multimodal()` accepts `image_bytes_list`
- `agent_core/core/impl/vlm/interface.py`: `json_mode` on `describe_image_bytes` / `_openai_describe_bytes`; new `describe_image_ocr()`, `describe_video_frames()`, `_gemini_describe_video_frames()`, `_multi_frame_describe_fallback()`
- `app/internal_action_interface.py`: `perform_ocr()` and `understand_video()` bridge methods
- `app/data/action/perform_ocr.py`
- `app/data/action/understand_video.py`
- `requirements.txt`: `opencv-python-headless` added

What is NOT in this PR

- `settings.json`, `mcp_config.json`, or `skills_config.json` changes
- `from __future__ import annotations` deletions or additions
- `agent_base.py` or `main.py` changes

How to update PR CraftOS-dev#196
Since this repo (`AlanAAG/CraftBot`) is a fork of `CraftOS-dev/CraftBot`, to update PR CraftOS-dev#196 in place:

Then on GitHub, PR CraftOS-dev#196's head branch will update automatically.
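Returning to reviewer items 4 and 6, the selection logic in `understand_video` might be sketched as follows. Everything here is an assumption for illustration: the function names, the strategy labels, and the evenly spaced sampling rule are not taken from the PR, which only states that a Google API key selects the Gemini native path, that OpenCV keyframe extraction is the fallback, and that `"gemini-1.5-pro"` is a last-resort model default.

```python
from typing import List, Optional


def pick_video_strategy(google_api_key: Optional[str]) -> str:
    """Use the Gemini native path when a key is present, else keyframes."""
    return "gemini_native" if google_api_key else "opencv_keyframes"


def resolve_model(configured: Optional[str], fallback: str = "gemini-1.5-pro") -> str:
    """Prefer the configured VLM model; use the fallback only as a last resort."""
    return configured or fallback


def keyframe_indices(total_frames: int, max_frames: int) -> List[int]:
    """Pick up to max_frames evenly spaced frame indices (assumed sampling rule).

    In the fallback path these indices would be handed to a frame reader
    such as OpenCV's VideoCapture; that part is omitted here.
    """
    if total_frames <= 0 or max_frames <= 0:
        return []
    count = min(total_frames, max_frames)
    step = total_frames / count
    return [int(i * step) for i in range(count)]
```

Separating index selection from frame decoding keeps the sampling rule testable without a video file or an OpenCV install.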