[V1.2.3] Add dedicated OCR and video analysis actions (#155) #196

Merged
ahmad-ajmal merged 37 commits into V1.3.0 from feature/ocr-video-actions
Apr 22, 2026

Conversation

@AlanAAG
Collaborator

@AlanAAG AlanAAG commented Apr 15, 2026

What changed

  • Added dedicated OCR action via perform_ocr
  • Added video analysis action via understand_video
  • Extended VLM interface with describe_image_ocr and describe_video_frames
  • Added _openai_describe_bytes_plain for raw text output (no JSON enforcement)
  • Extended GeminiClient.generate_multimodal to accept multiple image frames via image_bytes_list
  • Added bridge methods perform_ocr and understand_video in InternalActionInterface
  • Added VLM availability guard in describe_image action
  • Added opencv-python-headless to requirements for keyframe extraction
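
As a rough illustration of the frame extraction that opencv-python-headless enables, here is a minimal sketch. It uses evenly spaced sampling rather than true keyframe detection, and the helper names `evenly_spaced_indices` and `extract_keyframes` and the `max_frames` default are assumptions for illustration, not the code in this PR.

```python
from typing import List


def evenly_spaced_indices(total_frames: int, max_frames: int) -> List[int]:
    """Pick up to max_frames frame indices spread evenly across the video."""
    if total_frames <= 0 or max_frames <= 0:
        return []
    if max_frames >= total_frames:
        return list(range(total_frames))
    step = total_frames / max_frames
    # Take the frame in the middle of each of the max_frames equal segments.
    return [int(step * i + step / 2) for i in range(max_frames)]


def extract_keyframes(video_path: str, max_frames: int = 8) -> List[bytes]:
    """Return JPEG-encoded frames sampled from video_path (hypothetical helper)."""
    import cv2  # provided by opencv-python-headless; imported lazily

    cap = cv2.VideoCapture(video_path)
    try:
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        frames: List[bytes] = []
        for idx in evenly_spaced_indices(total, max_frames):
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # seek to the chosen frame
            ok, frame = cap.read()
            if not ok:
                continue  # skip frames the decoder cannot read
            ok, buf = cv2.imencode(".jpg", frame)
            if ok:
                frames.append(buf.tobytes())
        return frames
    finally:
        cap.release()
```

The resulting list of JPEG bytes is the kind of input that a parameter such as image_bytes_list could accept.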

Closes #155

korivi-CraftOS and others added 10 commits April 13, 2026 16:02
Added name character limit to 20
There were two bugs fixed:

Bug 1:
`prompt_cache_key` is an OpenAI-specific routing hint in `extra_body`. xAI’s API ignores it, so it doesn’t help with cache routing for Grok.
So I skipped it when `self.provider == "grok"`.

Bug 2:
Wrong field was used for reading cached tokens:
* OpenAI → `usage.prompt_tokens_details.cached_tokens`
* Grok (xAI) → `usage.prompt_cache_hit_tokens`

----------------
The code was always reading the OpenAI field, so Grok always returned 0 cached tokens, making it look like every call was a full cache miss. I fixed this by branching on `self.provider == "grok"` to read the correct field.
Additionally, I updated the cache metrics log to show the actual provider name (grok, openai, etc.).
----------------

I pushed the fixes to the same branch, "feature/CLI".
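
The two fixes described in this commit message could look roughly like the sketch below. The field names are the ones quoted in the message; the function names are hypothetical, and `usage` is modeled as a plain dict for illustration, whereas a real SDK response may expose attributes instead.

```python
from typing import Any, Dict


def build_extra_body(provider: str, cache_key: str) -> Dict[str, Any]:
    """Attach prompt_cache_key only for providers that use it (Bug 1)."""
    if provider == "grok":
        return {}  # per the commit message, xAI ignores this routing hint
    return {"prompt_cache_key": cache_key}


def read_cached_tokens(provider: str, usage: Dict[str, Any]) -> int:
    """Read the cached-token count from the provider-specific field (Bug 2)."""
    if provider == "grok":
        return int(usage.get("prompt_cache_hit_tokens") or 0)
    # OpenAI-style usage nests the count under prompt_tokens_details.
    details = usage.get("prompt_tokens_details") or {}
    return int(details.get("cached_tokens") or 0)


def cache_metrics_line(provider: str, usage: Dict[str, Any]) -> str:
    """Format a log line that names the actual provider."""
    return f"[cache] provider={provider} cached_tokens={read_cached_tokens(provider, usage)}"
```

Reading the OpenAI field from a Grok-style payload yields 0, which matches the "every call looks like a cache miss" symptom described above.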
@AlanAAG AlanAAG requested review from ahmad-ajmal and zfoong April 15, 2026 09:32
@zfoong
Collaborator

zfoong commented Apr 16, 2026

The current approach for video understanding can be improved. Currently, only Google's model/API supports video understanding (Qwen and Moonshot do too, but we haven't set them up as providers yet).

Here is what you can do for now: if the user has a Google API key set up, use the Gemini API to perform video understanding; if not, the agent can fall back to your current approach (which might be expensive and less effective). You can refer to the generate_image action to see how it is done.

I haven't tested the PR yet. Once this is improved, I will test the PR.
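
A minimal sketch of the fallback suggested above, assuming the key is exposed through an environment-style mapping. The `GOOGLE_API_KEY` name, the backend labels, and both handler callables are hypothetical and do not reflect how generate_image actually does it.

```python
from typing import Callable, Dict, Mapping


def choose_video_backend(env: Mapping[str, str]) -> str:
    """Prefer native Gemini video understanding when a Google key is configured."""
    if env.get("GOOGLE_API_KEY"):  # hypothetical setting name
        return "gemini_native"
    return "keyframes"  # frame-sampling fallback; costlier and less effective


def understand_video_sketch(
    video_path: str,
    query: str,
    env: Mapping[str, str],
    gemini_fn: Callable[[str, str], str],
    keyframe_fn: Callable[[str, str], str],
) -> Dict[str, str]:
    """Dispatch to whichever backend the configuration allows."""
    backend = choose_video_backend(env)
    handler = gemini_fn if backend == "gemini_native" else keyframe_fn
    return {"backend": backend, "summary": handler(video_path, query)}
```

Passing the handlers in as arguments keeps the routing decision testable without any network calls.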

korivi-CraftOS and others added 6 commits April 16, 2026 15:18
- Install issues on Mac fixed
- Python compatibility & syntax issues fixed
- CLI skills updated
- Local LLM compatibility issues fixed
- Image action error fixed
@ahmad-ajmal
Collaborator

  1. generate_multimodal_multi_image looks like it could be folded into generate_multimodal; the only difference is that it accepts a list of images instead of one. Could we update generate_multimodal to handle both cases? That way we avoid the duplicated payload/token logic.
  2. describe_image_ocr seems to repeat most of describe_image_bytes (provider routing, token counting, cleanup). Would it work to call describe_image_bytes with the OCR system prompt and json_mode=False?
  3. The model is hardcoded as "gemini-1.5-pro" in understand_video.py. Should this pull from config (e.g. get_vlm_model()) to stay in sync with the rest of the VLM settings?
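
For point 1, one way to serve both the single-image and multi-image cases from a single code path is sketched below. `normalize_images` and `build_parts` are hypothetical helpers; the part layout is modeled on the `inline_data` shape of the Gemini REST API, and the project's real payload may differ.

```python
import base64
from typing import Any, Dict, List, Optional, Sequence


def normalize_images(
    image_bytes: Optional[bytes] = None,
    image_bytes_list: Optional[Sequence[bytes]] = None,
) -> List[bytes]:
    """Collapse the single-image and multi-image arguments into one list."""
    images = list(image_bytes_list or [])
    if image_bytes is not None:
        images.insert(0, image_bytes)  # keep the single image first
    if not images:
        raise ValueError("at least one image is required")
    return images


def build_parts(
    prompt: str, images: Sequence[bytes], mime_type: str = "image/jpeg"
) -> List[Dict[str, Any]]:
    """Build one text part followed by one inline image part per frame."""
    parts: List[Dict[str, Any]] = [{"text": prompt}]
    for img in images:
        data = base64.b64encode(img).decode("ascii")
        parts.append({"inline_data": {"mime_type": mime_type, "data": data}})
    return parts
```

With the inputs normalized up front, the payload and token-accounting logic only needs to exist once, which is the duplication the review asks to remove.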

AlanAAG and others added 8 commits April 17, 2026 16:40
…model

- Merge generate_multimodal_multi_image into generate_multimodal (image_bytes_list param)
- Add json_mode param to describe_image_bytes; describe_image_ocr now a thin wrapper
- understand_video pulls model from get_vlm_model() with gemini-1.5-pro fallback
- Add test suites: gemini_client_multimodal, vlm_json_mode, ocr_wrapper, understand_video_model
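
The wrapper and fallback described in this commit could be sketched as follows. This is an illustrative outline only: `VLMSketch`, the prompt text, and the injected `backend` callable are assumptions, and only the names describe_image_bytes, describe_image_ocr, json_mode, get_vlm_model and the "gemini-1.5-pro" fallback come from the commit message.

```python
from typing import Callable, Optional

OCR_SYSTEM_PROMPT = "Extract all visible text from the image verbatim."  # hypothetical
DEFAULT_VIDEO_MODEL = "gemini-1.5-pro"

# backend(image_bytes, system_prompt, user_prompt, json_mode) -> model output
Backend = Callable[[bytes, str, str, bool], str]


class VLMSketch:
    """Stand-in for the VLM interface; the provider call is injected."""

    def __init__(self, backend: Backend) -> None:
        self._backend = backend

    def describe_image_bytes(
        self,
        image_bytes: bytes,
        system_prompt: str,
        user_prompt: str,
        json_mode: bool = True,
    ) -> str:
        # Provider routing, token counting and cleanup would live here, once.
        return self._backend(image_bytes, system_prompt, user_prompt, json_mode)

    def describe_image_ocr(
        self, image_bytes: bytes, user_prompt: str = "Transcribe the text."
    ) -> str:
        # Thin wrapper: same code path, OCR prompt, plain-text output.
        return self.describe_image_bytes(
            image_bytes, OCR_SYSTEM_PROMPT, user_prompt, json_mode=False
        )


def resolve_video_model(get_vlm_model: Callable[[], Optional[str]]) -> str:
    """Use the configured VLM model, falling back to the previous default."""
    return get_vlm_model() or DEFAULT_VIDEO_MODEL
```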
Update version from 1.2.2 to 1.2.3 in settings config
Collaborator

@ahmad-ajmal ahmad-ajmal left a comment

Just one small fix and the merge conflicts

Comment thread agent_core/core/llm/google_gemini_client.py Outdated
@AlanAAG AlanAAG force-pushed the feature/ocr-video-actions branch from fb0be05 to 866601e Compare April 21, 2026 10:25
@ahmad-ajmal
Collaborator

I don't think you need to change 89 files in this PR

@AlanAAG
Collaborator Author

AlanAAG commented Apr 21, 2026

@ahmad-ajmal, hey, it's so many files because I just deleted, throughout the entire project, the future annotations import I had used for testing

Comment thread app/data/action/perform_ocr.py Outdated
except Exception as e:
    return {'status': 'error', 'summary': '', 'file_path': '', 'file_saved': False, 'message': str(e)}

execute = perform_ocr
Collaborator

what is this line for? (same in understand_video)

Comment thread agent_core/core/impl/vlm/interface.py Outdated
except Exception as e:
    logger.warning(f"[VLM] Failed to report usage: {e}")

def _openai_describe_bytes_plain(self, image_bytes: bytes, sys: str | None, usr: str) -> Dict[str, Any]:
Collaborator

almost identical to _openai_describe_bytes

Comment thread agent_core/decorators/log_events.py Outdated
@@ -1,4 +1,5 @@
# -*- coding: utf-8 -*-
from __future__ import annotations
Collaborator

very confusing change - is there a reason for this?

Comment thread agent_core/decorators/profiler.py Outdated
@@ -1,4 +1,5 @@
# -*- coding: utf-8 -*-
from __future__ import annotations
Collaborator

Same for here, is there a reason for this?

Collaborator

Don't push this; we don't want to enable MCPs by default

Comment thread app/config/settings.json
Collaborator

Don't push this

Collaborator

Don't push this; we don't want skills enabled by default

@ahmad-ajmal
Collaborator

ahmad-ajmal commented Apr 21, 2026

> @ahmad-ajmal, hey, it's so many files because I just deleted, throughout the entire project, the future annotations import I had used for testing

Revert the `from __future__ import annotations` deletions. They unnecessarily bulk up the PR, and those imports have been there for a long time. Don't worry, you didn't add them.

@zfoong zfoong changed the base branch from dev to V1.3.0 April 22, 2026 11:09
Comment thread agent_core/core/llm/google_gemini_client.py
Comment thread agent_core/core/impl/vlm/interface.py
@ahmad-ajmal ahmad-ajmal merged commit 76c1f47 into V1.3.0 Apr 22, 2026
@CraftOS-dev CraftOS-dev deleted the feature/ocr-video-actions branch April 26, 2026 02:55

4 participants