Gpt image 2 support #193
Merged
Signed-off-by: 杨峻骁 <yangjunx21@mails.tsinghua.edu.cn>
POSIX-specific assertions were baked into six tests, which made the full `pnpm test` suite fail on Windows even though the production code is already cross-platform:
- token-store: `0o600` mode bits aren't enforceable on NTFS (reports 0o666); guard the assertion with `process.platform !== 'win32'`.
- skills/loader: `new URL(...).pathname` yields `/D:/...` on Windows, so `readdir` sees zero files; use `fileURLToPath()` instead.
- opencode-config, locale-ipc, preferences-ipc: replace hard-coded forward-slash path strings with `path.join()`-built expectations that mirror whatever separator the host OS uses.
- boot-fallback: `/dev/null/...` is only guaranteed-unwritable on POSIX; build a parent-is-a-regular-file path instead so `mkdirSync` throws ENOTDIR on both platforms.

All 10 workspace packages' tests now pass on Windows.

Made-with: Cursor
Signed-off-by: 杨峻骁 <yangjunx21@mails.tsinghua.edu.cn>
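The boot-fallback change above can be illustrated with a minimal sketch. The helper name `mkdirUnderRegularFile` and the temp-directory layout are hypothetical and not taken from the PR, and the use of `{ recursive: true }` is an assumption about how the real test calls `mkdirSync`. The idea is only that a path whose parent is a regular file cannot be created as a directory, without relying on `/dev/null`.

```typescript
import { mkdirSync, mkdtempSync, rmSync, writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Hypothetical helper (not from the PR): returns the error code raised when
// trying to create a directory underneath a regular file.
export function mkdirUnderRegularFile(): string | undefined {
  // Build every path with join() so the separator matches the host OS.
  const root = mkdtempSync(join(tmpdir(), 'boot-fallback-'));
  try {
    const blocker = join(root, 'not-a-dir');
    writeFileSync(blocker, ''); // a regular file standing where a directory is expected
    try {
      // Assumption: the real test uses the recursive option.
      mkdirSync(join(blocker, 'child'), { recursive: true });
      return undefined; // unexpected: the directory was created
    } catch (err) {
      return (err as { code?: string }).code;
    }
  } finally {
    // Always clean up the temp directory, even when mkdirSync throws.
    rmSync(root, { recursive: true, force: true });
  }
}
```

A test can then assert on the returned code instead of depending on a POSIX-only device path.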
Made-with: Cursor
Signed-off-by: 杨峻骁 <yangjunx21@mails.tsinghua.edu.cn>
CodeQL flagged the anchored-quantifier pair `/\/+$/` + `/^\/+/` inside `joinEndpoint` (packages/providers/src/images.ts) as a potential ReDoS on library-supplied input. Replace both regex calls with explicit single-pass scans over the trailing/leading `/` characters: same behaviour, trivially linear, no CodeQL alert.

Also unblock the Windows test run on this branch:
- `token-store.test.ts`: a new 0o600-mode assertion added on main fails on NTFS (always reports 0o666); guard it the same way the existing sibling assertion is guarded.
- `safe-read.test.ts`: the symlink-acceptance case requires admin / Developer Mode on Windows and otherwise throws EPERM; skip the case when symlink creation is denied, keeping full coverage on POSIX CI.

Signed-off-by: 杨峻骁 <yangjunx21@mails.tsinghua.edu.cn>
Made-with: Cursor
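The regex-free trimming described above might look roughly like the sketch below. The real `joinEndpoint` signature is not shown in this thread, so the two-string parameter list and the single-slash join are assumptions made for illustration.

```typescript
// Hypothetical re-implementation; the actual function in
// packages/providers/src/images.ts may take different parameters.
const SLASH = 47; // char code of '/'

export function joinEndpoint(baseUrl: string, path: string): string {
  // Walk backwards over trailing '/' characters of the base (one pass).
  let end = baseUrl.length;
  while (end > 0 && baseUrl.charCodeAt(end - 1) === SLASH) end--;

  // Walk forwards over leading '/' characters of the path (one pass).
  let start = 0;
  while (start < path.length && path.charCodeAt(start) === SLASH) start++;

  // Join the trimmed halves with exactly one separator.
  return `${baseUrl.slice(0, end)}/${path.slice(start)}`;
}
```

Each loop visits every character at most once, so the cost is linear in the input length and there is no regex backtracking for a static analyser to flag.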
Resolve conflict in Settings.tsx: close JSX tags around the new ImageGenerationPanel Save button/wrappers that were dropped during the merge, and drop hardcoded hex fallbacks (#16a34a, #d97706) and text-[10px] in the image-generation status badge in favor of --color-success / --color-warning / text-[var(--text-xs)] tokens.

Made-with: Cursor
Contributor
Findings
- [Major] Enabled image generation silently degrades by disabling the image tool when credentials are missing. This violates the project rule to avoid silent fallbacks and can produce non-image outputs without explicit user-facing failure. Evidence: apps/desktop/src/main/image-generation-settings.ts:124, apps/desktop/src/main/index.ts:378, apps/desktop/src/main/index.ts:433.
Suggested fix:
const cfg = getCachedConfig();
const imageConfig = cfg ? resolveImageGenerationConfig(cfg) : null;
if (cfg?.imageGeneration?.enabled === true && imageConfig === null) {
  throw new CodesignError(
    'Image generation is enabled but credentials are missing or invalid. Update Settings > Image Generation.',
    ERROR_CODES.PROVIDER_AUTH_MISSING,
  );
}
Summary
- Review mode: initial
- Not found in repo/docs: docs/VISION.md, docs/PRINCIPLES.md.
Testing
- Not run (automation)
open-codesign Bot
throw err;
}
}
: undefined;
Contributor
[Major] This branch silently disables generate_image_asset when image generation is enabled but credentials are unavailable (resolveImageGenerationConfig returns null). That is a silent fallback rather than a surfaced error.
Suggested fix:
const cfg = getCachedConfig();
const imageConfig = cfg ? resolveImageGenerationConfig(cfg) : null;
if (cfg?.imageGeneration?.enabled === true && imageConfig === null) {
  throw new CodesignError(
    'Image generation is enabled but credentials are missing or invalid. Update Settings > Image Generation.',
    ERROR_CODES.PROVIDER_AUTH_MISSING,
  );
}
This was referenced Apr 26, 2026
Sun-sunshine06 added a commit that referenced this pull request Jun 6, 2026
… 1 of #225) (#241)

## Summary

Phase 1 of #225: a single-image → componentized `ui_kit/` decomposition pipeline that emits a coding-agent-ready bundle, plus deterministic + vision verifiers that self-check parity using a 12-question boolean rubric and re-iterate on gaps. Uses existing `userImages` plumbing (PR #193) and adds three new agent tools that mirror existing patterns (`done.ts` / `generate-image-asset.ts`).

Ends in the chat sidebar with a one-click trigger that fires a structured prompt, walks the agent through decompose → verify → reconcile → done, and surfaces per-decompose cost as a toast. No new prod deps, no SQLite schema change, in-memory output via the Files panel.

This PR addresses Phase 1 of #225 only. The Phase 2 (gpt-image-2 generation in the loop) and Phase 3 (multi-page flow) cuts I committed to in the issue thread are intentionally not included.

## 2026-06-06 rebase update

Rebased onto current `OpenCoworkAI/open-codesign:main` at `b2d020d` and force-pushed the PR branch to `eed7cbc`. GitHub now reports the PR as mergeable again.

The conflict resolution preserves current `main` architecture:

- `packages/core/src/index.ts` keeps the current `inspect_workspace` public exports and only appends the visual parity types/functions. The legacy `read_design_system` core public export was not restored.
- Generate IPC wiring now lives in `apps/desktop/src/main/ipc/generate.ts`; runtime FS source-image seeding lives in `apps/desktop/src/main/ipc/runtime-fs.ts`.
- Renderer cost-toast logic now lives in the sliced store at `apps/desktop/src/renderer/src/store/slices/chat.ts`.
- The first image attachment is seeded as `source.png` for `verify_ui_kit_visual_parity`, with regression coverage in `apps/desktop/src/main/index.workspace.test.ts`.
Local verification after rebase:

- `pnpm lint`
- `pnpm --filter @open-codesign/core typecheck`
- `pnpm --filter @open-codesign/desktop typecheck`
- `pnpm --filter @open-codesign/core test`
- `pnpm --filter @open-codesign/desktop test -- src/main/index.workspace.test.ts src/main/ipc/generate.workspace-rename.test.ts`
- `pnpm --filter @open-codesign/providers test`

Note: local pre-push full `pnpm test` hit a transient timeout in `packages/providers/src/codex/oauth-server.test.ts` during the concurrent turbo run; the same providers test passed immediately when rerun directly. GitHub CI is now the source of truth for the full matrix on the pushed head.

## Type of change

- [x] New feature

## Linked issue

Refs #225 (Phase 1 only; Phase 2/3 deferred per [my comment](#225 (comment)))

## What's in here

**3 new agent tools** in `packages/core/src/tools/`:

1. `decompose-to-ui-kit.ts`: orchestrator. Takes a source image (from chat context) + design brief, emits `ui_kits/<slug>/{index.html, components/*.tsx, tokens.css, manifest.json, README.md}` to the virtual FS. Output carries `schemaVersion: 1` so downstream coding agents (Claude Code, Cursor) can evolve safely.
2. `verify-ui-kit-parity.ts`: deterministic verifier. 3 signals: element-count parity, visible-text coverage, token coverage. Returns a `ParityReport` with `passCount/totalChecks` derived score (no LLM in the loop, no floats).
3. `verify-ui-kit-visual-parity.ts`: vision-LLM judge wrapper. Takes a host-injected `judgeVisualParity` callback, runs a 12-check boolean rubric across 5 dimensions (layout / color / typography / content / components), returns `parityScore = passCount / totalChecks` and a bounded-enum `status` (`verified | needs_review | needs_iteration | failed | unavailable`).
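To make the deterministic verifier in item 2 more concrete, here is a rough sketch of how three boolean signals could roll up into a `passCount/totalChecks` report. Only the `ParityReport` name and the three signal names come from the description above; the `Expected` and `Emitted` shapes, the tag-counting heuristic, and the `verifyParity` function are assumptions for illustration, not the PR's implementation.

```typescript
// Hypothetical shapes; only the ParityReport name appears in the PR text.
interface ParityCheck { id: string; passed: boolean }
interface ParityReport { checks: ParityCheck[]; passCount: number; totalChecks: number }
interface Expected { elementCount: number; texts: string[]; tokens: string[] }
interface Emitted { html: string; tokensCss: string }

// Count opening tags: a '<' immediately followed by an ASCII letter.
function countOpeningTags(html: string): number {
  let count = 0;
  for (let i = 0; i + 1 < html.length; i++) {
    if (html[i] === '<' && /[a-z]/i.test(html[i + 1])) count++;
  }
  return count;
}

export function verifyParity(expected: Expected, emitted: Emitted): ParityReport {
  // Strip tags to approximate the text a user would actually see.
  const visibleText = emitted.html.replace(/<[^>]*>/g, ' ').toLowerCase();

  const checks: ParityCheck[] = [
    // Signal 1: the emitted markup has at least as many elements as expected.
    { id: 'element-count', passed: countOpeningTags(emitted.html) >= expected.elementCount },
    // Signal 2: every expected string appears in the visible text.
    { id: 'visible-text', passed: expected.texts.every((t) => visibleText.includes(t.toLowerCase())) },
    // Signal 3: every expected design token is declared in the stylesheet.
    { id: 'token-coverage', passed: expected.tokens.every((t) => emitted.tokensCss.includes(t)) },
  ];

  // The score is reported as two integers, not a model-supplied confidence.
  return { checks, passCount: checks.filter((c) => c.passed).length, totalChecks: checks.length };
}
```

Because every check is a plain boolean computed from the emitted files, a dropped label or missing token shows up as a failed check before any LLM judge is consulted.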
**Host wiring** in `apps/desktop/src/main/`:

- `render-ui-kit.ts`: offscreen `BrowserWindow.capturePage()` for the rendered ui_kit
- `judge-visual-parity.ts`: vision-judge prompt builder + LLM dispatcher using the existing `complete()` provider abstraction
- `ipc/generate.ts`: injects `renderUiKit` + `judgeVisualParity` into the agent runtime alongside `generate_image_asset`
- `ipc/runtime-fs.ts`: seeds image attachments into the runtime FS, including default `source.png` for visual parity

**Renderer**:

- `AddMenu.tsx`: new "Decompose to UI Kit" entry, disabled when no artifact / generation in flight
- `Sidebar.tsx`: `triggerDecompose(designId, locale)` action wired to the menu item
- `store.ts` / `store/slices/chat.ts`: 3-branch toast feedback (busy / unavailable / started) + per-tool-call cost row when the visual judge resolves
- `hooks/decomposePrompt.ts`: locale-aware (EN/ZH) structured prompt that walks the agent through decompose → verify → reconcile → iterate (max 2) → done with HONEST cost summary

**Tests**: full vitest coverage in `*.test.ts` next to each tool:

- `decompose-to-ui-kit.test.ts` (263 LOC)
- `verify-ui-kit-parity.test.ts` (180 LOC)
- `verify-ui-kit-visual-parity.test.ts` (295 LOC)

**i18n**: 9 new keys × EN + ZH for the menu entry, toast titles/descriptions, and cost row.

## Design decisions

**Boolean rubric, not floats.** Every visual parity check is `{passed: boolean}`, derived `parityScore = passCount / totalChecks`. The `status` field is a bounded enum derived from thresholds (100% → `verified`, ≥85% → `needs_review`, ≥60% → `needs_iteration`, <60% → `failed`). No LLM-fabricated confidence floats, no scoring inflation. Aligns with the project's `HONEST_SCORES` precedent (`done.ts`'s `verified: boolean` field).

**Host-injected callbacks, not framework lock-in.** `verify-ui-kit-visual-parity.ts` doesn't import any LLM SDK or any Electron API. It takes `RenderUiKitFn` and `JudgeVisualParityFn` as deps.
If the host doesn't inject them (e.g. a future headless CLI), the tool returns `status: 'unavailable'` honestly instead of crashing. Mirrors how `generate_image_asset` is keyed on `deps.generateImageAsset`.

**In-memory output via Files panel, no schema bump.** Per my open binary in the issue thread, this PR ships option (a): the `ui_kits/<slug>/` lands in the design's virtual FS, surfaces in the existing Files panel, and uses the existing ZIP export for handoff to a coding agent. No SQLite migration, smallest blast radius, consistent with how `polishPrompt.ts`'s second-pass mutates only in-memory state.

**`schemaVersion: 1` on the manifest.** Downstream consumers (Claude Code, Cursor) need a stable contract. Adding fields requires no version bump; renaming or removing fields requires `schemaVersion: 2` and a parallel-emit window.

## Anti-hallucination guardrails

The deterministic verifier (`verify-ui-kit-parity.ts`) checks visible-text coverage on the emitted ui_kit vs the source brief: if the agent dropped any text content, it fails BEFORE the LLM judge runs. This catches data hallucination cheaply. The LLM judge then handles only semantic-quality dimensions (visual hierarchy, color harmony, typography pairing, etc.).

## Cost surfacing

Every `verify_ui_kit_visual_parity` resolution pushes a toast with `passCount/totalChecks · status · $cost.NNNN`. Reads defensively from `result.details` so future contract drift degrades silently rather than crashing the renderer. The `done` tool's prompt-driven summary additionally requires the agent to report total run cost, per the `HONEST_STATUS` precedent.
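The threshold mapping and the `unavailable` path from the Design decisions section can be sketched as follows. The status names, the thresholds, and the `RenderUiKitFn` / `JudgeVisualParityFn` type names come from the description; their signatures, the `VisualParityResult` shape, and the function names `statusFromCounts` and `verifyVisualParity` are assumptions made for this example.

```typescript
type ParityStatus = 'verified' | 'needs_review' | 'needs_iteration' | 'failed' | 'unavailable';

interface RubricCheck { id: string; passed: boolean }
interface VisualParityResult {
  status: ParityStatus;
  passCount: number;
  totalChecks: number;
  parityScore: number | null;
}

// Assumed callback signatures; the PR only names the types.
type RenderUiKitFn = (slug: string) => Promise<Uint8Array>;
type JudgeVisualParityFn = (source: Uint8Array, rendered: Uint8Array) => Promise<RubricCheck[]>;

// Map pass counts to a bounded status using integer comparisons only.
export function statusFromCounts(passCount: number, totalChecks: number): ParityStatus {
  if (totalChecks <= 0) return 'failed'; // assumption: an empty rubric never verifies
  if (passCount === totalChecks) return 'verified'; // 100%
  if (passCount * 100 >= totalChecks * 85) return 'needs_review'; // >= 85%
  if (passCount * 100 >= totalChecks * 60) return 'needs_iteration'; // >= 60%
  return 'failed'; // < 60%
}

export async function verifyVisualParity(
  slug: string,
  sourcePng: Uint8Array,
  deps: { renderUiKit?: RenderUiKitFn; judgeVisualParity?: JudgeVisualParityFn },
): Promise<VisualParityResult> {
  // Without host-injected callbacks, report 'unavailable' instead of throwing.
  if (!deps.renderUiKit || !deps.judgeVisualParity) {
    return { status: 'unavailable', passCount: 0, totalChecks: 0, parityScore: null };
  }
  const rendered = await deps.renderUiKit(slug);
  const checks = await deps.judgeVisualParity(sourcePng, rendered);
  const passCount = checks.filter((c) => c.passed).length;
  const totalChecks = checks.length;
  return {
    status: statusFromCounts(passCount, totalChecks),
    passCount,
    totalChecks,
    parityScore: totalChecks === 0 ? null : passCount / totalChecks,
  };
}
```

Keeping the tool free of Electron and SDK imports means a host only has to pass two functions, and a host that passes none still gets a well-typed result.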
## Checklist

- [x] I read [`docs/VISION.md`](../docs/VISION.md), [`docs/PRINCIPLES.md`](../docs/PRINCIPLES.md), and [`CLAUDE.md`](../CLAUDE.md) before starting
- [x] Commits are signed with DCO (`git commit -s`)
- [x] Rebased onto current `main`; `pnpm lint`, targeted typechecks, core test, desktop runtime/generate tests, and providers test pass locally (full GitHub CI is re-running on `eed7cbc`)
- [x] Added/updated tests for the change (738 LOC across 3 new test files)
- [x] Added a changeset (`pnpm changeset`); see `.changeset/decompose-to-ui-kit.md`
- [x] Updated docs if behavior changed: `BENCHMARKS.md` (new), `README.md` + `README.zh-CN.md` (Decompose to UI Kit feature card + hero PNG + iter-reel GIF)

## Dependency additions (if any)

None. All three new tools use only `@mariozechner/pi-agent-core`'s `AgentTool` factory pattern that's already a prod dep.

## Screenshots / recordings (UI changes)

**Side-by-side hero: source vs agent-emitted ui_kit (`e2e-opus-final` run, parityScore 0.90):**

**4-frame reconcile reel from the `e2e-nodebench-iter` run (iter-0 → iter-1 with honest score drift 0.82 → 0.78; boolean rubric exposes the regression instead of hiding it):**

[MP4 version](https://raw.githubusercontent.com/HomenShum/open-codesign/feat/decompose-to-ui-kit/website/public/demos/decompose-iter-reel.mp4) for higher fidelity.

**Live-recorded session demo** (real Electron app, no stitching): recording in progress, will edit this PR description when the GIF is ready. ETA same day.

## Cross-tier benchmarks

`BENCHMARKS.md` at repo root has the full methodology + run-by-run real-data results across model tiers (Opus, Pro+Pro+iterate, Kimi+Gemini3, NodeBench iter), reproducibility instructions, honest non-claims, and research citations (WebDevJudge, Prometheus-Vision, Trust-but-Verify ICCV 2025).
| Run | Decompose | Judge | parityScore | Gaps surfaced |
|---|---|---|---:|---:|
| e2e-opus-final | claude-opus-4-1 | claude-opus-4-1 | 0.90 | 4 |
| e2e-nodebench-iter (iter-0) | gemini-3-pro-preview | gemini-3-pro-preview | 0.82 | 6 |
| e2e-nodebench-iter (iter-1) | gemini-3-pro-preview | gemini-3-pro-preview | 0.78 | 5 |
| e2e-bank-kimi-gemini3 | kimi-k2.6 | gemini-3-pro-preview | 0.78 | 8 |
| e2e-nodebench-B | kimi-k2.6 | gemini-3-pro-preview | 0.60 | 7 |

Note the iter-0 → iter-1 regression on the same source: agent fixed some gaps but introduced new layout drift. The boolean rubric exposes this honestly rather than fudging the score upward. This is the intended behavior, not a bug.

## Scope discipline notes

- **PR size**: ~1500 LOC of substantive change (3 tools + 3 test files + agent wiring + i18n + 1 hook). Most of the diff stat (`pnpm-lock.yaml`) is mechanical regen. This is over the soft 400-LOC bar in CONTRIBUTING.md, but it's been pre-discussed in #225 and the change is a single concern (one new feature path, no refactor mixed in). Happy to split into 3 PRs (per-tool) if maintainer prefers; say the word.
- **What's NOT in scope** (from #225 thread): multi-page flow (Phase 3, separate issue), gpt-image-2 generation step (Phase 2, separate Discussion), persistence-to-disk (option (b) from the binary I posed; staying with option (a) for blast radius)
- **Three systemic dependencies surfaced during dogfood** (rollback / capability-aware failover / spiral-detector): filing as separate Discussions in `Ideas` category, not bundling here. Each is a meaningful subsystem that deserves alignment before code.

## Branch state at PR open

- 9 commits ahead of `upstream/main`
- 11 commits behind (mostly `chore(deps)` bumps including pi-agent-core 0.67.68 → 0.70.2; my branch is on 0.67.68)
- **Will rebase against latest main on request**: wanted to open the PR with the as-built state for clarity first.
The pi-agent-core 0.70.2 bump may require small adjustments to the new tools' `AgentTool` shape; I'll handle that in the rebase pass.

## Why this is ready to review now

- Real cross-tier benchmarks in `BENCHMARKS.md`, not synthetic
- Visual proof embedded above (hero + reel)
- Test coverage matches existing tools
- Pattern conformance: every new file mirrors an existing precedent
- Deliberate scope: closes Phase 1 of the issue cleanly, defers the rest visibly

Looking forward to feedback. Happy to address structural concerns first before iterating on smaller polish.

---------

Signed-off-by: homen <hshum2018@gmail.com>
Signed-off-by: Sun-sunshine06 <Sun-sunshine06@users.noreply.github.com>
Co-authored-by: Sun-sunshine06 <Sun-sunshine06@users.noreply.github.com>
Summary
Lets the agent call gpt-image-2 (or OpenRouter image models) on demand to generate bitmap assets (logos, hero images, illustrations) while producing a design, and embeds them seamlessly in preview + exports. Off by default; users opt in from Settings.
Type of change
Linked issue
Checklist
- I read `docs/VISION.md`, `docs/PRINCIPLES.md`, and `CLAUDE.md` before starting
- Commits are signed with DCO (`git commit -s`)
- `pnpm lint && pnpm typecheck && pnpm test` passes locally
- Added a changeset (`pnpm changeset`) if user-visible
Dependency additions (if any)
Screenshots / recordings (UI changes)