
fix(prospecting-overview): propose next steps only when useful, memory-aware, via native widget#105

Merged
milstan merged 4 commits into
mainfrom
ArtyETH06/improve-prospecting-overview-next-step
Jun 19, 2026

@ArtyETH06

@ArtyETH06 ArtyETH06 commented Jun 16, 2026

Contributor

What

leadbay_prospecting_overview was changed (in the first cut of this PR) to always end on a concrete next step. Milan pushed back: forcing a next step on every turn is the ChatGPT "why not?" annoyance — sometimes the status read is the complete answer and a manufactured next step erodes trust.

This revision reworks the instruction to address all three of Milan's points.

Fix

Replaced the unconditional "always propose" rule with a three-part rule in the prompt template:

  1. Decide when it helps. Propose a next step only on a real unfinished thread or blocker (fresh discovery batch, follow-ups due, quota/auth blocker). Skip it when there's no clear thread or the user only wanted status — "don't manufacture a next step just to have one."
  2. Lean on memory. Check _meta.agent_memory.summary for how this user reacts to next-step offers; default to not proposing if they routinely dismiss them, and capture the dismiss/accept signal via leadbay_agent_memory_capture so the preference compounds across sessions.
  3. When proposed, it's a native choice dialog — never prose. Route 2–4 mutually-exclusive moves through the host's next-step widget (ask_user_input_v0 / AskUserQuestion). Pulled in the next-steps/ask-user-input-routing snippet.
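Parts 1 and 2 of the rule can be sketched as a small decision function. This is only an illustration of the logic: in the PR the rule is a prose instruction in the prompt template, and the agent reads a free-text memory summary rather than counters. The `OverviewState` and `OfferHistory` types, the `decideNextStep` name, and the dismissal thresholds are all hypothetical.

```typescript
// Hypothetical shapes: none of these names or thresholds come from the PR.
type OverviewState = {
  freshDiscoveryBatch: boolean;
  followUpsDue: number;
  blocker: "quota" | "auth" | null;
  statusOnlyRequest: boolean; // the user only asked for a status read
};

// Stand-in for what the agent would infer from _meta.agent_memory.summary.
type OfferHistory = { offered: number; dismissed: number };

type Decision = { propose: boolean; reason: string };

// Assumed thresholds: at least 3 past offers, 80% or more dismissed.
function routinelyDismisses(h: OfferHistory): boolean {
  return h.offered >= 3 && h.dismissed / h.offered >= 0.8;
}

function decideNextStep(state: OverviewState, history: OfferHistory): Decision {
  // Part 1: skip when the status read is the complete answer.
  if (state.statusOnlyRequest) {
    return { propose: false, reason: "status-only request" };
  }
  // Part 2: default to not proposing for users who routinely dismiss offers.
  if (routinelyDismisses(history)) {
    return { propose: false, reason: "user routinely dismisses offers" };
  }
  // Part 1 again: propose only on a real unfinished thread or blocker.
  if (state.blocker !== null) {
    return { propose: true, reason: `${state.blocker} blocker` };
  }
  if (state.freshDiscoveryBatch) {
    return { propose: true, reason: "fresh discovery batch" };
  }
  if (state.followUpsDue > 0) {
    return { propose: true, reason: "follow-ups due" };
  }
  return { propose: false, reason: "no unfinished thread" };
}
```

Checking memory before the thread checks means a routinely dismissing user is never offered a step, even with follow-ups due. Whether a hard blocker should override that preference is a design choice the PR text does not settle.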

WORKFLOWS.md WF7 success criterion softened to match: proposing none is acceptable when the status read is complete; if proposed, it must be a concrete widget-routed action, not reflexive filler.

Files: leadbay_prospecting_overview.md.tmpl (source) + regenerated prompts.generated.ts and the plugin SKILL.md; WORKFLOWS.md contract.

Eval proof (live /eval --workflow 7)

Workflow: 7 — Prospecting overview (leadbay_prospecting_overview), single-turn, routing prompt body injected. Live Leadbay API, 3 consecutive runs.

| run | MM | IA | NF | TSF | pass | next step | quota read |
|-----|----|----|----|-----|------|-----------|------------|
| 1 | 5 | 5 | 5 | 4 | PASS | ask_user_input_v0 widget, 4 concrete options | AUTH_EXPIRED: reported honestly, no fabrication |
| 2 | 5 | 5 | 5 | 5 | PASS | ask_user_input_v0 widget, 4 concrete options | AUTH_EXPIRED: reported honestly, no fabrication |
| 3 | 5 | 5 | 5 | 5 | PASS | ask_user_input_v0 widget, 4 concrete options | AUTH_EXPIRED: reported honestly, no fabrication |

  • MM mission-match · IA instruction-adherence · NF no-fabrication · TSF tool-selection-fit
  • 3/3 PASS · 2/3 perfect 5/5/5/5. Invariants held every run (leadbay_account_status called; leadbay_report_outreach / mutating tools absent).
  • The TSF 4 in run 1 is judge variance on an extra read-only leadbay_pull_leads call the agent made to enrich the overview — sensible, not a contract violation. Runs 2 and 3 judged the same call 5/5/5/5. Not a clean 3×5/5/5/5 and I'm not claiming one.
  • Milan's third ask verified live: every run routed the next step through the ask_user_input_v0 native widget (e.g. [Refine the lens, Check audience filters, Reconnect first, Triage board]), not prose.
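
The invariants and the widget routing described in these bullets could be checked against a recorded tool-call trace roughly as follows. This is a sketch under assumptions, not the harness's actual code: the `ToolCall` shape, the `checkOverviewTrace` name, and the `options` argument on the widget call are hypothetical, and the mutating-tool set contains only the one tool this PR names.

```typescript
// Hypothetical trace record; the real harness format is not shown in the PR.
type ToolCall = { name: string; args?: Record<string, unknown> };

// Only leadbay_report_outreach is named in the PR; other mutating tools are not listed.
const MUTATING_TOOLS = new Set(["leadbay_report_outreach"]);
const WIDGET_TOOLS = new Set(["ask_user_input_v0", "AskUserQuestion"]);

// Returns a list of violations; an empty list means the trace passes.
function checkOverviewTrace(calls: ToolCall[]): string[] {
  const violations: string[] = [];

  if (!calls.some((c) => c.name === "leadbay_account_status")) {
    violations.push("leadbay_account_status was not called");
  }

  for (const c of calls) {
    if (MUTATING_TOOLS.has(c.name)) {
      violations.push(`mutating tool called: ${c.name}`);
    }
    if (WIDGET_TOOLS.has(c.name)) {
      // Assumption: the widget receives its choices in an `options` array.
      const options = c.args?.options;
      const count = Array.isArray(options) ? options.length : 0;
      if (count < 2 || count > 4) {
        violations.push(`widget offered ${count} options, expected 2 to 4`);
      }
    }
  }
  return violations;
}
```

A read-only call such as `leadbay_pull_leads` passes this check untouched, which matches the reading above that the extra call in run 1 was not a contract violation.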

Not covered / caveats

  • The live eval account's quota endpoint returns AUTH_EXPIRED (401), so the agent's honest handling of a failed quota read is what's exercised — quota-figure rendering against real numbers was not tested (same caveat as the prior revision).
  • The memory-suppression branch (don't propose when the user routinely dismisses) is not exercised by this eval — the harness resets agent memory to a fresh-user baseline each run, so the "returning user who always dismisses" path is asserted by the prompt instruction only, not by a live run. Worth a dedicated multi-turn / seeded-memory scenario before relying on it.
  • Single-turn workflow.

The prospecting-overview prompt orients the user but never explicitly required a
next-step proposal, so an overview could end on bare status — leaving the user
stranded (WF7 contract requires 'proposed a concrete next step'). Add an explicit
instruction to close with a concrete next step matched to the user's state
(discovery batch / follow-ups due / unblock quota+auth).

Verified live: 5/5/5/5 across 3 consecutive eval runs (vs IA:3 without the
instruction); next step proposed every round, no fabrication on the live
AUTH_EXPIRED quota state.

Co-Authored-By: Claude <noreply@anthropic.com>
@ArtyETH06 ArtyETH06 self-assigned this Jun 16, 2026
@ArtyETH06

Contributor Author

🤖 Auto self-review (/eval --improve/review)

The --improve loop opened this PR, then ran /review on the diff. Result: clean — no critical or auto-fix findings, 0 changes applied.

Scope: CLEAN — delivered exactly the intent (one "propose a concrete next step" instruction); no creep.
Generated files in sync: .md.tmpl source + prompts.generated.ts + the assembled SKILL.md all carry the line (build-emitted, not hand-edited).
Structured checklist (SQL / races / shell injection / enums / type coercion): N/A — prose-only prompt instruction, no code paths.
Specialists: skipped (6-line diff, under the 50-line threshold).

Adversarial pass: 2 advisory notes, both at or below 3/10 confidence, not actionable:

  • (conf 3/10) Instruction is unconditional, but correctly scoped to the account-overview flow — there's no competing STOP/await gate in this prompt for it to override.
  • (conf 2/10) Theoretical same-turn overlap with per-tool NEXT STEPS blocks — different altitude (routing nudge vs per-lead table), and defer-to-tool-rendering already governs placement.

Adversarial verdict: ship-as-is — the defer-to-tool-rendering gate explicitly whitelists next-action recommendations, and the line directly fulfills the WORKFLOWS row-7 contract ("proposed a concrete next step").

Eval state unchanged at 5/5/5/5 ×3 (no review fix was applied, so no re-eval was needed). PR stays draft — a human reviews and merges.

@ArtyETH06 ArtyETH06 marked this pull request as ready for review June 17, 2026 17:54
@ArtyETH06 ArtyETH06 requested a review from milstan June 17, 2026 17:54

@milstan milstan left a comment

Contributor

I am not sure we want to force the agent to always do it - sometimes it does not make sense at all, and we're forcing it to pick next steps to show.

I think we may want to be more subtle here - leverage our concept of memory - to remember whether the user is dismissing next-step proposals or accepting them (if always dismissing, then do not propose).

Also, we may want to actually teach the agent to decide when next steps make sense and when they are just "why not" - I am personally very annoyed that ChatGPT always proposes next steps, even when the work is clearly done.

On the other hand I think we want, when next steps are proposed, to make sure they are presented in the form of a native UI choice dialog.

…y-aware, via native widget

Address Milan's review: don't force a next step on every overview (the ChatGPT
"why not" annoyance). Teach the agent to decide WHEN a next step is genuinely
useful vs. reflexive filler, lean on agent memory to suppress proposals for users
who routinely dismiss them, and render any proposal through the native choice
dialog (ask_user_input_v0 / AskUserQuestion) rather than prose. Soften the WF7
contract criterion to match — proposing none is acceptable when the status read
is a complete answer.

Co-Authored-By: Claude <noreply@anthropic.com>
@ArtyETH06 ArtyETH06 changed the title fix(prospecting-overview): always propose a concrete next step fix(prospecting-overview): propose next steps only when useful, memory-aware, via native widget Jun 17, 2026
ArtyETH06 and others added 2 commits June 17, 2026 15:19
…ippet references

The ask-user-input-routing snippet ends with "pick rows from the (Observation,
Suggest, Calls) table below" and "call the matching Calls tool" — but the overview
had no such table, so in the no-next_steps case (e.g. leadbay_account_status) the
required widget path was under-specified and the agent could hallucinate options or
skip the proposal. Add an overview-specific table (discovery batch / follow-ups /
quota / auth blocker / lens mismatch / healthy → propose none) so the reference
resolves deterministically. Addresses review P2.

Co-Authored-By: Claude <noreply@anthropic.com>
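
The overview-specific table this commit describes can be pictured as a lookup from observed state to (Observation, Suggest, Calls) rows. The sketch below is illustrative only. The `Observation` union mirrors the six states the commit message lists, the Suggest wording is hypothetical, and the Calls entries are filled in only where this PR names a tool (`leadbay_pull_leads`, `leadbay_pull_followups`) and left `null` elsewhere.

```typescript
// The six states listed in the commit message; identifiers are hypothetical.
type Observation =
  | "discovery_batch"
  | "followups_due"
  | "quota_blocker"
  | "auth_blocker"
  | "lens_mismatch"
  | "healthy";

type NextStepRow = { observation: Observation; suggest: string; calls: string | null };

// "healthy" deliberately has no row: that state proposes nothing.
const NEXT_STEPS: NextStepRow[] = [
  { observation: "auth_blocker", suggest: "Reconnect first", calls: null }, // tool not stated in the PR
  { observation: "quota_blocker", suggest: "Review quota", calls: null }, // tool not stated in the PR
  { observation: "discovery_batch", suggest: "Triage the new batch", calls: "leadbay_pull_leads" },
  { observation: "followups_due", suggest: "Work the follow-ups due", calls: "leadbay_pull_followups" },
  { observation: "lens_mismatch", suggest: "Refine the lens", calls: null }, // gated on ICP details
];

// Resolve observed states to rows in table order, capped at 4 widget options.
function pickRows(observed: Observation[]): NextStepRow[] {
  return NEXT_STEPS.filter((row) => observed.includes(row.observation)).slice(0, 4);
}
```

Because the rows come from a fixed table rather than from free generation, the same observed state always yields the same options, which is the deterministic resolution the commit is after.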
…ate audience adjust

Two user-visible bugs in the added NEXT STEPS table (review P2 ×2):
- leadbay_daily_check_in / leadbay_followup_check_in are MCP PROMPTS, not
  agent-callable tools; a host that can't invoke a prompt from a turn would stall
  or call a non-existent tool. Route to the underlying tools instead
  (leadbay_pull_leads / leadbay_pull_followups), with a note that the Calls column
  must always be a callable leadbay_* tool, never a prompt name.
- The "Adjust the lens audience" option called leadbay_adjust_audience directly,
  but the option carries no sectors/sizes/exclusions and the tool has no required
  inputs — an empty call writes the current filter / may clone the default lens
  (no-op or unwanted change). Gate it: ASK for the ICP details first, then call
  with those params; never call with no args.

Co-Authored-By: Claude <noreply@anthropic.com>
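
Both fixes in this commit amount to guards that can be sketched in code. The first rejects a Calls entry that names an MCP prompt instead of a callable tool. The second refuses to call `leadbay_adjust_audience` with no arguments. The tool and prompt names come from this thread, while the `AudienceParams` shape, the function names, and the membership of `CALLABLE_TOOLS` are assumptions for illustration.

```typescript
// Names taken from this PR thread; the sets are illustrative, not exhaustive.
const PROMPT_NAMES = new Set(["leadbay_daily_check_in", "leadbay_followup_check_in"]);
const CALLABLE_TOOLS = new Set([
  "leadbay_pull_leads",
  "leadbay_pull_followups",
  "leadbay_adjust_audience",
  "leadbay_account_status",
]);

// Returns an error message, or null when the Calls entry is a callable tool.
function validateCalls(name: string): string | null {
  if (PROMPT_NAMES.has(name)) return `${name} is an MCP prompt, not a callable tool`;
  if (!CALLABLE_TOOLS.has(name)) return `${name} is not a known callable tool`;
  return null;
}

// Hypothetical parameter shape, based on "sectors/sizes/exclusions" in the commit message.
type AudienceParams = { sectors?: string[]; sizes?: string[]; exclusions?: string[] };

type AudiencePlan =
  | { action: "ask_user" }
  | { action: "call"; tool: string; args: AudienceParams };

// Gate: ask for ICP details first; never call the tool with no args.
function planAudienceAdjust(params: AudienceParams): AudiencePlan {
  const hasDetails = Object.values(params).some(
    (value) => Array.isArray(value) && value.length > 0,
  );
  if (!hasDetails) return { action: "ask_user" };
  return { action: "call", tool: "leadbay_adjust_audience", args: params };
}
```

Treating an empty array the same as a missing field keeps a call like `{ sectors: [] }` from slipping through as a de facto empty call.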
@milstan

milstan commented Jun 17, 2026

Contributor

Why does it say "The live eval account's quota endpoint returns AUTH_EXPIRED (401)"? Do we need to infer that the results it obtained were in fact considered passing because it could not authenticate, so it just thought it's OK?

@ArtyETH06

Contributor Author

> Why does it say "The live eval account's quota endpoint returns AUTH_EXPIRED (401)"? Do we need to infer that the results it obtained were in fact considered passing because it could not authenticate, so it just thought it's OK?

Verifying that would require #109 to be merged.

@milstan milstan merged commit cc7b304 into main Jun 19, 2026
1 check passed