Skip to content

Add user confirmation before sending code to external embedding APIs isuue #174#179

Closed
Bakul2006 wants to merge 8 commits intotirth8205:mainfrom
Bakul2006:bakul2006/warningAPICall
Closed

Add user confirmation before sending code to external embedding APIs isuue #174#179
Bakul2006 wants to merge 8 commits intotirth8205:mainfrom
Bakul2006:bakul2006/warningAPICall

Conversation

@Bakul2006
Copy link
Copy Markdown
Contributor

-Added a confirmation prompt before invoking external embedding APIs

  • Ensures that code is not sent outside the local environment without explicit user consent

    Behavior

  • If user confirms → API call proceeds

  • If user declines → operation is aborted

  • In non-interactive environments (e.g., AI agents, CI) → call is blocked by default

    Impact

  • Prevents accidental code exfiltration

  • Improves transparency and user control

  • Keeps behavior aligned with expectations of local-first tooling

@Bakul2006
Copy link
Copy Markdown
Contributor Author

This change adds a minimal confirmation step before sending code to external embedding APIs. Happy to adjust the behavior based on feedback.

@tirth8205
Copy link
Copy Markdown
Owner

Code Review — Concept valid, implementation has a critical bug

The goal of preventing accidental code exfiltration to cloud embedding APIs is correct. However the implementation has a critical bug and should not be merged as-is.

Critical Bug: input() corrupts the MCP stdio protocol

The confirmation prompt in code_review_graph/embeddings.py (or wherever you placed it — the diff shows it in tools/docs.py) calls input(), which writes to stdout. The MCP server uses stdio transport, where stdout is the JSON-RPC channel. Writing a user-facing prompt to stdout corrupts the protocol and breaks any MCP client connected to the server.

Example: Claude Code calling embed_graph via MCP would receive a corrupted response like:

WARNING: This will send codebase content to google APIs. Continue? (yes/no): {"result": ...}

The JSON parser would fail on the prefix.

Correct approach

For non-interactive environments (MCP server, CI, AI agents), the confirmation should refuse by default without any stdout output. In interactive TTY environments, the prompt must go to stderr (not stdout):

def _confirm_cloud_usage(provider_name: str) -> bool:
    """Confirm cloud provider usage. Safe for both MCP and CLI use."""
    if not sys.stdin.isatty():
        return False  # Non-interactive: always refuse
    # Write to stderr so MCP's stdout channel is not corrupted
    sys.stderr.write(
        f"\nWARNING: embed_graph will send codebase content to {provider_name} APIs.\n"
        "Continue? (yes/no): "
    )
    sys.stderr.flush()
    try:
        return sys.stdin.readline().strip().lower() in ("yes", "y")
    except (EOFError, KeyboardInterrupt):
        return False

Other issues

  1. Placement: The confirmation logic belongs in tools/docs.py's embed_graph() function (the MCP tool layer), not inside EmbeddingStore itself. EmbeddingStore is a library class used in tests and scripts — injecting a confirmation prompt there breaks all automated use.

  2. EmbeddingStore.provider.name.startswith("local:"): The check for whether a provider is local uses a string prefix. This is fragile. Better to check the provider type: isinstance(emb_store.provider, LocalProvider) or add an is_local property.

  3. CI/testing: Any test that creates a non-local EmbeddingStore will now hang waiting for user input unless EOFError is raised. Your try/except EOFError handles this, but only if input() raises it — which it does in non-TTY test environments. The sys.stdin.isatty() check is the reliable way to detect this.

What to keep from this PR

The concept: before invoking a cloud embedding provider, check and block in non-interactive environments, warn in interactive ones. The EmbeddingStore wrapping approach in embed_graph() is the right place. The EOFError/KeyboardInterrupt handling is correct.

Please fix the stdout corruption issue and resubmit. The change is small (move to sys.stdin.isatty() + sys.stderr.write()) but critical.

@Bakul2006
Copy link
Copy Markdown
Contributor Author

Addressed all review feedback and resolved conflicts.
Ready for review.

@tirth8205
Copy link
Copy Markdown
Owner

Skipping auto-merge: after updating against current main, CI shows the following problems:

ruff (13 errors) — mostly W293 (blank line whitespace) and F841 (unused variables); the diff wasn't run through ruff check code_review_graph/ before push.

mypy (1 error): code_review_graph/main.py:739: "FastMCP" has no attribute "_local_only" — accessing a private attribute that doesn't exist.

All 4 test jobs (3.10–3.13) pass, so the feature itself looks fine; just needs a lint/type-check pass. Please run ruff check --fix code_review_graph/ and mypy code_review_graph/ --ignore-missing-imports --no-strict-optional and push a new commit, and I'll re-review.

tirth8205 added a commit that referenced this pull request Apr 11, 2026
…#174)

Closes #174 (supersedes #179 which overlapped with #207 and used
input()-on-stdio that would corrupt MCP transport).

``get_provider()`` now prints an explicit warning to **stderr** (never
stdout/stdin) before returning a Google Gemini or MiniMax provider:

    ⚠️  code-review-graph: about to embed code via the 'google'
        cloud provider.
        Your source code (function names, docstrings, file paths)
        will be sent to an external API.
        To skip this warning in future runs, set
        CRG_ACCEPT_CLOUD_EMBEDDINGS=1 in your environment.
        To stay fully offline, use the default 'local' provider
        instead (no API key needed).

Why stderr: the MCP stdio transport uses stdout for JSON-RPC. Writing
to stdout or reading from stdin would corrupt the transport. stderr
is safe and is captured by Claude Code's MCP logs.

Why a warning not a block: the user already explicitly chose a cloud
provider (either via ``provider="google"`` arg or via the MiniMax API
key + explicit provider selection). Blocking would be paternalistic;
warning is informative.

The ``CRG_ACCEPT_CLOUD_EMBEDDINGS=1`` env var lets CI / scripted
workflows acknowledge once and move on without repeated noise.

Tests (4 new in TestCloudProviderWarning):
- test_minimax_triggers_stderr_warning — captures stderr, asserts
  warning content, asserts stdout stays clean
- test_google_triggers_stderr_warning — same for google
- test_accept_env_var_suppresses_warning — CRG_ACCEPT_CLOUD_EMBEDDINGS=1
  silences the warning
- test_local_provider_never_warns — offline path stays quiet

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
tirth8205 added a commit that referenced this pull request Apr 11, 2026
… (supersedes #204 #207 #179) (#228)

* docs(troubleshooting): add quick-reference for the 4 most common support patterns

These four issues account for the majority of support questions:

1. "Hooks use a matcher + hooks array" — users on pre-v2.2.3 releases.
   Answer: upgrade to v2.2.4 and re-run `code-review-graph install`.
2. "code-review-graph: command not found" after pip install — PATH issue.
   Answer: use pipx, uvx, or `python -m code_review_graph`.
3. "Is this project-scoped or user-scoped?" — 4-piece scope table
   (package is user-scoped, graph+mcp config are project-scoped,
   multi-repo registry is user-scoped).
4. "Built the graph but Claude Code doesn't see it in a new session" —
   didn't restart Claude Code / wrong cwd / ran build but not install.

The new section is placed at the top of TROUBLESHOOTING.md so it's
found first by anyone searching for these error messages.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add Elixir parser support (#112)

Closes #112. tree-sitter-elixir is already bundled with
tree-sitter-language-pack so no new runtime dep.

Elixir's AST is uniform: ``defmodule``, ``def``/``defp``/``defmacro``,
``alias``/``import``/``require``/``use``, and ordinary function
invocations all share the ``call`` node type and are distinguished
only by the leading identifier text. Added a dedicated
``_extract_elixir_constructs`` helper (following the R and Lua pattern)
that dispatches by first-identifier:

- ``defmodule Name do ... end`` -> Class node, recurse into do_block
  with ``enclosing_class=Name``
- ``def`` / ``defp`` / ``defmacro`` / ``defmacrop`` -> Function/Test
  node attached to the enclosing module, recurse into do_block with
  ``enclosing_func=<fn>`` so nested calls resolve to qualified names
- ``alias`` / ``import`` / ``require`` / ``use`` -> IMPORTS_FROM edge
  targeting the module name (flat or dotted, e.g. ``Foo.Bar``)
- Anything else -> CALLS edge with the leading identifier as the target
  name, followed by recursion into arguments + do_block for nested
  calls

Helpers added:
- ``_elixir_call_identifier`` — handles both plain identifiers and
  ``dot > alias + identifier`` for ``IO.puts`` style calls
- ``_elixir_module_name`` — extracts a module name from an arguments
  subtree (``alias`` node or ``dot`` node)
- ``_elixir_function_name_and_params`` — extracts the function name
  from the inner call node inside ``def``'s arguments

Verified on ``tests/fixtures/sample.ex`` (two modules with 6 functions
total across public/private def/defp, plus alias/import/require):
- 9 nodes, 16 edges
- 2 Class nodes (Calculator, MathHelpers)
- 6 Function nodes with correct parent_name
- 3 IMPORTS_FROM (alias, import, require)
- 5 CALLS with internal resolution: Calculator.compute -> Calculator.add,
  MathHelpers.double -> Calculator.compute, MathHelpers.triple ->
  MathHelpers.double, etc.

Tests (7 new in TestElixirParsing):
- Language detection (.ex + .exs)
- Module -> Class mapping
- def/defp producing Function nodes with correct parents
- alias/import/require producing IMPORTS_FROM
- Internal call resolution to qualified names
- CONTAINS edges linking module -> functions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: apply_refactor_tool dry-run mode returning unified diffs (#176)

Closes #176. The existing rename_preview / apply_refactor flow was
technically two-step, but apply_refactor wrote to disk immediately so
there was no way to inspect the exact diff before committing.

apply_refactor() now takes an optional ``dry_run: bool`` parameter. When
true:
- Compute the same edit plan as the real write path, grouping edits by
  file so multiple edits to the same file apply sequentially against
  updated content (fixes a subtle bug where separate edits on the same
  file could stomp each other)
- Return a unified-diff string per would-be-modified file using
  ``difflib.unified_diff``
- Leave the refactor_id in the pending cache so the user can review the
  diff then call again with ``dry_run=False`` to commit

The MCP tool wrapper (``apply_refactor_tool`` in main.py +
``apply_refactor_func`` in tools/refactor_tools.py) now exposes the
``dry_run`` parameter. The real-write path was also refactored to
share the plan-computation step with dry-run — multi-edit files now
work correctly in both modes.

Tests (2 new in test_refactor.py):
- test_apply_refactor_dry_run_returns_diff_without_writing: confirms
  the diff contains both old and new names, the file on disk is
  unchanged, the refactor_id is still valid after the dry-run, and a
  follow-up real apply with the same id succeeds and consumes the id.
- test_apply_refactor_dry_run_no_edits: empty edit list returns an
  empty diffs dict.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: CRG_DATA_DIR / CRG_REPO_ROOT env vars for out-of-tree graph storage (#155)

Closes #155 (supersedes #207 which had unresolved review comments about
input()-on-stdio and _local_only fragility). New env vars let users
keep the .code-review-graph directory outside their working tree —
useful for ephemeral workspaces, Docker volumes, shared CI caches, and
multi-repo orchestrators where scripting the CLI from outside the
repo is necessary.

## CRG_DATA_DIR

When set, replaces the default ``<repo>/.code-review-graph`` directory
verbatim. New ``get_data_dir(repo_root)`` helper in incremental.py is
the single source of truth — it creates the directory (respecting the
override) and writes the auto-generated inner ``.gitignore`` with ``*``
so graph files are never committed.

Call sites updated to use ``get_data_dir()``:
- ``incremental.py::get_db_path`` (the graph.db location)
- ``cli.py::visualize`` (graph.html)
- ``cli.py::wiki`` (wiki output dir)
- ``tools/docs.py::generate_wiki_func`` (MCP wiki tool)
- ``tools/docs.py::get_wiki_page_func`` (MCP wiki lookup)

The legacy ``<repo>/.code-review-graph.db`` migration is preserved
but only runs when CRG_DATA_DIR is unset (there's no relationship
between the legacy location and an override location).

## CRG_REPO_ROOT

``find_project_root()`` now checks ``CRG_REPO_ROOT`` before the usual
git-root walk. Useful for anyone scripting ``code-review-graph`` from
a cwd that isn't inside the target repo (daemons, CI jobs, wrappers).
Falls through to normal resolution if the path doesn't exist.

## Deliberate design choices (vs PR #207)

- **No input() / confirmation prompts** — this branch of code runs
  from the CLI entry and can also be pulled into MCP wrappers which
  pipe stdio. Reading from stdin there would corrupt JSON-RPC.
- **No ``mcp._local_only`` attribute** — that approach was fragile
  because it touched a private FastMCP attribute.
- **Tilde expansion** — both env vars go through ``Path.expanduser()``
  so ``CRG_DATA_DIR=~/Library/crg`` works.

Tests (5 new in TestDataDir):
- test_default_uses_repo_subdir — unset env, default behavior
- test_env_override_replaces_repo_subdir — CRG_DATA_DIR works
- test_get_db_path_uses_data_dir — graph.db follows the override
- test_find_project_root_env_override — CRG_REPO_ROOT works
- test_find_project_root_env_override_missing_dir_falls_through —
  bogus override doesn't crash, falls back to git root

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: stderr warning before cloud embedding providers exfiltrate code (#174)

Closes #174 (supersedes #179 which overlapped with #207 and used
input()-on-stdio that would corrupt MCP transport).

``get_provider()`` now prints an explicit warning to **stderr** (never
stdout/stdin) before returning a Google Gemini or MiniMax provider:

    ⚠️  code-review-graph: about to embed code via the 'google'
        cloud provider.
        Your source code (function names, docstrings, file paths)
        will be sent to an external API.
        To skip this warning in future runs, set
        CRG_ACCEPT_CLOUD_EMBEDDINGS=1 in your environment.
        To stay fully offline, use the default 'local' provider
        instead (no API key needed).

Why stderr: the MCP stdio transport uses stdout for JSON-RPC. Writing
to stdout or reading from stdin would corrupt the transport. stderr
is safe and is captured by Claude Code's MCP logs.

Why a warning not a block: the user already explicitly chose a cloud
provider (either via ``provider="google"`` arg or via the MiniMax API
key + explicit provider selection). Blocking would be paternalistic;
warning is informative.

The ``CRG_ACCEPT_CLOUD_EMBEDDINGS=1`` env var lets CI / scripted
workflows acknowledge once and move on without repeated noise.

Tests (4 new in TestCloudProviderWarning):
- test_minimax_triggers_stderr_warning — captures stderr, asserts
  warning content, asserts stdout stays clean
- test_google_triggers_stderr_warning — same for google
- test_accept_env_var_suppresses_warning — CRG_ACCEPT_CLOUD_EMBEDDINGS=1
  silences the warning
- test_local_provider_never_warns — offline path stays quiet

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@tirth8205
Copy link
Copy Markdown
Owner

Thank you for this — closing as superseded by PR #228, which is a clean reimplementation of your changes on top of current `main` (v2.2.4 + the features from #227). The bundle ships in v2.3.0.

Your original design + issue discussion pushed this forward; really appreciate the work. If there's anything in the reimplementation you think is missing or wrong, please open a follow-up issue and I'll address it quickly.

@tirth8205 tirth8205 closed this Apr 11, 2026
tirth8205 added a commit that referenced this pull request Apr 11, 2026
Additive feature release on top of v2.2.4 — new languages, new platform,
new env vars, new MCP tool parameters. No breaking changes.

New:
- #112 Elixir parser (.ex/.exs) — modules, def/defp/defmacro,
  alias/import/require/use, internal call resolution
- #88 Objective-C parser (.m) — @interface/@implementation, message
  expressions, C-style main, #import
- #197 Bash/Shell parser (.sh/.bash/.zsh) — functions, commands, source
  imports with path resolution
- #83 Qwen Code as a supported MCP install platform
- #176 apply_refactor_tool dry-run mode (returns unified diffs without
  writing, keeps refactor_id valid for follow-up)
- #155 CRG_DATA_DIR / CRG_REPO_ROOT env vars for out-of-tree graph
  storage and scripting from outside the repo
- #173 install preview + --no-instructions + -y/--yes (supersedes
  closed PR #204)
- #174 cloud embeddings stderr warning + CRG_ACCEPT_CLOUD_EMBEDDINGS
  opt-out (supersedes closed PR #179)

Fixed:
- Multi-edit refactor correctness bug (same-file edits could stomp
  each other)

Docs:
- TROUBLESHOOTING.md quick-reference section for the 4 most common
  support questions (hooks schema error, command not found, scoping,
  "built but not visible")

Verified locally on Python 3.11:
- ruff clean, mypy clean, bandit clean
- 735 tests pass (+18 vs v2.2.4), coverage 74.63%
- 6-repo smoke test still passes end-to-end on express/fastapi/flask/
  gin/httpx/next.js

Supersedes closed PRs:
- #204 (install preview, credit @lngyeen)
- #207 (CRG_DATA_DIR, credit @yashmewada9618)
- #179 (cloud embeddings warning, credit @Bakul2006)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
tirth8205 added a commit that referenced this pull request Apr 11, 2026
Additive feature release on top of v2.2.4 — new languages, new platform,
new env vars, new MCP tool parameters. No breaking changes.

New:
- #112 Elixir parser (.ex/.exs) — modules, def/defp/defmacro,
  alias/import/require/use, internal call resolution
- #88 Objective-C parser (.m) — @interface/@implementation, message
  expressions, C-style main, #import
- #197 Bash/Shell parser (.sh/.bash/.zsh) — functions, commands, source
  imports with path resolution
- #83 Qwen Code as a supported MCP install platform
- #176 apply_refactor_tool dry-run mode (returns unified diffs without
  writing, keeps refactor_id valid for follow-up)
- #155 CRG_DATA_DIR / CRG_REPO_ROOT env vars for out-of-tree graph
  storage and scripting from outside the repo
- #173 install preview + --no-instructions + -y/--yes (supersedes
  closed PR #204)
- #174 cloud embeddings stderr warning + CRG_ACCEPT_CLOUD_EMBEDDINGS
  opt-out (supersedes closed PR #179)

Fixed:
- Multi-edit refactor correctness bug (same-file edits could stomp
  each other)

Docs:
- TROUBLESHOOTING.md quick-reference section for the 4 most common
  support questions (hooks schema error, command not found, scoping,
  "built but not visible")

Verified locally on Python 3.11:
- ruff clean, mypy clean, bandit clean
- 735 tests pass (+18 vs v2.2.4), coverage 74.63%
- 6-repo smoke test still passes end-to-end on express/fastapi/flask/
  gin/httpx/next.js

Supersedes closed PRs:
- #204 (install preview, credit @lngyeen)
- #207 (CRG_DATA_DIR, credit @yashmewada9618)
- #179 (cloud embeddings warning, credit @Bakul2006)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants