Add adversarial safety fixtures for deep code reasoning mcp (model context protocol mcp server)

## Summary

Exercise prompt/tool/data poisoning and fail-closed behavior for the repo's most sensitive agent-facing path.

This issue was generated from an org-wide EvalOps mining pass on 2026-05-10 07:57 UTC. It combines live GitHub repo signals with a per-repo arXiv search. Treat the research links as grounding for a concrete implementation, not as a request for a literature review.

## Repo Evidence

- Repository description: A Model Context Protocol (MCP) server that provides advanced code analysis and reasoning capabilities powered by Google's Gemini AI
- Tree signals: 0 docs files, 1 workflows, 0 proto files, 6 test-like files.
- `README.md:46` includes latent-spec language: *Note: After installation, you'll need to update the file path to your actual installation directory and set your `GEMINI_API_KEY`.*
- `README.md:251` includes latent-spec language: When Claude needs deep iterative analysis with Gemini:
- `README.md:286` includes latent-spec language: // Claude Code: Identifies the error pattern and suspicious code sections // Escalate to Gemini when: Need to correlate 1000s of trace spans across 10+ services // Gemini: Processes the full trace timeline, identifies the exact race window
- `README.md:296` includes latent-spec language: // Claude Code: Quick profiling, identifies hot paths // Escalate to Gemini when: Need to analyze weeks of performance metrics + code changes // Gemini: Correlates deployment timeline with perf metrics, pinpoints the exact commit
- `README.md:302` includes latent-spec language: When you have theories but need extensive testing:
- `README.md:306` includes latent-spec language: // Claude Code: Forms initial hypotheses based on symptoms // Escalate to Gemini when: Need to test 20+ scenarios with synthetic data // Gemini: Uses code execution API to validate each hypothesis systematically

## Research Grounding

Repo axes: tooling, security, evaluation, governance

Search keywords: gemini, code, claude, string, analysis, api, when, your, file, server, google, need

- [arXiv:2508.07575v1](https://arxiv.org/abs/2508.07575v1) MCPToolBench++: A Large Scale AI Agent Model Context Protocol MCP Tool Use Benchmark (Shiqing Fan, Xichen Ding, Liang Zhang, Linjian Mo), 2025.
- [arXiv:2602.01129v1](https://arxiv.org/abs/2602.01129v1) SMCP: Secure Model Context Protocol (Xinyi Hou, Shenao Wang, Yifan Zhang, Ziluo Xue, Yanjie Zhao, Cai Fu), 2026.
- [arXiv:2407.00121v1](https://arxiv.org/abs/2407.00121v1) Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks (Ibrahim Abdelaziz, Kinjal Basu, Mayank Agarwal, Sadhana Kumaravel, Matthew Stallone, Rameswar Panda), 2024.
- [arXiv:2507.19570v1](https://arxiv.org/abs/2507.19570v1) MCP4EDA: LLM-Powered Model Context Protocol RTL-to-GDSII Automation with Backend Aware Synthesis Optimization (Yiting Wang, Wanghao Ye, Yexiao He, Yiran Chen, Gang Qu, Ang Li), 2025.
- [arXiv:2410.17950v1](https://arxiv.org/abs/2410.17950v1) Benchmarking Floworks against OpenAI & Anthropic: A Novel Framework for Enhanced LLM Function Calling (Nirav Bhan, Shival Gupta, Sai Manaswini, Ritik Baba, Narun Yadav, Hillori Desai), 2024.
- [arXiv:2602.18764v2](https://arxiv.org/abs/2602.18764v2) The Convergence of Schema-Guided Dialogue Systems and the Model Context Protocol (Andreas Schlapbach), 2026.
- [arXiv:2501.10132v1](https://arxiv.org/abs/2501.10132v1) ComplexFuncBench: Exploring Multi-Step and Constrained Function Calling under Long-Context Scenario (Lucen Zhong, Zhengxiao Du, Xiaohan Zhang, Haiyi Hu, Jie Tang), 2025.
- [arXiv:2605.02244v1](https://arxiv.org/abs/2605.02244v1) The Conversations Beneath the Code: Triadic Data for Long-Horizon Software Engineering Agents (Yelin Kim), 2026.
- [arXiv:2503.23803v2](https://arxiv.org/abs/2503.23803v2) Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute (Yingwei Ma, Yongbin Li, Yihong Dong, Xue Jiang, Rongyu Cao, Jue Chen), 2025.
- [arXiv:2504.00914v1](https://arxiv.org/abs/2504.00914v1) On the Robustness of Agentic Function Calling (Ella Rabinovich, Ateret Anaby-Tavor), 2025.

## What To Build

- Add adversarial fixtures for prompt/tool/memory poisoning.
- Document the intended fail-closed behavior and any allowed degraded-mode fallback.
- Add regression coverage that proves unsafe inputs do not silently reach the privileged path.

## Acceptance Criteria

- [ ] A short design note names the repo-specific workflow, threat or correctness model, and the research assumptions being adopted.
- [ ] A runnable check, fixture, or verifier exercises the new contract in CI or an equivalent local command documented in the repo.
- [ ] The implementation emits or stores enough evidence for a downstream agent/operator to cite inputs, decisions, and outputs.
- [ ] At least one negative/degraded-mode case is covered so failures are observable rather than silently accepted.
- [ ] Documentation links the new behavior to the relevant EvalOps platform primitive or explicitly records why this repo remains standalone.

## Notes

- Generated issue 3/5 for `evalops/deep-code-reasoning-mcp` by `evalops_org_miner.py`.
- Before implementation, confirm the sampled latent-spec snippets still match `main`; this issue intentionally cites exact file paths/lines where the mining pass saw them.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add adversarial safety fixtures for deep code reasoning mcp (model context protocol mcp server) #38

Summary

Repo Evidence

Research Grounding

What To Build

Acceptance Criteria

Notes

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Add adversarial safety fixtures for deep code reasoning mcp (model context protocol mcp server) #38

Description

Summary

Repo Evidence

Research Grounding

What To Build

Acceptance Criteria

Notes

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions