refactor: migrate PyDeequ Bot to the shared Shadow engine#277
Conversation
Replace the vendored scripts/issue_bot/ bot with a thin caller of the reusable sudsali/shadow workflow (SHA-pinned), matching what awslabs/deequ now does. This kills the code drift between per-repo vendored copies and upgrades PR review from the legacy single-pass path to the 3-agent Investigator/Critic/Reporter pipeline. - .github/workflows/issue-bot.yml: caller of sudsali/shadow@54ec94e - .shadow.yml: PyDeequ codebase config; bot.name kept as deequ-bot so the existing auto-approve.yml clean-marker grep keeps matching - remove vendored scripts/issue_bot/ and its tests/test_bot.py - update-kb.yml: drop stale paths-ignore for the deleted bot dir
| poetry install | ||
| poetry run pip install pyspark==$SPARK_VERSION | ||
| poetry run python -m pytest -s tests --ignore=tests/test_bot.py | ||
| poetry run python -m pytest -s tests |
There was a problem hiding this comment.
BUG: Removing --ignore=tests/test_bot.py without the corresponding deletion of tests/test_bot.py being reflected in the same commit would cause CI to fail if the file still existed. However, the diff shows tests/test_bot.py IS deleted in this PR, so the ignore removal is correct. The real concern is different: the CI matrix will now attempt to import scripts/issue_bot modules (since test_bot.py no longer exists, this is fine). DISPROVED — no issue here.
tests/test_bot.py is deleted in the same PR diff, so removing --ignore is consistent.
| prompt_sm_prefix: pydeequ-bot | ||
| secrets: | ||
| AWS_ROLE_ARN: ${{ secrets.AWS_ROLE_ARN }} | ||
| GUARDRAIL_ID: ${{ secrets.GUARDRAIL_ID }} |
There was a problem hiding this comment.
EDGE_CASE: BEDROCK_REPORTER_MODEL_ID and BEDROCK_CRITIC_MODEL_ID are passed via ${{ vars.* }} (repository variables) but placed under the secrets: block. GitHub Actions allows passing vars as secret inputs (they're just strings), but if the reusable workflow declares these as secrets with required: true, the workflow will fail at startup if the repository variables are not set. Since these are new inputs not present in the old workflow, they must be configured before merge or the bot will break on every trigger.
Lines 57-58:
BEDROCK_REPORTER_MODEL_ID: ${{ vars.BEDROCK_REPORTER_MODEL_ID }}andBEDROCK_CRITIC_MODEL_ID: ${{ vars.BEDROCK_CRITIC_MODEL_ID }}usevars.*context. If the reusable workflow marks these as required secrets and the vars are not configured, the workflow fails.
| run: python -m issue_bot.main act | ||
| working-directory: scripts | ||
| uses: sudsali/shadow/.github/workflows/shadow-review.yml@54ec94e0ca8c90d9b58ff95a2a06b175a115784e | ||
| with: |
There was a problem hiding this comment.
EDGE_CASE: pr_number: ${{ inputs.issue_number }} is only populated during workflow_dispatch. For issues, pull_request_target, and issue_comment events, inputs.issue_number is empty/null, so pr_number will be passed as an empty string. The reusable workflow must handle this gracefully (e.g., by falling back to the event's own issue/PR number). If the reusable workflow requires a non-empty pr_number, the bot will fail on all non-dispatch triggers.
Line 44:
pr_number: ${{ inputs.issue_number }}—inputs.*context is only available forworkflow_dispatch; for other event types it evaluates to empty string.
Summary
Migrates the PyDeequ Bot from a vendored copy of the engine (
scripts/issue_bot/) to consuming the shared Shadow engine (sudsali/shadow) via its reusable workflow, SHA-pinned to54ec94e. One engine maintained upstream instead of a per-repo fork, matching what awslabs/deequ now does. Net diff removes ~2.6k lines of vendored code + tests.What Changes
issue-bot.ymlwith a thin caller ofsudsali/shadow/.github/workflows/shadow-review.yml(all four triggers preserved: issues, issue_comment, pull_request_target, workflow_dispatch).shadow.yml— engine config (Python codebase, bot identity, attribution footer)scripts/issue_bot/) and its tests (tests/test_bot.py); keepsscripts/generate_kb.pytests/test_bot.pyreferences left inupdate-kb.yml(paths-ignore) andbase.yml(pytest--ignore)Behavior Change: PR review moves to the 3-agent pipeline
The vendored bot ran the legacy single-pass file-review path. The shared engine runs the 3-agent Investigator → Critic → Reporter pipeline (the Critic disproves the Investigator's findings before anything is posted, cutting false positives). Issue triage, issue-respond, and follow-up behavior are unchanged.
Behavior Preserved
pydeequ-bot/*), selected viaprompt_sm_prefix: pydeequ-bot; the engine fetches all eight (PR investigator/critic/reporter + 2 commit nudges + issue-classify/respond/followup) in place of its bundled defaults. No new secrets.<!-- deequ-bot:clean -->; the bot's marker identity is deliberately kept asdeequ-botsoauto-approve.ymlneeds no change.Architecture
Security
substaysrepo:awslabs/python-deequ:*); an optionaljob_workflow_reftrust narrowing may be applied after merge (applying it before breaks the base-branch vendored bot)uses:andshadow_refboth pin the exact commit; a moved upstream tag can't change what runs@/#auto-link neutralized, marker-spoof-proof)Files
.github/workflows/issue-bot.yml.shadow.yml.github/workflows/update-kb.ymlpaths-ignorefor the removed bot test.github/workflows/base.yml--ignorefor the removed bot testscripts/issue_bot/tests/test_bot.pyTesting
workflow_dispatchdry_run: trueon GitHub Actions against a live PR:analyze+actsucceed, prompts resolve fromsm:pydeequ-bot/*, the 3-agent pipeline produces findings, and the review is pinned to the analyzed commit. (Results linked in a comment below once the run completes.)Rollback
Revert this PR — the vendored
scripts/issue_bot/returns and runs as before. No data or state migration; secrets and repo variables are untouched.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.