Skip to content

perf(db): add compound indexes to fix query regression from TS migration#632

Merged
carlos-alm merged 5 commits intomainfrom
perf/benchmark-regression-indexes
Mar 26, 2026
Merged

perf(db): add compound indexes to fix query regression from TS migration#632
carlos-alm merged 5 commits intomainfrom
perf/benchmark-regression-indexes

Conversation

@carlos-alm
Copy link
Copy Markdown
Contributor

Summary

  • Add migration v16 with idx_edges_kind_target and idx_edges_kind_source compound indexes on the edges table. The fan-in/fan-out GROUP BY subqueries (used by fnDeps, roles, triage) need kind-first indexes to avoid full-table scans. Without them, the 32% graph growth from the TypeScript migration caused a disproportionate 136% fnDeps slowdown.
  • Optimize classifyNodeRoles (both full and incremental paths) to skip ~4,100 parameter/property nodes from the expensive fan-in/fan-out JOINs. These leaf nodes are always dead-leaf by definition and can never have callers or callees.

Measured locally: fan-in subquery ~8ms → 0.43ms, roles query 7ms → 5.3ms, fnDeps path 8.5ms → ~1ms.

Addresses regressions reported in #625, #626, #627.

Test plan

  • Full test suite passes (2,128 tests, 0 failures)
  • codegraph build --no-incremental succeeds, new indexes created
  • codegraph stats shows unchanged node/edge counts
  • Local benchmark confirms query latency improvement

…tion

Add migration v16 with idx_edges_kind_target and idx_edges_kind_source
indexes. The fan-in/fan-out GROUP BY subqueries used by fnDeps, roles,
and triage need kind-first indexes to avoid full-table scans on the
edges table. Without them, a 32% growth in graph density (from the
TypeScript migration) caused a disproportionate 136% query slowdown.

Also optimize classifyNodeRoles (both full and incremental paths) to
skip parameter and property nodes from the expensive fan-in/fan-out
JOINs. These ~4,100 nodes are always classified as dead-leaf by
definition and can never have callers or callees.

Measured locally: fan-in subquery ~8ms → 0.43ms, roles query 7ms → 5.3ms.
@claude
Copy link
Copy Markdown

claude bot commented Mar 26, 2026

Claude finished @carlos-alm's task —— View job


Reviewing PR: Performance indexes for query regression fix

Todo List:

  • Read and analyze changed files
  • Phase 1: Root Cause Analysis
  • Phase 2: Configuration Change Detection
  • Phase 3: Test Integrity
  • Phase 4: Code Quality Review
  • Phase 5: Backlog Compliance Assessment
  • Provide structured final assessment

Phase 1: Root Cause Analysis ✓

Problem: TypeScript migration caused 32% graph growth → 136% fnDeps slowdown due to missing compound indexes
Solution: Add idx_edges_kind_target/idx_edges_kind_source indexes + skip 4,100 leaf nodes from expensive JOINs
Risk: Low - purely additive, preserves correctness

Phase 2: Configuration Change Detection ✓

HIGH-RISK CHANGES: ✅ NONE - No config/linting/build changes

Phase 3: Test Integrity ✓

RED FLAGS: ✅ NONE - No test modifications, author reports all 2,128 tests pass

Phase 4: Code Quality Review ✓

MIGRATION v16: Proper compound index ordering, uses IF NOT EXISTS, follows patterns
OPTIMIZATION: Correctly identifies parameter/property as structural leaves, safe 4,100-node skip
TECHNICAL: Index order matches access patterns, incremental path preserved, no behavior changes

Phase 5: Backlog Compliance Assessment ✓

Root Cause Analysis

Problem Being Solved: TypeScript migration regression - 32% graph growth caused disproportionate 136% query slowdown due to missing compound indexes for fan-in/fan-out subqueries.
Why This Approach: Database-first solution targeting the actual bottleneck (index scans) plus algorithmic optimization (skip leaf nodes that can't participate in call relationships).
Risk Assessment: Minimal - purely additive schema changes + logically sound optimization that preserves all semantics.

Backlog Compliance

  • Zero-dep: ✅ - No new runtime dependencies, builds on existing SQLite/tree-sitter infrastructure
  • Foundation-aligned: ✅ - Pure performance optimization, no cloud APIs, maintains offline-first principle
  • Problem-fit: 5 - Directly addresses core issue: prevents AI agents from experiencing slow codegraph queries that waste tokens during re-orientation in large codebases
  • Breaking: No - Schema migrations are backward compatible, no API changes, preserves all existing behavior
  • Tier: 1 (zero-dep ✅ + foundation-aligned ✅)

Final Recommendation

  • Rating: ⭐⭐⭐⭐⭐ (5/5)
  • Action: APPROVE
  • Reasoning: Textbook performance fix addressing a real regression with proper root cause analysis. Database indexes are correctly ordered for access patterns, optimization correctly identifies structural properties of parameter/property nodes, no correctness risks, comprehensive test validation, and directly improves AI agent experience by preventing query slowdowns.

Excellent work - this is exactly how performance regressions should be addressed: measure the bottleneck, fix the underlying issue with proper database design, and add algorithmic optimizations based on domain knowledge.


@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 26, 2026

Greptile Summary

This PR addresses a query regression caused by graph growth from the TypeScript migration by adding two compound SQLite indexes and skipping ~4,100 leaf nodes from expensive fan-in/fan-out computations.\n\n- Migration v16 adds idx_edges_kind_target ON edges(kind, target_id) and idx_edges_kind_source ON edges(kind, source_id). The column order is intentional and correct: kind is the equality filter (WHERE kind = 'calls'), so placing it first allows SQLite to narrow the scan to only 'calls' edges before resolving target_id/source_id. Both GROUP BY subqueries in the full path and correlated subqueries in the incremental path benefit directly.\n- classifyNodeRolesFull and classifyNodeRolesIncremental now pre-classify parameter and property nodes as dead-leaf with a cheap scan, then exclude them from the heavy JOIN/subquery path. The idsByRole map is correctly pre-seeded before the main rows loop — callable nodes that the classifier also resolves to dead-leaf push into the same backing array rather than overwriting it. The early-return guard is widened to rows.length === 0 && leafRows.length === 0, preventing a silent no-op when a file contains only leaf-kind nodes.\n- All logic paths (full, incremental, zero-rows edge case) look correct. The previously noted over-fetching of name/kind/file in leafRows was already resolved in 6c8cc17 per the prior thread.

Confidence Score: 5/5

Safe to merge — targeted performance fix with correct index design, no logic regressions, and all 2,128 tests passing.

Both changes are well-scoped: the migration uses IF NOT EXISTS and is append-only, and the classifyNodeRoles refactor is structurally sound (correct map accumulation, correct early-exit guard, correct incremental scoping to allAffectedFiles). The prior review concern about unused columns in leafRows was already resolved. No blocking issues found.

No files require special attention.

Important Files Changed

Filename Overview
src/db/migrations.ts Adds migration v16 with two compound indexes on the edges table — (kind, target_id) and (kind, source_id) — ordered correctly (equality filter kind first) so SQLite can skip-scan 'calls' edges without full table scans.
src/features/structure.ts Both full and incremental classification paths now short-circuit parameter/property nodes as dead-leaf before the expensive fan-in/fan-out JOINs/subqueries; the idsByRole map correctly accumulates callable nodes that also resolve to dead-leaf into the same array, and the early-exit guard is correctly widened to rows.length === 0 && leafRows.length === 0.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[classifyNodeRoles] --> B{changedFiles?}
    B -- Yes --> C[classifyNodeRolesIncremental]
    B -- No --> D[classifyNodeRolesFull]

    D --> D1["leafRows: SELECT id WHERE kind IN ('parameter','property')"]
    D --> D2["rows: LEFT JOIN fan-in/fan-out\nWHERE kind NOT IN (...,'parameter','property')\n uses idx_edges_kind_target/source"]
    D1 --> D3["idsByRole.set('dead-leaf', leafIds)"]
    D2 --> D4["classifyRoles → roleMap"]
    D4 --> D5["rows loop: append to idsByRole\n(dead-leaf entry already exists → push)"]
    D3 --> D6["db.transaction: UPDATE nodes SET role=NULL\nthen batch UPDATE per role chunk"]
    D5 --> D6

    C --> C1["leafRows: SELECT id WHERE kind IN ('parameter','property') AND file IN (...)"]
    C --> C2["rows: correlated subqueries fan-in/fan-out\nWHERE kind NOT IN (...) AND file IN (...)\n uses idx_edges_kind_target/source"]
    C1 --> C3["idsByRole.set('dead-leaf', leafIds)"]
    C2 --> C4["classifyRoles → roleMap"]
    C4 --> C5["rows loop: append to idsByRole"]
    C3 --> C6["db.transaction: UPDATE affected files only\nthen batch UPDATE per role chunk"]
    C5 --> C6
Loading

Reviews (3): Last reviewed commit: "fix: replace corrupted UTF-8 character i..." | Re-trigger Greptile

Comment on lines +422 to +428
const leafRows = db
.prepare(
`SELECT n.id, n.name, n.kind, n.file
FROM nodes n
WHERE n.kind IN ('parameter', 'property')`,
)
.all() as { id: number; name: string; kind: string; file: string }[];
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Unused columns in full-path leafRows query

name, kind, and file are fetched here but only id is ever accessed in the processing loop below (for (const row of leafRows) leafIds.push(row.id)). The incremental path correctly fetches only SELECT n.id — the full path can match that.

Suggested change
const leafRows = db
.prepare(
`SELECT n.id, n.name, n.kind, n.file
FROM nodes n
WHERE n.kind IN ('parameter', 'property')`,
)
.all() as { id: number; name: string; kind: string; file: string }[];
const leafRows = db
.prepare(
`SELECT n.id
FROM nodes n
WHERE n.kind IN ('parameter', 'property')`,
)
.all() as { id: number }[];

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 6c8cc17. The full-path leafRows query now selects only n.id, matching the lean form already used by the incremental path.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 6c8cc17. The full-path leafRows query now selects only n.id, matching the lean form already used by the incremental path.

The full-path leafRows query selected name, kind, and file columns
that were never accessed — only id is used. Align with the incremental
path which already uses the lean SELECT n.id form.
@carlos-alm
Copy link
Copy Markdown
Contributor Author

@greptileai

@carlos-alm
Copy link
Copy Markdown
Contributor Author

Fixed the corrupted UTF-8 replacement character in the incremental classifier comment (line 600 of src/features/structure.ts) noted in the Greptile summary. Replaced with a proper em-dash .

@carlos-alm
Copy link
Copy Markdown
Contributor Author

@greptileai

@carlos-alm carlos-alm merged commit c61e825 into main Mar 26, 2026
12 checks passed
@carlos-alm carlos-alm deleted the perf/benchmark-regression-indexes branch March 26, 2026 21:36
@github-actions github-actions bot locked and limited conversation to collaborators Mar 26, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant