Skip to content

perf: database access optimizations (follow-ups from #7694) #7696

@MarkusNeusinger

Description

@MarkusNeusinger

Context

While fixing #7694 (slow NumbersStrip + slow /specs), I traced a few backend hot paths and noticed several DB-access patterns that are worth tackling in follow-ups. None are blockers on their own — #7694 already neutralized the user-visible symptom via aggregate queries on /stats, a fast-path on /plots/filter, and a startup prewarm — but each is a real performance smell and would compound if the catalog keeps growing.

Filing this as a tracking issue so the cluster doesn't get lost.

Concrete leads

1. /libraries/{library_id}/images over-fetches by ~13 MB per call

api/routers/libraries.py:71-112 loads every spec's code via SpecRepository.get_all_with_code() and then filters in Python for the one requested library. The Impl.code column is multi-MB total across all rows (see model comment at core/database/models.py:128), and we never need it on this endpoint.

Fix shape: add ImplRepository.get_preview_by_library(library_id) that selects only (spec_id, library_id, language_id, preview_url, preview_html) for that one library; replace the get_all_with_code() call. Add the same stale-while-revalidate pattern the other metadata endpoints use (it currently only has a plain set_cache/get_cache, no refresh_after).

2. /plots/filter still does get_all() even on cache hit

api/routers/plots.py:436-481 calls SpecRepository.get_all() (selectinload impls + library, hundreds of specs × dozen impls) on every cold-cache request, even ones that only differ from filter:all by one tag.

Fix shape: cache the heavy intermediate (all_images + spec_lookup + impl_lookup + global_counts) once under a stable key, and have the per-filter request only do the filtering work on top. The cache key churn (many filter combinations) currently means every new combo pays the full DB roundtrip.

3. or_counts is O(groups × images × categories) in Python

_calculate_or_counts in api/routers/plots.py:193-242 re-filters the entire image set once per active filter group, then re-aggregates 12 category counts. This is fine for ~300 specs but won't scale linearly past that, and it's all done on the request thread.

Fix shape: materialize per-spec / per-impl tag bags as denormalized rows in a small auxiliary table (or a SQL view), or precompute counts in a worker after each sync and serve from the cache.

4. get_total_code_lines scans every code blob on every cold cache

core/database/repositories.py:283-290 is SUM(LENGTH(code) - LENGTH(REPLACE(code, E'\n', ''))) across all rows. Even though /stats now uses lightweight COUNTs, this single aggregate still touches the deferred code column for every impl.

Fix shape: add a generated/computed Impl.line_count column maintained on insert/update (or persist it in the sync pipeline), so /stats is SELECT SUM(line_count).

5. Foreign-key indexes

Worth verifying with \d impls in psql: PostgreSQL does not auto-create indexes on FK columns. impls.spec_id, impls.library_id, impls.language_id, and feedback.spec_id are all queried by these columns. Missing indexes would surface as seq-scans on /specs/{id} and the new count_with_impls aggregate.

If absent, add via an Alembic migration.

6. Per-instance cache, no shared layer

The in-memory TTLCache in api/cache.py is per Cloud Run instance. The startup prewarm from #7694 covers the cold-start case, but autoscale to multiple instances still means each one starts cold until it gets its first request for each cached endpoint.

Fix shape (only if Cloud Monitoring shows it matters): Memorystore/Redis for the small-volume metadata caches (stats, libraries, languages, specs_list, specs_map). Skip for /plots/filter — payload size makes the network round-trip not worth it.

7. Connection pool tuning

core/database/connection.py uses SQLAlchemy defaults (pool_size=5, max_overflow=10). Not visibly a problem today, but worth confirming with metrics under burst load — every parallel AppDataProvider request currently fires four queries simultaneously per visitor.

Out of scope

  • /insights/* endpoints (separate hot path, separate issue if needed)
  • MCP server queries — they only run for the model integration, not for site traffic
  • Database schema redesign — keep migrations narrowly scoped to the items above

Priority suggestion

  1. (Add workflow diagram for new prototype discovery #1) /libraries/{id}/images — biggest absolute win, no risk
  2. (#5) FK indexes — cheap to verify, possibly free
  3. (Add Claude Code GitHub Workflow #2) /plots/filter cache split — bigger refactor, biggest payoff on cold catalog growth
  4. (feat: add complete automation infrastructure for plot generation #4) precomputed line_count — only matters if /stats cold-cache stays slow after the others land
  5. (Add Claude Code GitHub Workflow #3) tag denormalization — only if catalog grows past ~1k specs
  6. (Create test issue and merge via GitHub #6, #7) shared cache + pool tuning — defer until metrics demand it

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or requestinfrastructureWorkflow, backend, or frontend issue

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions