fix(docs): apply api-nr selectors to nargo-doc pages by critesjosh · Pull Request #23049 · AztecProtocol/aztec-packages

critesjosh · 2026-05-07T16:14:37Z

Summary

Follow-up to #23042. That PR restored the overall search index from 48 records back to ~12,360 records, but the underlying goal of #22861 — making Aztec.nr API pages searchable — is still not actually working: all 2,222 crawled aztec-nr-api/mainnet/... URLs emit `0 records` in the DocSearch summary.

Root cause

The docsearch-scraper resolves a URL's `selectors_key` in `abstract_strategy.py` `get_selectors_set_key()`: it walks `start_urls` in declaration order, matches each with `re.search` (a substring search, not a prefix anchor), and breaks on the first match.

Our config listed the homepage start_url first:

```json
"start_urls": [
{"url": "https://docs.aztec.network/", "page_rank": 10},
{"url": "https://docs.aztec.network/aztec-nr-api/mainnet/", "selectors_key": "api-nr", "page_rank": 2}
]
```

Because `"https://docs.aztec.network/"` is a substring of every aztec-nr-api URL, the homepage entry always matched first, so every API URL was assigned `selectors_key: "default"` and the `api-nr` selectors were never used.
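The first-match behavior described above can be reproduced with a small model. This is a simplified sketch, not the scraper's actual code: `resolve_selectors_key` is a hypothetical stand-in for `get_selectors_set_key()`, and the `index.html` page URL is made up for illustration.

```python
import re

# Stand-in for the start_urls list in the original (buggy) declaration order.
# Entries without a selectors_key are treated as "default".
START_URLS = [
    {"url": "https://docs.aztec.network/", "page_rank": 10},
    {
        "url": "https://docs.aztec.network/aztec-nr-api/mainnet/",
        "selectors_key": "api-nr",
        "page_rank": 2,
    },
]


def resolve_selectors_key(page_url, start_urls):
    """Return the selectors_key of the first start_url that matches page_url."""
    for entry in start_urls:
        # re.search matches anywhere in the string, so a shorter URL that is
        # contained in page_url wins if it is declared earlier.
        if re.search(entry["url"], page_url):
            return entry.get("selectors_key", "default")
    return "default"


# Hypothetical API page URL, used only for illustration.
api_page = "https://docs.aztec.network/aztec-nr-api/mainnet/index.html"
print(resolve_selectors_key(api_page, START_URLS))  # default
```

With the homepage entry declared first, the API page never reaches the `api-nr` entry.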

The default selectors target Docusaurus markup (`header h1`, `article p`, `menu__list ... active` XPath); none of those nodes exist on rustdoc-style nargo-doc pages, so the scraper found nothing and emitted zero records on every API page.

Fix

Swap the order so the more-specific aztec-nr-api start_url is matched first:

```json
"start_urls": [
{"url": "https://docs.aztec.network/aztec-nr-api/mainnet/", "selectors_key": "api-nr", "page_rank": 2},
{"url": "https://docs.aztec.network/", "page_rank": 10}
]
```

Now `/aztec-nr-api/mainnet/...` URLs hit the api-nr entry; everything else falls through to the homepage entry. This also matches the standard docsearch convention of listing the most-specific URL first.
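One way to guard against this ordering mistake in the future is a small check over the config. The sketch below is an illustration only: `shadowed_entries` is a hypothetical helper, not part of the scraper or of this PR, and it assumes the same first-match `re.search` behavior described in the root cause.

```python
import re


def shadowed_entries(start_urls):
    """Return (earlier, later) URL pairs where the earlier entry would
    capture the later entry's own URL under first-match re.search."""
    shadowed = []
    for i, earlier in enumerate(start_urls):
        for later in start_urls[i + 1:]:
            if re.search(earlier["url"], later["url"]):
                shadowed.append((earlier["url"], later["url"]))
    return shadowed


HOME = {"url": "https://docs.aztec.network/", "page_rank": 10}
API_NR = {
    "url": "https://docs.aztec.network/aztec-nr-api/mainnet/",
    "selectors_key": "api-nr",
    "page_rank": 2,
}

print(shadowed_entries([HOME, API_NR]))  # original order: one shadowed pair
print(shadowed_entries([API_NR, HOME]))  # fixed order: []
```

An empty result means no earlier entry can steal URLs that a later, more specific entry was meant to handle.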

Test plan

- [ ] Manually dispatch the `Docs Scraper` workflow on this branch via `workflow_dispatch`.
- [ ] Confirm a non-trivial fraction of `aztec-nr-api/mainnet/...` lines in the DocSearch summary report `> 0 records`.
- [ ] Confirm the overall `Nb hits` stays comfortably above the 5,000 threshold and ideally lands meaningfully above the previous run's 12,360.
- [ ] After merge, search docs.aztec.network for an Aztec.nr identifier (e.g. `ContractClassId`, `balance_set`, `compute_log_tag`) and confirm the API reference pages appear in results.

Follow-up to #23042: that PR fixed the indexing rate-limit problem, but every aztec-nr-api page still emitted 0 records.

Root cause: the docsearch-scraper resolves a URL's `selectors_key` by walking `start_urls` in order and matching with `re.search` (substring), breaking on the first match. With the homepage URL listed first, every aztec-nr-api URL matched it (since "https://docs.aztec.network/" is a substring of every aztec-nr-api URL) and was assigned the default selectors. The default selectors target Docusaurus-only markup (`header h1`, `article p`, `menu__list ... active` XPath), none of which exists on rustdoc-style nargo-doc pages, so the scraper found no nodes and emitted no records.

Fix: list the more-specific aztec-nr-api start_url first so it wins the `selectors_key` match for those URLs. The homepage start_url then serves as the catch-all for everything else.

Reference: `scraper/src/strategies/abstract_strategy.py` `get_selectors_set_key()` iterates `start_urls` in declaration order and breaks on the first `re.search` hit.

AztecBot · 2026-05-07T16:54:14Z

Flakey Tests

🤖 says: This CI run detected 1 tests that failed, but were tolerated due to a .test_patterns.yml entry.

FLAKED (http://ci.aztec-labs.com/dad455c5988b15ad): yarn-project/kv-store/scripts/run_test.sh src/indexeddb/multi_map.test.ts (6s) (code: 0)

Despite the previous fixes (AztecProtocol#23042, AztecProtocol#23049) restoring 14,773 records from 2,222 aztec-nr-api pages, those records still don't surface in the docs site search.

Root cause: docusaurus-theme-search-typesense ANDs `language:=en && docusaurus_tag:=[<plugin-context-tag>]` into every search query. The api-nr records lack both attributes because the typesense-docsearch-scraper only stamps them onto records scraped from a docusaurus-rendered page (via `<meta name="docsearch:...">` tags); rustdoc-style nargo-doc pages don't emit those metas.

Two changes:

1. Add `extra_attributes` on the api-nr start_url so each api-nr record gets `language: "en"` and a `docusaurus_tag` array spanning every plugin context (`docs-developer-v4.2.0`, `docs-network-v4.2.0`, `docs-participate-current`, `docs-root-current`, `default`). Typesense's `:=[...]` array-match then succeeds in any context the user is searching from.
2. Add `field_definitions` to `custom_settings`. The typesense-docsearch-scraper's default schema declares the wildcard `.*_tag` as `string`, so passing an array for `docusaurus_tag` would be rejected at import. `field_definitions` overrides the wildcard for `docusaurus_tag` specifically with `string*`, which accepts both a single string (existing docusaurus records, set from a meta tag) and an array (api-nr records). The scraper replaces the entire default schema when `field_definitions` is set, so the full default field list is reproduced verbatim with only the `docusaurus_tag` entry inserted before the `.*_tag` wildcard.

Note: the version-specific tag values (`v4.2.0`) need to be updated when mainnet/testnet bump versions.

Future improvement: derive these from `developer_version_config.json` and `network_version_config.json` at scrape time.

…ztecProtocol#23058)

## Summary

Third in the series fixing search after AztecProtocol#22861. Previous PRs (AztecProtocol#23042, AztecProtocol#23049) successfully indexed 14,773 records from 2,222 aztec-nr-api pages, but **users still don't see those records in the dropdown search**.

## Root cause

The `docusaurus-theme-search-typesense` package ANDs a contextual filter into every search query:

```ts
// docusaurus-theme-search-typesense/src/client/useTypesenseContextualFacetFilters.ts
const languageFilter = `language:=${locale}`;
const tagsFilter = `docusaurus_tag:=[${tags.join(',')}]`;
return [languageFilter, tagsFilter].filter(Boolean).join(' && ');
```

The api-nr records have neither `language` nor `docusaurus_tag` set, because the typesense-docsearch-scraper only stamps those onto records scraped from docusaurus pages (it reads `<meta name="docsearch:docusaurus_tag" content="...">` tags). Rustdoc-style nargo-doc pages don't emit those metas, so every api-nr record is missing the fields the theme filters on and is filtered out of every dropdown query.

## Fix

Three coordinated changes:

### 1. `extra_attributes` on the api-nr start_url (`docs/typesense.config.json`)

Stamp every api-nr record with the attributes the theme expects:

```json
"extra_attributes": {
  "language": "en",
  "docusaurus_tag": [
    "docs-participate-current",
    "docs-root-current",
    "default"
  ]
}
```

These cover the three unversioned plugin contexts. The two versioned ones (`docs-developer-${version}` and `docs-network-${version}`) are appended dynamically by the workflow (see change 3 below) so the static config doesn't go stale on version bumps.

Typesense's `docusaurus_tag:=[<context-tag>]` matches if the record's array contains the context tag, so the api-nr records will satisfy the filter from any plugin context.

### 2. `field_definitions` schema override (`docs/typesense.config.json`)

The scraper's default schema (`scraper/src/typesense_helper.py` v0.11.0) declares the wildcard `.*_tag` as `string`, so sending an array for `docusaurus_tag` would be rejected at import time. `field_definitions` overrides this, but it REPLACES the entire default schema rather than merging, so the full default field list is reproduced verbatim with one targeted change: `docusaurus_tag` is added with type `string*` (accepts both string and array) before the `.*_tag` wildcard.

Existing docusaurus records continue to work because they pass `docusaurus_tag` as a single string from a meta tag, and `string*` accepts that too.

### 3. Derive versioned tags at scrape time (`.github/workflows/docs-typesense.yml`)

Read `developer_version_config.json` and `network_version_config.json`, build the `docs-developer-${mainnet}`, `docs-developer-${testnet}`, `docs-network-${mainnet}`, `docs-network-${testnet}` strings (dropping empty values and duplicates), and use `jq` to append them to the api-nr start_url's `docusaurus_tag` array before passing the config to the scraper. This way the static JSON never holds version-specific values that need manual updating.

The workflow run also switches to `set -euo pipefail` so a `jq` derivation failure aborts the run rather than feeding an empty config to docker.

## Caveats

- Existing 14,773 api-nr records in the production collection are stale until the next scraper run rewrites them. The scraper alias-swaps to a fresh collection on each run, so no manual purge is needed.

## Test plan

- [ ] Manually dispatch `Docs Scraper` workflow on this branch via `workflow_dispatch`.
- [ ] Confirm scraper run reports `Nb hits` ≈ 27,000 (no regression in record count).
- [ ] Confirm no schema-validation errors in the run log.
- [ ] Confirm the workflow log echoes the derived `docusaurus_tag` values matching the current docs versions (e.g. `docs-developer-v4.2.0`, `docs-network-v4.2.0`).
- [ ] After merge, search docs.aztec.network from the homepage, /developers/, and /operate/ for an Aztec.nr identifier (e.g. `ContractClassId`, `balance_set`, `compute_log_tag`) and confirm API reference pages appear in the dropdown in all three contexts.
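The tag derivation in change 3 is done with `jq` in the workflow; the same logic can be sketched in Python to make the empty-value and duplicate handling concrete. This is an illustrative sketch only: the `mainnet` and `testnet` key names inside the two version config files are an assumption based on the `${mainnet}`/`${testnet}` placeholders in the description, and both function names are hypothetical.

```python
def derive_versioned_tags(developer_versions, network_versions):
    """Build docs-<plugin>-<version> tags, skipping empty values and duplicates."""
    tags = []
    for plugin, versions in (("developer", developer_versions),
                             ("network", network_versions)):
        # Assumed key names; the real config files may be shaped differently.
        for channel in ("mainnet", "testnet"):
            version = versions.get(channel) or ""
            tag = f"docs-{plugin}-{version}"
            if version and tag not in tags:
                tags.append(tag)
    return tags


def append_tags(config, selectors_key, extra_tags):
    """Append tags to the docusaurus_tag array of the matching start_url."""
    for entry in config["start_urls"]:
        if entry.get("selectors_key") == selectors_key:
            current = entry["extra_attributes"]["docusaurus_tag"]
            new_tags = [t for t in extra_tags if t not in current]
            current.extend(new_tags)
    return config


derived = derive_versioned_tags(
    {"mainnet": "v4.2.0", "testnet": "v4.2.0"},  # duplicate collapses to one tag
    {"mainnet": "v4.2.0", "testnet": ""},        # empty value is dropped
)
print(derived)  # ['docs-developer-v4.2.0', 'docs-network-v4.2.0']
```

Failing loudly when the derived list is empty, as the `set -euo pipefail` change does for the `jq` version, would be a natural addition to a sketch like this.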

…tecProtocol#23097)

## Summary

Fourth in the series fixing search after AztecProtocol#22861. After AztecProtocol#23058 merged, the production index still has **0 records under `aztec-nr-api/mainnet/...`**. Confirmed by querying the live Typesense collection directly (`filter_by:url:=https://docs.aztec.network/aztec-nr-api/mainnet/*` returns `found: 0`) and by inspecting the most recent scraper run logs.

## Root cause

The schema override added by AztecProtocol#23058 doesn't take effect. Every api-nr document import is rejected by Typesense with HTTP 400 ``'Field `docusaurus_tag` must be a string.'``, even though `custom_settings.field_definitions` lists an explicit `{ "name": "docusaurus_tag", "type": "string*" }` ahead of the wildcard `.*_tag: string`. Per Typesense docs an explicit field should win over a regex pattern field, but in practice the wildcard's `string` type appears to be what's enforced.

The CI guard from AztecProtocol#23042 (`MIN_HITS=5000`) didn't trip because the ~12k non-api docs still passed.

## Fix

AztecProtocol#23058 over-engineered the solution. Reading the docusaurus theme:

```ts
// docusaurus-theme-common/src/utils/searchUtils.ts
export const DEFAULT_SEARCH_TAG = 'default';
```

```ts
// docusaurus-theme-common/src/index.ts
const tags = [DEFAULT_SEARCH_TAG, ...docsTags];
return {locale: i18n.currentLocale, tags};
```

…the theme unconditionally prepends `'default'` to the `docusaurus_tag` filter on every dropdown query, in every plugin context. So api-nr records only need the single scalar value `"default"` to satisfy the filter from anywhere on the docs site. No array, no schema surgery, no version-specific tag derivation.

Three changes:

### 1. `docs/typesense.config.json`

Drop the `custom_settings.field_definitions` override entirely (the scraper's default schema with `.*_tag: string` accepts scalar string values cleanly), and collapse the api-nr `extra_attributes.docusaurus_tag` to scalar `"default"`.

### 2. `.github/workflows/docs-typesense.yml`: remove jq mutation

The jq block that derived versioned tags is no longer needed. The scraper now reads `docs/typesense.config.json` verbatim.

### 3. `.github/workflows/docs-typesense.yml`: log api-nr visibility post-index

After the scraper completes its alias swap, curl the live `aztec-docs` alias for `docusaurus_tag:=[default]&&language:=en` and log the count. No existing docusaurus page carries the `"default"` tag (each is stamped with its plugin-context tag, e.g. `docs-developer-v4.2.0`, from the `<meta name="docsearch:docusaurus_tag">` tag), so this count is effectively the count of indexed api-nr records, and the filter mirrors what the theme actually sends. Informational only; not gated by a threshold.

## Behavior change

api-nr records will now appear in the search dropdown from every plugin context (developer, network, root, participate) and every doc version (mainnet, testnet, nightly), because they're stamped with the always-prepended `"default"` tag rather than version-specific tags. Today we only generate `aztec-nr-api/mainnet/`, so a user browsing testnet developer docs would see mainnet aztec-nr API links in their dropdown. This is probably desirable (an aztec-nr API symbol is the same regardless of which doc version you're reading), but it is a behavior change vs the (non-functional) AztecProtocol#23058 attempt.

## Caveat

api-nr visibility now depends on the docusaurus theme's `DEFAULT_SEARCH_TAG = 'default'` invariant. If a future caller ever issues a search query that doesn't include `'default'` in the tag list (e.g. a custom search page bypassing `useContextualSearchFilters`), api-nr records would silently disappear from that surface.

## Test plan

- [ ] Manually dispatch `Docs Scraper` workflow via `workflow_dispatch` on this branch.
- [ ] Confirm the run logs `Indexed N records (threshold: 5000)` with N >> 5000.
- [ ] Confirm the run logs `api-nr records visible under docusaurus_tag:=[default]: M` with M well above zero (AztecProtocol#23049 indexed 14,773 api-nr records before the schema rejection started silently dropping them, so we expect a similar count).
- [ ] Confirm no ``'Field `docusaurus_tag` must be a string.'`` 400s in the scraper output.
- [ ] After merge, search docs.aztec.network from the homepage, /developers/, /network/, and /participate/ for an Aztec.nr identifier (e.g. `ContractClassId`, `balance_set`, `compute_log_tag`, `address_note`) and confirm API reference pages appear in the dropdown in all four contexts.
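The reasoning behind the scalar `"default"` tag can be checked with a small model of the contextual filter. This is a rough sketch under stated assumptions, not the theme's or Typesense's actual code: `passes_filter` approximates `language:=<locale> && docusaurus_tag:=[<tags>]` as "language equal and at least one tag in common", and each context is simplified to a single docs tag (the real theme may include tags from several plugins at once).

```python
DEFAULT_SEARCH_TAG = "default"


def contextual_tags(docs_tags):
    """Tags the theme filters on: 'default' first, then the context's docs tags."""
    return [DEFAULT_SEARCH_TAG, *docs_tags]


def passes_filter(record, locale, tags):
    """Approximate model of `language:=<locale> && docusaurus_tag:=[<tags>]`."""
    if record.get("language") != locale:
        return False
    value = record.get("docusaurus_tag")
    if value is None:
        return False
    record_tags = value if isinstance(value, list) else [value]
    return any(tag in tags for tag in record_tags)


# Simplified contexts: one docs tag each (tag names taken from the PR text).
CONTEXTS = {
    "developer": ["docs-developer-v4.2.0"],
    "network": ["docs-network-v4.2.0"],
    "root": ["docs-root-current"],
    "participate": ["docs-participate-current"],
}

untagged = {}  # api-nr record before AztecProtocol#23058
scalar_default = {"language": "en", "docusaurus_tag": "default"}
versioned = {"language": "en", "docusaurus_tag": "docs-developer-v4.2.0"}

for name, docs_tags in CONTEXTS.items():
    tags = contextual_tags(docs_tags)
    print(name,
          passes_filter(untagged, "en", tags),
          passes_filter(scalar_default, "en", tags),
          passes_filter(versioned, "en", tags))
```

In this model the untagged record is filtered out everywhere, the scalar `"default"` record passes in all four contexts, and the version-specific record passes only in the developer context, which is the behavior change described above.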

critesjosh requested review from alejoamiras and ciaranightingale May 7, 2026 16:22

alejoamiras approved these changes May 7, 2026

critesjosh added this pull request to the merge queue May 7, 2026

Merged via the queue into next with commit 5951e05 May 7, 2026
22 checks passed

critesjosh deleted the josh/fix-typesense-api-selectors branch May 7, 2026 17:07

This was referenced May 7, 2026

fix(docs): tag aztec-nr-api records so they pass the search filter #23058

Merged

fix(docs): use scalar docusaurus_tag "default" for api-nr records #23097

Merged

critesjosh merged 1 commit into next from josh/fix-typesense-api-selectors
