fix(docs): apply api-nr selectors to nargo-doc pages#23049

Merged
critesjosh merged 1 commit into next from josh/fix-typesense-api-selectors
May 7, 2026

@critesjosh

Contributor

Summary

Follow-up to #23042. That PR restored the overall search index from 48 records back to ~12,360 records, but the underlying goal of #22861 — making Aztec.nr API pages searchable — is still not actually working: all 2,222 crawled aztec-nr-api/mainnet/... URLs emit `0 records` in the DocSearch summary.

Root cause

The docsearch-scraper resolves a URL's `selectors_key` in `abstract_strategy.py` `get_selectors_set_key()`: it walks `start_urls` in declaration order, matches each with `re.search` (a substring search, not a prefix anchor), and breaks on the first match.

Our config listed the homepage start_url first:

```json
"start_urls": [
{"url": "https://docs.aztec.network/", "page_rank": 10},
{"url": "https://docs.aztec.network/aztec-nr-api/mainnet/", "selectors_key": "api-nr", "page_rank": 2}
]
```

Because `"https://docs.aztec.network/"` is a substring of every aztec-nr-api URL, the homepage entry always matched first, so every API URL was assigned `selectors_key: "default"` and the `api-nr` selectors were never used.

The default selectors target Docusaurus markup (`header h1`, `article p`, `menu__list ... active` XPath); none of those nodes exist on rustdoc-style nargo-doc pages, so the scraper found nothing and emitted zero records on every API page.

Fix

Swap the order so the more-specific aztec-nr-api start_url is matched first:

```json
"start_urls": [
{"url": "https://docs.aztec.network/aztec-nr-api/mainnet/", "selectors_key": "api-nr", "page_rank": 2},
{"url": "https://docs.aztec.network/", "page_rank": 10}
]
```

Now `/aztec-nr-api/mainnet/...` URLs hit the api-nr entry; everything else falls through to the homepage entry. This also matches the standard docsearch convention of listing the most-specific URL first.
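The first-match behaviour described above can be reproduced in a few lines. This is a minimal sketch, not the scraper's actual code: `get_selectors_key` is a hypothetical stand-in for `get_selectors_set_key()`, and the two page URLs are made up for illustration. It only assumes what the description states, namely that entries are walked in declaration order and matched with `re.search`.

```python
import re

def get_selectors_key(url, start_urls):
    """Hypothetical stand-in: return the selectors_key of the first matching start_url."""
    for entry in start_urls:
        # re.search matches anywhere in the string, so a shorter URL
        # matches every longer URL that contains it.
        if re.search(entry["url"], url):
            return entry.get("selectors_key", "default")
    return "default"

HOME = {"url": "https://docs.aztec.network/", "page_rank": 10}
API = {
    "url": "https://docs.aztec.network/aztec-nr-api/mainnet/",
    "selectors_key": "api-nr",
    "page_rank": 2,
}

# Illustrative URLs (assumptions, not taken from the crawl log).
api_page = "https://docs.aztec.network/aztec-nr-api/mainnet/index.html"
guide_page = "https://docs.aztec.network/developers/"

old_order_key = get_selectors_key(api_page, [HOME, API])  # homepage entry wins
new_order_key = get_selectors_key(api_page, [API, HOME])  # api-nr entry wins
guide_key = get_selectors_key(guide_page, [API, HOME])    # falls through to homepage
print(old_order_key, new_order_key, guide_key)
```

With the old order the API page resolves to `default`; with the new order it resolves to `api-nr`, while non-API pages still fall through to the homepage entry.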

Test plan

  • Manually dispatch the `Docs Scraper` workflow on this branch via `workflow_dispatch`. Confirm a non-trivial fraction of `aztec-nr-api/mainnet/...` lines in the DocSearch summary report `> 0 records`.
  • Confirm the overall `Nb hits` stays comfortably above the 5,000 threshold and ideally lands meaningfully above the previous run's 12,360.
  • After merge, search docs.aztec.network for an Aztec.nr identifier (e.g. `ContractClassId`, `balance_set`, `compute_log_tag`) and confirm the API reference pages appear in results.

The follow-up to #23042: that PR fixed the indexing rate-limit problem
but every aztec-nr-api page still emitted 0 records. Root cause: the
docsearch-scraper resolves a URL's selectors_key by walking start_urls
in order and matching with `re.search` (substring), breaking on first
match. With the homepage URL listed first, every aztec-nr-api URL
matched it (since "https://docs.aztec.network/" is a substring of every
aztec-nr-api URL) and was assigned the default selectors. The default
selectors target Docusaurus-only markup (`header h1`, `article p`,
`menu__list ... active` XPath), none of which exist on rustdoc-style
nargo-doc pages, so the scraper found no nodes and emitted no records.

Fix: list the more-specific aztec-nr-api start_url first so it wins
the selectors_key match for those URLs. The homepage start_url then
serves as the catch-all for everything else.

Reference: scraper/src/strategies/abstract_strategy.py
get_selectors_set_key() iterates start_urls in declaration order and
breaks on the first re.search hit.
@critesjosh critesjosh added this pull request to the merge queue May 7, 2026
@AztecBot

AztecBot commented May 7, 2026

Collaborator

Flakey Tests

🤖 says: This CI run detected 1 tests that failed, but were tolerated due to a .test_patterns.yml entry.

FLAKED (http://ci.aztec-labs.com/dad455c5988b15ad): yarn-project/kv-store/scripts/run_test.sh src/indexeddb/multi_map.test.ts (6s) (code: 0)

Merged via the queue into next with commit 5951e05 May 7, 2026
22 checks passed
@critesjosh critesjosh deleted the josh/fix-typesense-api-selectors branch May 7, 2026 17:07
rangozd pushed a commit to rangozd/aztec-packages that referenced this pull request May 16, 2026
Despite the previous fixes (AztecProtocol#23042, AztecProtocol#23049) restoring 14,773 records
from 2,222 aztec-nr-api pages, those records still don't surface in
the docs site search. Root cause: docusaurus-theme-search-typesense
ANDs `language:=en && docusaurus_tag:=[<plugin-context-tag>]` into
every search query. The api-nr records lack both attributes because
the typesense-docsearch-scraper only stamps them onto records scraped
from a docusaurus-rendered page (via `<meta name="docsearch:...">`
tags); rustdoc-style nargo-doc pages don't emit those metas.

Two changes:

1. Add `extra_attributes` on the api-nr start_url so each api-nr
   record gets `language: "en"` and a `docusaurus_tag` array spanning
   every plugin context (`docs-developer-v4.2.0`,
   `docs-network-v4.2.0`, `docs-participate-current`,
   `docs-root-current`, `default`). Typesense's `:=[...]` array-match
   then succeeds in any context the user is searching from.

2. Add `field_definitions` to `custom_settings`. The
   typesense-docsearch-scraper's default schema declares the wildcard
   `.*_tag` as `string`, so passing an array for `docusaurus_tag`
   would be rejected at import. `field_definitions` overrides the
   wildcard for `docusaurus_tag` specifically with `string*`, which
   accepts both a single string (existing docusaurus records, set
   from a meta tag) and an array (api-nr records). The scraper
   replaces the entire default schema when `field_definitions` is
   set, so the full default field list is reproduced verbatim with
   only the `docusaurus_tag` entry inserted before the `.*_tag`
   wildcard.

Note: the version-specific tag values (`v4.2.0`) need to be updated
when mainnet/testnet bump versions. Future improvement: derive these
from `developer_version_config.json` and `network_version_config.json`
at scrape time.
rangozd pushed a commit to rangozd/aztec-packages that referenced this pull request May 16, 2026
…ztecProtocol#23058)

## Summary

Third in the series fixing search after AztecProtocol#22861. Previous PRs (AztecProtocol#23042,
AztecProtocol#23049) successfully indexed 14,773 records from 2,222 aztec-nr-api
pages, but **users still don't see those records in the dropdown
search**.

## Root cause

The `docusaurus-theme-search-typesense` package ANDs a contextual filter
into every search query:

```ts
// docusaurus-theme-search-typesense/src/client/useTypesenseContextualFacetFilters.ts
const languageFilter = `language:=${locale}`;
const tagsFilter = `docusaurus_tag:=[${tags.join(',')}]`;
return [languageFilter, tagsFilter].filter(Boolean).join(' && ');
```

The api-nr records have neither `language` nor `docusaurus_tag` set,
because the typesense-docsearch-scraper only stamps those onto records
scraped from docusaurus pages (it reads `<meta
name="docsearch:docusaurus_tag" content="...">` tags). Rustdoc-style
nargo-doc pages don't emit those metas, so every api-nr record lacks
the fields the theme filters on and is excluded from every dropdown
query.
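To make the failure mode concrete, here is a small sketch that builds the same filter string and applies its semantics to two records. It is an illustration only: `contextual_filter` and `passes_filter` are hypothetical helpers that mimic the `:=` and `&&` behaviour described above, not Typesense's implementation, and the record URLs are made up.

```python
def contextual_filter(locale, tags):
    """Build the filter string in the shape the theme sends."""
    return f"language:={locale} && docusaurus_tag:=[{','.join(tags)}]"

def passes_filter(record, locale, tags):
    """Mimic the filter: both clauses must hold (hypothetical helper)."""
    tag_value = record.get("docusaurus_tag")
    # A record may carry a single tag or a list of tags; normalise to a list.
    tag_values = tag_value if isinstance(tag_value, list) else [tag_value]
    language_ok = record.get("language") == locale
    tag_ok = any(value in tags for value in tag_values)
    return language_ok and tag_ok

tags = ["default", "docs-developer-v4.2.0"]
filter_string = contextual_filter("en", tags)

# A docusaurus-rendered page: the scraper stamped both attributes from meta tags.
docusaurus_record = {
    "url": "https://docs.aztec.network/developers/",  # illustrative URL
    "language": "en",
    "docusaurus_tag": "docs-developer-v4.2.0",
}
# A nargo-doc page: no docsearch metas, so neither attribute exists.
api_nr_record = {"url": "https://docs.aztec.network/aztec-nr-api/mainnet/index.html"}

docusaurus_visible = passes_filter(docusaurus_record, "en", tags)
api_nr_visible = passes_filter(api_nr_record, "en", tags)
print(filter_string, docusaurus_visible, api_nr_visible)
```

The api-nr record fails both clauses, which is why it never reaches the dropdown even though it is present in the index.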

## Fix

Three coordinated changes:

### 1. `extra_attributes` on the api-nr start_url
(`docs/typesense.config.json`)

Stamp every api-nr record with the attributes the theme expects:

```json
"extra_attributes": {
  "language": "en",
  "docusaurus_tag": [
    "docs-participate-current",
    "docs-root-current",
    "default"
  ]
}
```

These cover the three unversioned plugin contexts. The two versioned
ones (`docs-developer-${version}` and `docs-network-${version}`) are
appended dynamically by the workflow (see change 3 below) so the static
config doesn't go stale on version bumps.

Typesense's `docusaurus_tag:=[<context-tag>]` matches if the record's
array contains the context tag, so the api-nr records will satisfy the
filter from any plugin context.

### 2. `field_definitions` schema override
(`docs/typesense.config.json`)

The scraper's default schema (`scraper/src/typesense_helper.py` v0.11.0)
declares the wildcard `.*_tag` as `string`, so sending an array for
`docusaurus_tag` would be rejected at import time. `field_definitions`
overrides this — but it REPLACES the entire default schema rather than
merging, so the full default field list is reproduced verbatim with one
targeted change: `docusaurus_tag` is added with type `string*` (accepts
both string and array) before the `.*_tag` wildcard. Existing docusaurus
records continue to work because they pass `docusaurus_tag` as a single
string from a meta tag, and `string*` accepts that too.

### 3. Derive versioned tags at scrape time
(`.github/workflows/docs-typesense.yml`)

Read `developer_version_config.json` and `network_version_config.json`,
build the `docs-developer-${mainnet}`, `docs-developer-${testnet}`,
`docs-network-${mainnet}`, `docs-network-${testnet}` strings (dropping
empty/duplicates), and use `jq` to append them to the api-nr start_url's
`docusaurus_tag` array before passing the config to the scraper. This
way the static JSON never holds version-specific values that need manual
updating.

The workflow run also switches to `set -euo pipefail` so a `jq`
derivation failure aborts the run rather than feeding an empty config to
docker.
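The derivation step can be sketched in Python for readers who want to check the logic outside the workflow. This is not the workflow's `jq` code. It assumes, without confirmation from the source, that each version config is a JSON object with `mainnet` and `testnet` keys holding version strings; `versioned_tags` and `append_tags` are hypothetical helper names.

```python
def versioned_tags(developer_cfg, network_cfg):
    """Build docs-<plugin>-<version> tags, dropping empty versions and duplicates."""
    tags = []
    for prefix, cfg in (("docs-developer", developer_cfg), ("docs-network", network_cfg)):
        for network in ("mainnet", "testnet"):  # assumed key names
            version = cfg.get(network)
            if not version:
                continue  # skip missing or empty versions
            tag = f"{prefix}-{version}"
            if tag not in tags:
                tags.append(tag)
    return tags

def append_tags(config, tags, selectors_key="api-nr"):
    """Append tags to the docusaurus_tag array of the matching start_url."""
    for entry in config["start_urls"]:
        if entry.get("selectors_key") != selectors_key:
            continue
        attrs = entry.setdefault("extra_attributes", {})
        existing = attrs.setdefault("docusaurus_tag", [])
        for tag in tags:
            if tag not in existing:
                existing.append(tag)
    return config

# Sample inputs (assumed shape; the real files may differ).
developer_cfg = {"mainnet": "v4.2.0", "testnet": "v4.2.0"}
network_cfg = {"mainnet": "v4.2.0", "testnet": ""}

config = {
    "start_urls": [
        {
            "url": "https://docs.aztec.network/aztec-nr-api/mainnet/",
            "selectors_key": "api-nr",
            "extra_attributes": {
                "language": "en",
                "docusaurus_tag": ["docs-participate-current", "docs-root-current", "default"],
            },
        },
        {"url": "https://docs.aztec.network/", "page_rank": 10},
    ]
}

derived = versioned_tags(developer_cfg, network_cfg)
merged = append_tags(config, derived)["start_urls"][0]["extra_attributes"]["docusaurus_tag"]
print(derived, merged)
```

If mainnet and testnet share a version, the duplicate tag collapses to one entry, and an empty version produces no tag at all.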

## Caveats

- Existing 14,773 api-nr records in the production collection are stale
until the next scraper run rewrites them. The scraper alias-swaps to a
fresh collection on each run, so no manual purge is needed.

## Test plan

- [ ] Manually dispatch `Docs Scraper` workflow on this branch via
`workflow_dispatch`.
- [ ] Confirm scraper run reports `Nb hits` ≈ 27,000 (no regression in
record count).
- [ ] Confirm no schema-validation errors in the run log.
- [ ] Confirm the workflow log echoes the derived `docusaurus_tag`
values matching the current docs versions (e.g. `docs-developer-v4.2.0`,
`docs-network-v4.2.0`).
- [ ] After merge, search docs.aztec.network from the homepage,
/developers/, and /operate/ for an Aztec.nr identifier (e.g.
`ContractClassId`, `balance_set`, `compute_log_tag`) and confirm API
reference pages appear in the dropdown in all three contexts.
rangozd pushed a commit to rangozd/aztec-packages that referenced this pull request May 16, 2026
…tecProtocol#23097)

## Summary

Fourth in the series fixing search after AztecProtocol#22861. After AztecProtocol#23058 merged,
the production index still has **0 records under
`aztec-nr-api/mainnet/...`**. Confirmed by querying the live Typesense
collection directly
(`filter_by:url:=https://docs.aztec.network/aztec-nr-api/mainnet/*`
returns `found: 0`) and by inspecting the most recent scraper run logs.

## Root cause

The schema override added by AztecProtocol#23058 doesn't take effect. Every api-nr
document import is rejected by Typesense with HTTP 400 ``'Field
`docusaurus_tag` must be a string.'``, even though
`custom_settings.field_definitions` lists an explicit `{ "name":
"docusaurus_tag", "type": "string*" }` ahead of the wildcard
`.*_tag: string`. Per Typesense docs an explicit field should win over a
regex pattern field, but in practice the wildcard's `string` type
appears to be what's enforced. The CI guard from AztecProtocol#23042
(`MIN_HITS=5000`) didn't trip because the ~12k non-api docs still
passed.

## Fix

AztecProtocol#23058 over-engineered the solution. Reading the docusaurus theme:

```ts
// docusaurus-theme-common/src/utils/searchUtils.ts
export const DEFAULT_SEARCH_TAG = 'default';
```

```ts
// docusaurus-theme-common/src/index.ts
const tags = [DEFAULT_SEARCH_TAG, ...docsTags];
return {locale: i18n.currentLocale, tags};
```

…the theme unconditionally prepends `'default'` to the `docusaurus_tag`
filter on every dropdown query, in every plugin context. So api-nr
records only need the single scalar value `"default"` to satisfy the
filter from anywhere on the docs site. No array, no schema surgery, no
version-specific tag derivation.
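A short sketch of why a single scalar tag is enough. The plugin-context tag values are the ones quoted elsewhere in this series; `context_tags` and `tag_matches` are hypothetical helpers that mirror the prepend and the `:=[...]` match, not the theme's or Typesense's code.

```python
DEFAULT_SEARCH_TAG = "default"

def context_tags(docs_tags):
    """Mirror the theme: 'default' is always prepended to the context's tags."""
    return [DEFAULT_SEARCH_TAG, *docs_tags]

def tag_matches(record_tag, query_tags):
    """A scalar record tag satisfies the filter if it is one of the query's tags."""
    return record_tag in query_tags

# One entry per plugin context, using tag values quoted in this series.
contexts = {
    "developer": ["docs-developer-v4.2.0"],
    "network": ["docs-network-v4.2.0"],
    "participate": ["docs-participate-current"],
    "root": ["docs-root-current"],
}

# A record tagged "default" is visible from every context...
default_visibility = {
    name: tag_matches("default", context_tags(tags)) for name, tags in contexts.items()
}
# ...while a record with a context-specific tag is visible only from its own context.
developer_visibility = {
    name: tag_matches("docs-developer-v4.2.0", context_tags(tags))
    for name, tags in contexts.items()
}
print(default_visibility, developer_visibility)
```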

Three changes:

### 1. `docs/typesense.config.json`

Drop the `custom_settings.field_definitions` override entirely (the
scraper's default schema with `.*_tag: string` accepts scalar string
values cleanly), and collapse the api-nr
`extra_attributes.docusaurus_tag` to scalar `"default"`.

### 2. `.github/workflows/docs-typesense.yml` — remove jq mutation

The jq block that derived versioned tags is no longer needed. The
scraper now reads `docs/typesense.config.json` verbatim.

### 3. `.github/workflows/docs-typesense.yml` — log api-nr visibility
post-index

After the scraper completes its alias swap, curl the live `aztec-docs`
alias for `docusaurus_tag:=[default]&&language:=en` and log the count.
No existing docusaurus page carries the `"default"` tag (each is
stamped with its plugin-context tag, e.g. `docs-developer-v4.2.0`, from
the `<meta name="docsearch:docusaurus_tag">` tag), so this count is
effectively the count of indexed api-nr records, and the filter mirrors
what the theme actually sends. Informational only; not gated by a
threshold.
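For reference, the visibility check could look roughly like this in Python. This is a sketch under stated assumptions, not the workflow's curl command: the host is a placeholder, `q=*` with `per_page=1` is one plausible way to ask for only the count, and `visibility_query_url` and `found_count` are hypothetical helpers. The `found` field is the one the Typesense search response reports. No request is sent here; the sample body is made up.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

FILTER = "docusaurus_tag:=[default]&&language:=en"

def visibility_query_url(host, collection="aztec-docs"):
    """Build a search URL against the alias (host is a placeholder assumption)."""
    params = {"q": "*", "filter_by": FILTER, "per_page": 1}
    return f"{host}/collections/{collection}/documents/search?{urlencode(params)}"

def found_count(response_body):
    """Extract the match count from a Typesense search response body."""
    return json.loads(response_body)["found"]

url = visibility_query_url("https://typesense.example.com")  # hypothetical host
# A real request would also need the X-TYPESENSE-API-KEY header.

# Round-trip the query string to confirm the filter survives URL encoding.
decoded_filter = parse_qs(urlsplit(url).query)["filter_by"][0]

sample_body = '{"found": 3, "hits": []}'  # made-up response for illustration
count = found_count(sample_body)
print(f"api-nr records visible under docusaurus_tag:=[default]: {count}")
```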

## Behavior change

api-nr records will now appear in the search dropdown from every plugin
context (developer, network, root, participate) and every doc version
(mainnet, testnet, nightly), because they're stamped with the
always-prepended `"default"` tag rather than version-specific tags.
Today we only generate `aztec-nr-api/mainnet/`, so a user browsing
testnet developer docs would see mainnet aztec-nr API links in their
dropdown. This is probably desirable (an aztec-nr API symbol is the same
regardless of which doc version you're reading), but it is a behavior
change vs the (non-functional) AztecProtocol#23058 attempt.

## Caveat

api-nr visibility now depends on the docusaurus theme's
`DEFAULT_SEARCH_TAG = 'default'` invariant. If a future caller ever
issues a search query that doesn't include `'default'` in the tag list
(e.g. a custom search page bypassing `useContextualSearchFilters`),
api-nr records would silently disappear from that surface.

## Test plan

- [ ] Manually dispatch `Docs Scraper` workflow via `workflow_dispatch`
on this branch.
- [ ] Confirm the run logs `Indexed N records (threshold: 5000)` with N
well above 5000.
- [ ] Confirm the run logs `api-nr records visible under
docusaurus_tag:=[default]: M` with M well above zero (AztecProtocol#23049 indexed
14,773 api-nr records before the schema rejection started silently
dropping them, so we expect a similar count).
- [ ] Confirm no ``'Field `docusaurus_tag` must be a string.'`` 400s in
the scraper output.
- [ ] After merge, search docs.aztec.network from the homepage,
/developers/, /network/, and /participate/ for an Aztec.nr identifier
(e.g. `ContractClassId`, `balance_set`, `compute_log_tag`,
`address_note`) and confirm API reference pages appear in the dropdown
in all four contexts.