llms.txt aggregate walker only descends one level, undercounts deeply-nested indexes #57

@SahilAujla

Context

The aggregate .txt walker in src/helpers/get-page-urls.ts (walkAggregateLinks) descends only one level into nested llms.txt indexes. Any .txt link found inside a sub-index (depth 2 and deeper) is explicitly filtered out; see the “skip further .txt nesting” comment in the code.

This works for two-level patterns (Cloudflare per-product files, Supabase aggregate content files) but undercounts sites that use deeper progressive disclosure — which the spec encourages for large sites that would otherwise exceed llms-txt-size.

The visible symptom is llms-txt-freshness reporting very low coverage on sites whose llms.txt is actually exhaustive — the walker just isn’t reaching the leaves.

Concrete example

Site: alchemy.com/docs

Three-level structure (correctly organized per spec — sections split because a unified file would exceed llms-txt-size):

/docs/llms.txt                           # 6 section links, all .txt
  └─ /docs/get-started/llms.txt          # .md page links     [walked]
  └─ /docs/node/llms.txt                 # .md page links     [walked]
  └─ /docs/data/llms.txt                 # .md page links     [walked]
  └─ /docs/wallets/llms.txt              # .md page links     [walked]
  └─ /docs/rollups/llms.txt              # .md page links     [walked]
  └─ /docs/chains/llms.txt               # ~80 chain .txt links [walked, children dropped]
      └─ /docs/chains/ethereum/llms.txt  # eth_call, eth_chainId, …  [NOT REACHED]
      └─ /docs/chains/solana/llms.txt    # getBalance, getSlot, …    [NOT REACHED]
      └─ … 80 more chains                                            [NOT REACHED]

Five sections have a flat layout, so their pages are counted (~311 page URLs total). The Chains section has another nesting level for per-chain files, and all ~5,100 method pages live at depth 2. Those never make it into the URL pool.

Verbose afdocs output:

✗ llms-txt-freshness: llms.txt covers 311/5452 sitemap doc pages (6%); 5141 missing
      Fix: Your llms.txt covers less than 80% of your site's pages. ...

The 5,141 “missing” pages are mostly chain-specific RPC method docs:

  • /docs/chains/ethereum/ethereum-api-endpoints/eth-call
  • /docs/chains/solana/solana-api-endpoints/get-balance

These are in the llms.txt tree — just one level deeper than the walker currently explores.
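
To make the size of the gap concrete, here is a small arithmetic sketch. It assumes coverage is simply covered pages divided by sitemap pages, rounded to a whole percent (which matches the 6% in the output above, but is not taken from the afdocs source), and it uses the approximate figure of ~5,100 chain method pages, so the second number is an estimate only. `coveragePercent` is a hypothetical helper name.

```typescript
// Hypothetical helper; assumes coverage = covered / total, rounded to a whole percent.
function coveragePercent(covered: number, total: number): number {
  return Math.round((covered / total) * 100);
}

// Figures reported in the verbose output above.
const currentCoverage = coveragePercent(311, 5452);

// Estimate only: adds the ~5,100 method pages that sit one level deeper.
const estimatedCoverage = coveragePercent(311 + 5100, 5452);

console.log(currentCoverage, estimatedCoverage);
```

Under these assumptions, reaching the per-chain files would move the check from well below the 80% threshold to well above it.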

This also biases sampling for any check that flows through getUrlsFromCachedLlmsTxt:

  • markdown-content-parity
  • llms-txt-directive
  • etc.

All of these checks sample only from the 5 flat sections and never from the ~5,100 chain method pages.

Suggested behaviors (in priority order)

  1. Walk recursively, with bounded depth and file count

    • Cap the walk at, e.g., depth = 5 and 200 total fetches
    • Deduplicate visited .txt URLs so cycles and shared sub-indexes don’t cause repeated fetches
    • Current behavior is effectively “depth = 1 hardcoded”
  2. Treat the aggregate walk uniformly
    The seed URLs from the canonical llms.txt and discovered sub-links should go through the same classification logic (page URL vs aggregate-to-walk), instead of the current setup where the inner loop uses a different filter than the outer one.

  3. Surface the walked tree in details
    When verbose:

    • List which aggregate files were fetched
    • Include depth information
    • Show whether safety caps were hit
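
The three suggestions above could be combined roughly as follows. This is a minimal sketch, not the current afdocs implementation: `walkAggregates`, `FetchText`, `WalkResult`, `isAggregate`, and `extractLinks` are all hypothetical names, the rule "a URL whose path ends in .txt is an aggregate" is an assumption, and links are assumed to be markdown-style `[label](target)`. The fetcher is injected so the walker can be exercised without network access.

```typescript
// Hypothetical sketch; none of these names come from the afdocs codebase.
type FetchText = (url: string) => Promise<string | null>;

interface WalkResult {
  pageUrls: string[];                          // leaf page URLs found
  walked: { url: string; depth: number }[];    // aggregate files fetched
  capsHit: { depth: boolean; fetches: boolean };
}

// Assumption: any URL whose path ends in ".txt" is an aggregate index to walk.
function isAggregate(url: string): boolean {
  return /\.txt$/i.test(new URL(url).pathname);
}

// Assumption: links are markdown-style [label](target); relative targets
// are resolved against the file they appear in.
function extractLinks(body: string, baseUrl: string): string[] {
  const pattern = /\[[^\]]*\]\(([^)\s]+)\)/g;
  const links: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(body)) !== null) {
    try {
      links.push(new URL(m[1], baseUrl).href);
    } catch {
      // Skip targets that cannot be parsed as URLs.
    }
  }
  return links;
}

async function walkAggregates(
  seeds: string[],          // absolute URLs, e.g. the canonical llms.txt (depth 0)
  fetchText: FetchText,
  maxDepth = 5,
  maxFetches = 200,
): Promise<WalkResult> {
  const pages = new Set<string>();
  const visited = new Set<string>();
  const walked: { url: string; depth: number }[] = [];
  const capsHit = { depth: false, fetches: false };
  // Seeds and discovered links share one queue, so both go through
  // the same page-vs-aggregate classification.
  const queue = seeds.map((url) => ({ url, depth: 0 }));

  while (queue.length > 0) {
    const { url, depth } = queue.shift()!;
    if (!isAggregate(url)) { pages.add(url); continue; }
    if (visited.has(url)) continue;            // cycles and shared sub-indexes
    if (depth > maxDepth) { capsHit.depth = true; continue; }
    if (walked.length >= maxFetches) { capsHit.fetches = true; continue; }

    visited.add(url);
    walked.push({ url, depth });               // recorded even if the fetch fails
    const body = await fetchText(url);
    if (body === null) continue;
    for (const link of extractLinks(body, url)) {
      queue.push({ url: link, depth: depth + 1 });
    }
  }
  return { pageUrls: Array.from(pages), walked, capsHit };
}
```

With the root llms.txt as the only seed, `maxDepth = 1` approximates the behavior described in this issue (section files are walked, per-chain files are not), while the default of 5 would reach the per-chain files. The `walked` and `capsHit` fields are what a verbose mode could print.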

Workarounds tried

  • None viable from the user side.
    The site’s llms.txt is structured exactly as the spec recommends; the bug is on the consumer (afdocs) side.

  • Flattening llms.txt is technically possible, but defeats the purpose of the size-based recommendation to split files.
