Feature request: OpenBrain Playwright fallback retrieval (SourceBase: produce request doc) #5

@SorraTheOrc

Description

Intake Draft: OpenBrain Playwright fallback retrieval (OB-0MNHT5HTC0070EL7)

Headline summary
Produce a concise feature request document (docs/feature-requests/openbrain-playwright-fallback-retrieval.md) that specifies requirements, acceptance criteria, CI/test strategies, and telemetry for adding a Playwright-based retrieval fallback in OpenBrain. SourceBase will author and hand this document to PM/Engineering; implementation belongs in the OpenBrain repo.

Problem statement
Some web pages render their primary content with client-side JavaScript. The current OpenBrain retrieval path (fast HTML extraction) can fail to return usable content for these pages, causing ingestion to fail. A documented fallback that uses a headless browser (Playwright) is needed so OpenBrain can ingest JavaScript-heavy pages reliably when the primary extractor is insufficient.
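
The fallback behaviour described above could be sketched as follows. This is a minimal illustration, not OpenBrain's actual code: `ExtractionResult`, `Extractor`, `extractWithFallback`, and the `MIN_USABLE_CHARS` threshold are all hypothetical names and values chosen for the example, and the "too short to be usable" heuristic is an assumption.

```typescript
// Hypothetical types; none of these names come from the OpenBrain codebase.
interface ExtractionResult {
  text: string;
  provider: string; // which extractor produced the text
}

type Extractor = (url: string) => Promise<ExtractionResult>;

interface FallbackOutcome extends ExtractionResult {
  fallbackUsed: boolean;
}

// Assumed heuristic: very short output means the page body was rendered client-side.
const MIN_USABLE_CHARS = 200;

function isUsable(result: ExtractionResult): boolean {
  return result.text.trim().length >= MIN_USABLE_CHARS;
}

// Try the fast extractor first; use the fallback only when the primary
// throws or returns too little content.
async function extractWithFallback(
  url: string,
  primary: Extractor,
  fallback?: Extractor,
): Promise<FallbackOutcome> {
  let primaryError: unknown;
  try {
    const result = await primary(url);
    if (isUsable(result)) return { ...result, fallbackUsed: false };
  } catch (err) {
    primaryError = err;
  }
  if (!fallback) {
    // No fallback configured: surface the original failure if there was one.
    throw primaryError ?? new Error(`No usable content extracted from ${url}`);
  }
  // The fallback result is returned as-is; a real implementation may want to
  // validate it as well.
  const rendered = await fallback(url);
  return { ...rendered, fallbackUsed: true };
}
```

Passing the extractors in as functions keeps the decision logic testable without a browser; the headless-browser extractor would be supplied as `fallback`.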

Users

  • Discord community operators who post links and expect them to be indexed by OpenBrain (example: as a community operator, when someone posts a JS-heavy article I want the link ingested so the knowledge is searchable).
  • OpenBrain operators and automation authors who run ob add in varied environments (example: as an automation author I want a configurable fallback so I can control resource and CI behaviour).
  • OpenBrain engineers who will implement the fallback (example: as an engineer I need a clear set of technical acceptance criteria and test strategies to implement Playwright retrieval safely).
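
For the automation-author story, one possible shape for an opt-in configuration is sketched below. The variable names (`OB_PLAYWRIGHT_FALLBACK`, `OB_PLAYWRIGHT_TIMEOUT_MS`, `OB_PLAYWRIGHT_MODE`) and the 15-second default are assumptions made for illustration; the real flag names and defaults are for the OpenBrain team to decide.

```typescript
// Hypothetical configuration shape; not an existing OpenBrain API.
interface FallbackConfig {
  enabled: boolean;           // fallback is off unless explicitly enabled
  timeoutMs: number;          // upper bound on a single browser render
  mode: "live" | "playback";  // "playback" serves recorded fixtures (for CI)
}

const DEFAULT_TIMEOUT_MS = 15000; // assumed default

// Takes the environment as a plain object so it can be tested without Node globals.
function readFallbackConfig(env: Record<string, string | undefined>): FallbackConfig {
  const enabled = env["OB_PLAYWRIGHT_FALLBACK"] === "1";
  const parsed = Number(env["OB_PLAYWRIGHT_TIMEOUT_MS"]);
  // Missing, empty, or non-numeric values fall back to the default.
  const timeoutMs = Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_TIMEOUT_MS;
  const mode = env["OB_PLAYWRIGHT_MODE"] === "playback" ? "playback" : "live";
  return { enabled, timeoutMs, mode };
}
```

In a Node process this would be called as `readFallbackConfig(process.env)`; leaving every variable unset keeps the fallback disabled.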

Success criteria

  • The feature request document exists at docs/feature-requests/openbrain-playwright-fallback-retrieval.md and contains: problem statement, users and user stories, technical acceptance criteria, CI/testing strategy (record/playback + mock option), telemetry/diagnostics requirements, and an implementation sketch.
  • The work item OB-0MNHT5HTC0070EL7 references the doc (link present in description) and the work item stage is set to intake_complete.
  • The document references the following related code and work items so implementers can start without additional discovery: OB-0MN9HWGAL001452N, OB-0MNFXR3E4005TGYX, src/cli/commands/add.ts, src/lib/ingestion/service.ts, and src/lib/ingestion/extractor.ts.

Constraints

  • This repository (SourceBase) is the Discord bot integration layer; the retrieval fallback implementation belongs in the OpenBrain repo. SourceBase's scope is limited to producing the feature request doc and making any necessary documentation/behavior changes to the bot if explicitly requested.
  • Playwright introduces platform/runtime dependencies and resource costs; prefer an opt-in configuration flag and record/playback fixtures for public CI runs.
  • Telemetry and diagnostics must be non-sensitive and must record only metadata (fallback used, provider, duration, error notes); do not persist user secrets.
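
The telemetry constraint could be made concrete with an event type that carries only the four metadata items listed above. This is a sketch under assumptions: the field names, the 200-character cap, and the URL-redaction rule are illustrative choices, not an agreed schema.

```typescript
// Hypothetical event shape covering only: fallback used, provider, duration, error note.
interface FallbackTelemetryEvent {
  fallbackUsed: boolean;
  provider: string;
  durationMs: number;
  errorNote?: string;
}

const MAX_NOTE_CHARS = 200; // assumed cap to keep notes short

// Error messages often embed the full URL, which may carry tokens in the
// query string, so URLs are replaced before the note is recorded.
function sanitizeNote(note: string): string {
  return note.replace(/https?:\/\/\S+/g, "[url]").slice(0, MAX_NOTE_CHARS);
}

function buildFallbackEvent(
  provider: string,
  fallbackUsed: boolean,
  durationMs: number,
  error?: unknown,
): FallbackTelemetryEvent {
  const base: FallbackTelemetryEvent = {
    fallbackUsed,
    provider,
    durationMs: Math.max(0, Math.round(durationMs)),
  };
  if (error === undefined) return base;
  const message = error instanceof Error ? error.message : String(error);
  return { ...base, errorNote: sanitizeNote(message) };
}
```

Page content, cookies, and headers never enter the event, so there is nothing sensitive to strip later in the pipeline.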

Existing state

  • Work item exists: OB-0MNHT5HTC0070EL7 (current stage: idea; assignee Map).
  • No file found at docs/feature-requests/openbrain-playwright-fallback-retrieval.md (agent search: file not present).
  • The repo already uses Playwright in test-related tooling (dev deps include @vitest/browser-playwright) and the ingestion pipeline is exercised by ob add (entrypoints: src/cli/commands/add.ts and src/lib/ingestion/service.ts).
  • Related prior work items exist addressing ingestion and provider-specific fallbacks (see Related work below).

Desired change

  • Create docs/feature-requests/openbrain-playwright-fallback-retrieval.md containing the sections described under Success criteria. The document should be review-ready for PM/Engineering handoff and include suggested telemetry fields and a CI/test strategy.
  • Update OB-0MNHT5HTC0070EL7 description to link the doc and set stage to intake_complete.
  • Optionally: create follow-up child work items in OpenBrain for implementation (PlaywrightExtractor, CI test harness, telemetry), but do not implement code changes in SourceBase as part of this item.
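
One way the record/playback part of the CI/test strategy might look is sketched below. It is an illustration only: `RenderFn`, `HarnessMode`, and `withRecordPlayback` are hypothetical names, and fixtures are held in an in-memory `Map` where a real harness would presumably persist them to files checked into the repo.

```typescript
// Hypothetical signature for "fetch the rendered HTML of a page".
type RenderFn = (url: string) => Promise<string>;
type HarnessMode = "record" | "playback";

// Wraps a live renderer. In "record" mode it calls the live renderer and stores
// the result; in "playback" mode it serves stored results and never calls it.
function withRecordPlayback(
  live: RenderFn,
  fixtures: Map<string, string>,
  mode: HarnessMode,
): RenderFn {
  return async (url: string): Promise<string> => {
    if (mode === "playback") {
      const saved = fixtures.get(url);
      if (saved === undefined) {
        // Failing loudly keeps public CI from silently reaching the network.
        throw new Error(`No recorded fixture for ${url}`);
      }
      return saved;
    }
    const html = await live(url);
    fixtures.set(url, html);
    return html;
  };
}
```

Public CI would run in "playback" mode only, so the browser binaries and network access are needed just where fixtures are recorded.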

Related work

  • OB-0MN9HWGAL001452N - Ingest CLI: file and URL ingestion
    Relevance: Primary ingestion flow (src/cli/commands/add.ts -> src/lib/ingestion/service.ts). Playwright output should be compatible with this ingestion pipeline.
  • OB-0MNFXR3E4005TGYX - Fix YouTube ingestion for ob add
    Relevance: Example of a provider-specific fallback and associated tests/diagnostics.
  • OB-0MN9CZ48N0053L9Q - Create a full PRD for OpenBrain
    Relevance: Product-level guidance: local-first preferences and documented fallback policies.
  • OB-0MNGPYRSR00472F3 - CLI: Add "ob summary " command
    Relevance: Demonstrates how CLI triggers ingestion/summarization and where Playwright-extracted content would be consumed.
  • OB-0MNK32JBQ008T8ND - Add CI workflow to run benchmark and publish results
    Relevance: CI/record-playback strategy and notes about mocking heavy dependencies for public CI.

Relevant files (starting points)

  • src/cli/commands/add.ts - CLI entrypoint for URL ingestion
  • src/lib/ingestion/service.ts - ingestion pipeline (extraction -> summarization -> persist)
  • src/lib/ingestion/extractor.ts - extractor interface and plugin points
  • src/lib/ingestion/youtube.ts - provider-specific pattern (YouTube handler)
  • tests/acceptance/ingest-e2e.test.ts - acceptance harness to validate retrieval -> summarization -> DB
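
As a starting point for the extractor plugin point, the browser-rendering step might be sketched like this. The contents of src/lib/ingestion/extractor.ts are not shown in this issue, so nothing here reflects its real interface: `PageLike`, `BrowserLike`, and `renderWithBrowser` are hypothetical, and the two interfaces are a small subset modelled on Playwright's `Browser` and `Page` methods (`newPage`, `goto`, `content`, `close`) so a fake can stand in during tests.

```typescript
// Minimal structural subset of a browser page, modelled on Playwright's Page.
interface PageLike {
  goto(
    url: string,
    options?: { waitUntil?: "load" | "domcontentloaded" | "networkidle"; timeout?: number },
  ): Promise<unknown>;
  content(): Promise<string>;
}

// Minimal structural subset of a browser, modelled on Playwright's Browser.
interface BrowserLike {
  newPage(): Promise<PageLike>;
  close(): Promise<void>;
}

// Launches a browser, loads the URL, waits for the network to go idle so that
// client-side rendering has a chance to finish, and returns the resulting HTML.
async function renderWithBrowser(
  launch: () => Promise<BrowserLike>,
  url: string,
  timeoutMs: number,
): Promise<string> {
  const browser = await launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle", timeout: timeoutMs });
    return await page.content();
  } finally {
    // Always release the browser, including when navigation times out.
    await browser.close();
  }
}
```

With Playwright installed, `launch` would likely be `() => chromium.launch({ headless: true })`; the returned HTML would then be handed to the existing extraction step so the output stays compatible with the ingestion pipeline.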

Appendix: Clarifying questions & answers

  • Q: "Should SourceBase implement the Playwright fallback code or produce a feature request document?" - Answer (work item OB-0MNHT5HTC0070EL7): "Produce a feature request document in this repo; the retrieval fallback implementation belongs in OpenBrain." Source: work item description. Final: yes.
  • Q: "Does the feature request document already exist at docs/feature-requests/openbrain-playwright-fallback-retrieval.md?" - Answer (agent search): No. Evidence: attempt to read file returned File not found; repository search returned no matches for that path.
  • Q: "What related work items and files should be referenced in the doc?" - Answer (agent inference using wl search and repo grep): See Related work and Relevant files sections above. Evidence: wl search results and repo file matches (src/cli/commands/add.ts, src/lib/ingestion/service.ts, tests/acceptance/ingest-e2e.test.ts). Final: included.
  • Q: "Are Playwright dependencies already present in the repo tooling?" - Answer (agent search): Yes; dev dependencies include @vitest/browser-playwright (package-lock.json) and tests reference Playwright in test harness files. Evidence: package-lock and test files.
