Feature Request: OpenBrain Playwright Fallback Retrieval
Work item: OB-0MNHT5HTC0070EL7 (OpenBrain project; tracked in this repo as GitHub issue #5)
Stage: intake_complete
Prepared by: SourceBase (Discord bot integration layer)
Handoff target: OpenBrain PM / Engineering
Problem Statement
Some web pages render their primary content with client-side JavaScript. The current OpenBrain
retrieval path (fast HTML extraction) fails to return usable content for these pages, causing
ob add <url> to ingest an empty or near-empty document. A fallback that uses a headless browser
(Playwright) is needed so OpenBrain can ingest JavaScript-heavy pages reliably when the primary
extractor returns insufficient content.
Users and User Stories
Discord community operators
As a community operator, when someone posts a JS-heavy article in our Discord I want the link
ingested by OpenBrain so the knowledge is searchable — without needing to manually pre-render
the page.
Automation authors / operators running ob add
As an automation author I want a configurable opt-in fallback so I can control the additional
resource cost and CI behaviour that a headless browser introduces.
OpenBrain engineers implementing the fallback
As an engineer I need a clear set of technical acceptance criteria and test strategies so I can
implement Playwright retrieval safely and verify it in CI without requiring a real browser on
every run.
Technical Acceptance Criteria
- PlaywrightExtractor class: A new extractor (e.g. src/lib/ingestion/extractor-playwright.ts)
  implements the same interface as the existing extractor so it can be swapped in without changes
  to the ingestion pipeline (see src/lib/ingestion/service.ts and src/lib/ingestion/extractor.ts).
- Opt-in configuration flag: Playwright fallback is disabled by default. It is enabled via an
  explicit configuration flag (e.g. ingestion.playwrightFallback: true in the OpenBrain config
  file or OB_PLAYWRIGHT_FALLBACK=1 environment variable). Running ob add without the flag must
  not launch a browser process.
- Trigger condition: The fallback fires only when the primary extractor returns a document
  whose extracted-text length is below a configurable threshold (e.g. minContentLength: 200
  characters). The default value is to be determined during implementation.
- Content compatibility: The HTML/text produced by PlaywrightExtractor must be parseable by
  the same downstream ingestion pipeline that processes output from the existing extractor (entry
  point: src/cli/commands/add.ts). No structural changes to the ingestion pipeline are required.
- Graceful degradation: If Playwright is not installed or fails to launch, the ingestion run
  logs a warning and completes with whatever content the primary extractor returned (empty or
  partial), rather than throwing an unhandled error.
- No credential leakage: The Playwright session must not persist cookies, local storage, or
  auth tokens between runs. Each retrieval uses a fresh browser context.
- Timeout: Browser navigation has an explicit timeout (default: 30 s, configurable). A
  timeout is handled the same way as a Playwright launch failure: log a warning and continue
  with the primary extractor's content.
- Telemetry emitted: On every fallback invocation the implementation emits a structured log
  entry (see Telemetry / Diagnostics Requirements).
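The opt-in flag and trigger condition above can be sketched as two small pure functions. This is a minimal sketch, not OpenBrain's actual configuration code: the names `FallbackConfig`, `resolveFallbackConfig`, `shouldRunFallback`, and `navigationTimeoutMs` are hypothetical, the value 200 is only the example threshold from the criteria, and letting the environment variable override the config file is an assumption.

```typescript
// Hypothetical config shape. Only the flag, the threshold, and the 30 s
// timeout default are taken from the acceptance criteria.
interface FallbackConfig {
  playwrightFallback: boolean;
  minContentLength: number;
  navigationTimeoutMs: number;
}

const DEFAULTS: FallbackConfig = {
  playwrightFallback: false, // opt-in: disabled unless explicitly enabled
  minContentLength: 200, // example value; the real default is still undecided
  navigationTimeoutMs: 30000,
};

// Merge file config with the environment. Letting the environment variable
// win over the config file is an assumption of this sketch.
function resolveFallbackConfig(
  fileConfig: Partial<FallbackConfig>,
  env: Record<string, string | undefined>,
): FallbackConfig {
  const merged: FallbackConfig = { ...DEFAULTS, ...fileConfig };
  const flag = env["OB_PLAYWRIGHT_FALLBACK"];
  if (flag !== undefined) {
    merged.playwrightFallback = flag === "1";
  }
  return merged;
}

// The fallback fires only when it is enabled and the primary text is too short.
function shouldRunFallback(config: FallbackConfig, primaryTextLength: number): boolean {
  return config.playwrightFallback && primaryTextLength < config.minContentLength;
}
```

Keeping both functions free of I/O means the "flag off, no browser" rule can be unit-tested without touching the file system or launching anything.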
CI / Testing Strategy
Guiding principle
Playwright introduces a real browser runtime that is impractical on every public CI run. The
testing strategy separates fast unit tests (always run) from integration tests (opt-in or
gated).
Record / playback fixtures (recommended default)
- Record a set of HTTP interaction fixtures (e.g. using Playwright's network interception or a
  companion HTTP mock server such as msw / nock) covering:
  - A JS-heavy page for which the primary extractor returns empty content.
  - A page that the primary extractor handles correctly (fallback must not fire).
  - A page that triggers a Playwright navigation timeout.
  - A page behind a redirect chain.
- Store fixtures in test/fixtures/playwright-fallback/.
- Unit / integration tests use the fixtures; a real browser is never launched in public CI.
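One way to replay such fixtures is an extractor that serves recorded results instead of launching a browser. The sketch below simplifies the idea: it replays recorded extraction results per URL rather than raw HTTP traffic, and the fixture shape (`FallbackFixture`), the class name `ReplayExtractor`, and the `{ text }` result shape are all hypothetical. Loading the JSON files from test/fixtures/playwright-fallback/ is left out so the example stays self-contained.

```typescript
// Hypothetical fixture record; a real recording format may differ.
interface FallbackFixture {
  url: string;
  renderedText: string | null; // text the browser produced, or null on failure
  error: "timeout" | null;
}

interface ExtractResult {
  text: string; // assumed result shape
}

// Serves recorded results so tests never start a real browser.
class ReplayExtractor {
  private readonly fixtures: FallbackFixture[];

  constructor(fixtures: FallbackFixture[]) {
    this.fixtures = fixtures;
  }

  async extract(url: string): Promise<ExtractResult> {
    const fixture = this.fixtures.find((candidate) => candidate.url === url);
    if (fixture === undefined) {
      // Fail loudly: an unrecorded URL means the test would need the network.
      throw new Error(`no fixture recorded for ${url}`);
    }
    if (fixture.error === "timeout") {
      const timeout = new Error("navigation timeout (replayed)");
      timeout.name = "TimeoutError";
      throw timeout;
    }
    return { text: fixture.renderedText ?? "" };
  }
}
```

Throwing on an unrecorded URL keeps public CI hermetic: a missing fixture shows up as a test failure rather than as a silent network call.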
Mock / stub option (minimal)
An alternative for projects that cannot store HTTP fixtures: stub the PlaywrightExtractor retrieval
method (written as extract(url) in the implementation sketch) at the module boundary and assert on:
- The correct call sequencing (primary extractor first, then fallback).
- The telemetry payload emitted.
- Graceful-degradation paths (install error, timeout).
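The three assertions listed above can be exercised with plain stubs and no test framework. In this sketch `ingestWithFallback` is a hypothetical stand-in for the orchestration in service.ts, and `stubExtractor` and the `{ text }` result shape are assumptions made for illustration only.

```typescript
interface ExtractResult {
  text: string; // assumed result shape
}
interface Extractor {
  extract(url: string): Promise<ExtractResult>;
}
type LogEntry = Record<string, unknown>;

// Hypothetical stand-in for the fallback orchestration under test.
async function ingestWithFallback(
  url: string,
  primary: Extractor,
  fallback: Extractor,
  minContentLength: number,
  log: (entry: LogEntry) => void,
): Promise<ExtractResult> {
  const primaryResult = await primary.extract(url);
  if (primaryResult.text.length >= minContentLength) {
    log({ event: "playwright_fallback", triggered: false });
    return primaryResult;
  }
  try {
    const fallbackResult = await fallback.extract(url);
    const success = fallbackResult.text.length > 0;
    log({ event: "playwright_fallback", triggered: true, success });
    return success ? fallbackResult : primaryResult;
  } catch {
    // Graceful degradation: keep whatever the primary extractor returned.
    log({ event: "playwright_fallback", triggered: true, success: false });
    return primaryResult;
  }
}

// A stub records its name in a shared list so a test can check call order.
function stubExtractor(name: string, calls: string[], outcome: string | Error): Extractor {
  return {
    async extract(): Promise<ExtractResult> {
      calls.push(name);
      if (outcome instanceof Error) throw outcome;
      return { text: outcome };
    },
  };
}
```

A test then passes a stub that throws (for the install-error and timeout paths) or returns text, and inspects the shared `calls` array and the captured log entries.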
Full integration test (opt-in)
Gate behind an environment variable OB_PLAYWRIGHT_INTEGRATION=1. When set:
- Actually launch a Chromium browser.
- Fetch one real JS-heavy URL (or a local dev server that serves a SPA).
- Assert that ingested content is non-empty and passes the minimum-length threshold.
Suggested CI gate: run only on main branch pushes or nightly schedules; never on pull request
CI from forks to avoid resource/cost issues.
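The opt-in gate and the content assertion can be kept as two tiny helpers. This is a sketch only: `integrationEnabled` and `assertUsableContent` are hypothetical names, and treating exactly the string "1" as "enabled" is an assumption based on the variable value shown above.

```typescript
// True only when the opt-in variable is set to "1".
function integrationEnabled(env: Record<string, string | undefined>): boolean {
  return env["OB_PLAYWRIGHT_INTEGRATION"] === "1";
}

// Fails when the ingested text is empty or shorter than the threshold.
function assertUsableContent(text: string, minContentLength: number): void {
  const length = text.trim().length;
  if (length < minContentLength) {
    throw new Error(`expected at least ${minContentLength} characters, got ${length}`);
  }
}
```

In a test file the real-browser suite would be wrapped in a check such as `integrationEnabled(process.env)`, so that a default run skips it without launching Chromium.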
Self-hosted runner consideration
Full integration tests should target a self-hosted runner with Chromium pre-installed, or use the
official Playwright Docker image (mcr.microsoft.com/playwright). Document the runner label
requirement in the GitHub Actions workflow file.
Existing test reference
See the YouTube ingestion fallback (OB-0MNFXR3E4005TGYX) for a prior example of provider-specific
fallback tests and diagnostics.
Telemetry / Diagnostics Requirements
Every fallback decision must emit a structured log entry, including runs where the fallback is
evaluated but not triggered. Telemetry must be non-sensitive: record only metadata, never user
content, URLs, or secrets.
Suggested fields:
| Field | Type | Description |
| --- | --- | --- |
| event | string | Fixed value "playwright_fallback" |
| triggered | boolean | Whether the fallback actually ran (false when the primary content met the length threshold) |
| primaryContentLength | number | Character count returned by primary extractor |
| fallbackContentLength | number | Character count returned by PlaywrightExtractor (0 if not run or failed) |
| durationMs | number | Wall-clock time of the Playwright fetch in milliseconds |
| success | boolean | Whether PlaywrightExtractor returned usable content |
| errorType | string or null | One of "launch_failed", "timeout", "navigation_error", or null |
| provider | string | Always "playwright" for this fallback |
Log destination: existing OpenBrain diagnostics / structured logger. Do not send telemetry to a
remote endpoint by default; keep it local, and make shipping to any external service opt-in.
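The suggested fields map naturally onto a typed event plus a small builder. The sketch below is illustrative: `PlaywrightFallbackEvent`, `FallbackErrorType`, and `buildFallbackEvent` are hypothetical names, and deriving `success` from the other fields is an assumption rather than a stated requirement.

```typescript
type FallbackErrorType = "launch_failed" | "timeout" | "navigation_error" | null;

interface PlaywrightFallbackEvent {
  event: "playwright_fallback";
  triggered: boolean;
  primaryContentLength: number;
  fallbackContentLength: number;
  durationMs: number;
  success: boolean;
  errorType: FallbackErrorType;
  provider: "playwright";
}

// Builds an entry from lengths and timings only. The URL and page text are
// deliberately not parameters, so they cannot end up in the log.
function buildFallbackEvent(input: {
  triggered: boolean;
  primaryContentLength: number;
  fallbackContentLength?: number;
  durationMs?: number;
  errorType?: FallbackErrorType;
}): PlaywrightFallbackEvent {
  const fallbackContentLength = input.fallbackContentLength ?? 0;
  const errorType = input.errorType ?? null;
  return {
    event: "playwright_fallback",
    triggered: input.triggered,
    primaryContentLength: input.primaryContentLength,
    fallbackContentLength,
    durationMs: input.durationMs ?? 0,
    // Assumption: "usable" means the fallback ran, did not fail, and returned text.
    success: input.triggered && errorType === null && fallbackContentLength > 0,
    errorType,
    provider: "playwright",
  };
}
```

Restricting the builder's inputs to numbers and enums is one way to make the "no URLs, no content" rule hold by construction rather than by review.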
Implementation Sketch
src/lib/ingestion/
├── extractor.ts (existing — primary extractor, no changes required)
├── extractor-playwright.ts (new — PlaywrightExtractor, same interface)
└── service.ts (existing — add fallback orchestration logic)
src/cli/commands/add.ts (existing — passes through unchanged)
Suggested orchestration in service.ts:
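const primaryResult = await primaryExtractor.extract(url);
const primaryContentLength = primaryResult.text.length;
// The fallback is opt-in and fires only when the primary content is too short.
if (!config.playwrightFallback || primaryContentLength >= config.minContentLength) {
  emitTelemetry({ triggered: false, primaryContentLength });
  return primaryResult;
}
const startedAt = Date.now();
try {
  const fallbackResult = await playwrightExtractor.extract(url);
  const fallbackContentLength = fallbackResult.text.length;
  const success = fallbackContentLength > 0;
  emitTelemetry({ triggered: true, success, primaryContentLength, fallbackContentLength, durationMs: Date.now() - startedAt });
  return success ? fallbackResult : primaryResult;
} catch {
  // Graceful degradation: log a warning here, then keep the primary result.
  emitTelemetry({ triggered: true, success: false, primaryContentLength, fallbackContentLength: 0, durationMs: Date.now() - startedAt });
  return primaryResult;
}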
The above is illustrative. Exact method signatures must match the extractor interface defined in
src/lib/ingestion/extractor.ts.
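A possible shape for the new extractor, with a fresh browser context per retrieval and an explicit navigation timeout, is sketched below. It is written against a small structural subset of the Playwright API (`launch`, `newContext`, `newPage`, `goto`, `innerText`, `close`) so it can be exercised with a fake launcher; with the real package, the `chromium` export is expected to satisfy `BrowserLauncher`. The interface names, the `{ text }` result shape, and the choice of `waitUntil: "networkidle"` are assumptions, not the interface defined in extractor.ts.

```typescript
// Structural subset of the Playwright objects this sketch relies on.
interface PageLike {
  goto(
    url: string,
    options?: { timeout?: number; waitUntil?: "load" | "domcontentloaded" | "networkidle" },
  ): Promise<unknown>;
  innerText(selector: string): Promise<string>;
}
interface ContextLike {
  newPage(): Promise<PageLike>;
  close(): Promise<void>;
}
interface BrowserLike {
  newContext(): Promise<ContextLike>;
  close(): Promise<void>;
}
interface BrowserLauncher {
  launch(): Promise<BrowserLike>;
}

interface ExtractResult {
  text: string; // assumed result shape
}

class PlaywrightExtractor {
  private readonly launcher: BrowserLauncher;
  private readonly timeoutMs: number;

  constructor(launcher: BrowserLauncher, timeoutMs: number = 30000) {
    this.launcher = launcher;
    this.timeoutMs = timeoutMs;
  }

  async extract(url: string): Promise<ExtractResult> {
    const browser = await this.launcher.launch();
    try {
      // A new context per retrieval starts without cookies or stored state.
      const context = await browser.newContext();
      try {
        const page = await context.newPage();
        // Waiting for network idle is a tuning choice for JS-rendered pages.
        await page.goto(url, { timeout: this.timeoutMs, waitUntil: "networkidle" });
        return { text: await page.innerText("body") };
      } finally {
        await context.close();
      }
    } finally {
      await browser.close();
    }
  }
}
```

The nested `finally` blocks close the context and the browser even when navigation throws, so a timeout does not leave a browser process behind. Errors are left to propagate so the orchestration layer can classify them and fall back to the primary result.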
Dependency management
Add playwright (or @playwright/test) as an optional peer dependency so users who do not
enable the fallback do not need to install it. Guard the require/import behind a runtime
check to avoid import errors when the package is absent.
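The runtime guard can be as small as a dynamic import wrapped in a try/catch. In this sketch the helper names `loadOptionalModule` and `loadPlaywright` and the warning text are hypothetical; holding the specifier in a variable is a common way to keep the compiler from requiring the package at build time.

```typescript
// Load an optional package at runtime, returning null when it is absent.
// The specifier is passed as a variable so the compiler does not need the
// package's types to be installed.
async function loadOptionalModule(name: string): Promise<object | null> {
  try {
    return await import(name);
  } catch {
    return null;
  }
}

// Resolve Playwright, or warn and report that it is unavailable.
async function loadPlaywright(warn: (message: string) => void): Promise<object | null> {
  const loaded = await loadOptionalModule("playwright");
  if (loaded === null) {
    warn("playwright is not installed; continuing with primary extractor content");
  }
  return loaded;
}
```

Callers treat a `null` result the same way as a launch failure: log the warning and return the primary extractor's content.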
Related Work and Code Pointers
Implementers should review the following before starting:
| Reference | Relevance |
| --- | --- |
| OB-0MN9HWGAL001452N (Ingest CLI: file and URL ingestion) | Primary ingestion flow. Playwright output must be compatible with this pipeline. |
| OB-0MNFXR3E4005TGYX (Fix YouTube ingestion for ob add) | Prior provider-specific fallback: tests, diagnostics, and error handling patterns to reuse. |
| OB-0MN9CZ48N0053L9Q (Create a full PRD for OpenBrain) | Product-level guidance: local-first preferences and documented fallback policies. |
| src/cli/commands/add.ts | CLI entry point for ob add; Playwright output must be usable here. |
| src/lib/ingestion/service.ts | Ingestion orchestration; fallback logic belongs here. |
| src/lib/ingestion/extractor.ts | Extractor interface that PlaywrightExtractor must implement. |
Constraints
- Scope boundary: Implementation belongs in the OpenBrain repo. SourceBase (this repository)
is the Discord bot integration layer and produced this document only.
- Opt-in only: Playwright must never run unless explicitly enabled; do not change default
behaviour of ob add.
- No secrets in telemetry: Telemetry fields must not include URLs, page text, cookies,
auth tokens, or any user-identifiable data.
- Optional dependency: Playwright is a heavy runtime dependency. Declare it as optional/peer
so existing installs are not forced to download browser binaries.
- Record/playback for public CI: Real browser launches must be gated; fixture-based tests must
be the default.
Open Questions
- Should the minimum-content-length threshold be per-domain (allow-list) or global? Global is
simpler; per-domain gives more control.
- Which Playwright browser engine should be the default (chromium, firefox, webkit)?
Recommend chromium for widest compatibility, but make it configurable.
- Should the fallback be retried on timeout, or fail immediately? Recommend fail-immediately on
first timeout to keep latency predictable.
- Is there an existing structured-log / telemetry abstraction in OpenBrain to hook into, or does
the engineer need to introduce one?
Acceptance Checklist (for PM / Engineering reviewers)