diff --git a/.worklog/config.yaml b/.worklog/config.yaml
index 56d333c7..12db49f2 100644
--- a/.worklog/config.yaml
+++ b/.worklog/config.yaml
@@ -1,5 +1,6 @@
 projectName: Worklog
 prefix: WL
-autoExport: true
 autoSync: false
 githubRepo: TheWizardsCode/ContextHub
+githubLabelPrefix: 'wl:'
+githubImportCreateNew: true
diff --git a/docs/feature-requests/openbrain-playwright-fallback-retrieval.md b/docs/feature-requests/openbrain-playwright-fallback-retrieval.md
new file mode 100644
index 00000000..aa04611c
--- /dev/null
+++ b/docs/feature-requests/openbrain-playwright-fallback-retrieval.md
@@ -0,0 +1,229 @@
+# Feature Request: OpenBrain Playwright Fallback Retrieval
+
+**Work item:** OB-0MNHT5HTC0070EL7 *(OpenBrain project; tracked in this repo as GitHub issue #5)*
+**Stage:** intake_complete
+**Prepared by:** SourceBase (Discord bot integration layer)
+**Handoff target:** OpenBrain PM / Engineering
+
+---
+
+## Problem Statement
+
+Some web pages render their primary content with client-side JavaScript. The current OpenBrain
+retrieval path (fast HTML extraction) fails to return usable content for these pages, causing
+`ob add` to ingest an empty or near-empty document. A fallback that uses a headless browser
+(Playwright) is needed so OpenBrain can ingest JavaScript-heavy pages reliably when the primary
+extractor returns insufficient content.
+
+---
+
+## Users and User Stories
+
+### Discord community operators
+> As a community operator, when someone posts a JS-heavy article in our Discord I want the link
+> ingested by OpenBrain so the knowledge is searchable — without needing to manually pre-render
+> the page.
+
+### Automation authors / operators running `ob add`
+> As an automation author I want a configurable opt-in fallback so I can control the additional
+> resource cost and CI behaviour that a headless browser introduces.
+
+### OpenBrain engineers implementing the fallback
+> As an engineer I need a clear set of technical acceptance criteria and test strategies so I can
+> implement Playwright retrieval safely and verify it in CI without requiring a real browser on
+> every run.
+
+---
+
+## Technical Acceptance Criteria
+
+1. **PlaywrightExtractor class** — A new extractor (e.g. `src/lib/ingestion/extractor-playwright.ts`)
+   implements the same interface as the existing extractor so it can be swapped in without changes
+   to the ingestion pipeline (see `src/lib/ingestion/service.ts` and `src/lib/ingestion/extractor.ts`).
+
+2. **Opt-in configuration flag** — Playwright fallback is disabled by default. It is enabled via an
+   explicit configuration flag (e.g. `ingestion.playwrightFallback: true` in the OpenBrain config
+   file or `OB_PLAYWRIGHT_FALLBACK=1` environment variable). Running `ob add` without the flag must
+   not launch a browser process.
+
+3. **Trigger condition** — The fallback fires only when the primary extractor returns a document
+   whose extracted-text length is below a minimum-length threshold (e.g. `minContentLength: 200`
+   characters). The threshold must be configurable; its default value is to be determined during
+   implementation.
+
+4. **Content compatibility** — The HTML/text produced by PlaywrightExtractor must be parseable by
+   the same downstream ingestion pipeline that processes output from the existing extractor (entry
+   point: `src/cli/commands/add.ts`). No structural changes to the ingestion pipeline are required.
+
+5. **Graceful degradation** — If Playwright is not installed or fails to launch, the ingestion run
+   logs a warning and completes with whatever content the primary extractor returned (empty or
+   partial), rather than throwing an unhandled error.
+
+6. **No credential leakage** — The Playwright session must not persist cookies, local storage, or
+   auth tokens between runs. Each retrieval uses a fresh browser context.
+
+7. **Timeout** — Browser navigation has an explicit timeout (default: 30 s, configurable). A
+   timeout is treated the same as a Playwright launch failure (warn + continue).
+
+8. **Telemetry emitted** — On every fallback invocation the implementation emits a structured log
+   entry (see Telemetry section).
+
+---
+
+## CI / Testing Strategy
+
+### Guiding principle
+Playwright introduces a real browser runtime that is impractical to launch on every public CI
+run. The testing strategy separates fast unit tests (always run) from integration tests (opt-in
+or gated).
+
+### Record / playback fixtures (recommended default)
+1. Record a set of HTTP interaction fixtures (e.g. using Playwright's network interception or a
+   companion HTTP mocking library such as `msw` / `nock`) covering:
+   - A JS-heavy page for which the primary extractor would return empty content.
+   - A page that the primary extractor handles correctly (fallback must not fire).
+   - A page that triggers a Playwright navigation timeout.
+   - A page behind a redirect chain.
+2. Store fixtures in `test/fixtures/playwright-fallback/`.
+3. Unit / integration tests use the fixtures; a real browser is never launched in public CI.
+
+### Mock / stub option (minimal)
+An alternative for projects that cannot store HTTP fixtures: stub `PlaywrightExtractor.extract(url)`
+at the module boundary and assert on:
+- The correct call sequencing (primary extractor first, then fallback).
+- The telemetry payload emitted.
+- Graceful-degradation paths (install error, timeout).
+
+### Full integration test (opt-in)
+Gate behind an environment variable `OB_PLAYWRIGHT_INTEGRATION=1`. When set:
+- Actually launch a Chromium browser.
+- Fetch one real JS-heavy URL (or a local dev server that serves a SPA).
+- Assert that ingested content is non-empty and passes the minimum-length threshold.
+
+Suggested CI gate: run only on `main` branch pushes or nightly schedules; never on pull request
+CI from forks to avoid resource/cost issues.
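The opt-in gate for the full integration test can be reduced to a small predicate that the test suite consults before launching a browser. This is a minimal sketch under stated assumptions: only `OB_PLAYWRIGHT_INTEGRATION` comes from this request, while `shouldRunPlaywrightIntegration`, `GateInput`, and the `isForkPullRequest` input are hypothetical names, and how the CI workflow detects a fork is left open.

```typescript
// Hypothetical helper: decides whether the real-browser integration suite may run.
// Only the OB_PLAYWRIGHT_INTEGRATION variable is taken from the feature request.
interface GateInput {
  env: Record<string, string | undefined>; // e.g. process.env
  isForkPullRequest: boolean;              // assumed to be supplied by the CI workflow
}

function shouldRunPlaywrightIntegration(input: GateInput): boolean {
  // Never launch a browser for pull requests coming from forks.
  if (input.isForkPullRequest) {
    return false;
  }
  // Opt-in: only the exact value "1" enables the suite.
  return input.env["OB_PLAYWRIGHT_INTEGRATION"] === "1";
}
```

A test runner could evaluate this once at start-up and skip the integration suite whenever it returns `false`, so fixture-based tests stay the default.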
+
+### Self-hosted runner consideration
+Full integration tests should target a self-hosted runner with Chromium pre-installed, or use the
+official Playwright Docker image (`mcr.microsoft.com/playwright`). Document the runner label
+requirement in the GitHub Actions workflow file.
+
+### Existing test reference
+See the YouTube ingestion fallback (OB-0MNFXR3E4005TGYX) for a prior example of provider-specific
+fallback tests and diagnostics.
+
+---
+
+## Telemetry / Diagnostics Requirements
+
+Every Playwright fallback invocation must emit a structured log entry. Telemetry must be
+non-sensitive: record only metadata, never user content, URLs, or secrets.
+
+Suggested fields:
+
+| Field | Type | Description |
+|---|---|---|
+| `event` | string | Fixed value `"playwright_fallback"` |
+| `triggered` | boolean | Whether the fallback actually ran (false = trigger condition not met) |
+| `primaryContentLength` | number | Character count returned by primary extractor |
+| `fallbackContentLength` | number | Character count returned by PlaywrightExtractor (0 if not run or failed) |
+| `durationMs` | number | Wall-clock time of the Playwright fetch in milliseconds |
+| `success` | boolean | Whether PlaywrightExtractor returned usable content |
+| `errorType` | string \| null | One of: `"launch_failed"`, `"timeout"`, `"navigation_error"`, `null` |
+| `provider` | string | Always `"playwright"` for this fallback |
+
+Log destination: existing OpenBrain diagnostics / structured logger. Do not write telemetry to a
+remote endpoint; keep it local, and make shipping to any external service opt-in.
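The field table above maps naturally onto a typed event. The following is a minimal sketch, not OpenBrain's actual logger API: the field names and allowed values come from the table, while `buildFallbackEvent` and `FallbackOutcome` are hypothetical names introduced for illustration. The builder takes the extracted texts only to measure them, so no page content or URL can end up in the emitted entry.

```typescript
type FallbackErrorType = "launch_failed" | "timeout" | "navigation_error" | null;

// Mirrors the suggested telemetry fields one to one.
interface PlaywrightFallbackEvent {
  event: "playwright_fallback";
  triggered: boolean;
  primaryContentLength: number;
  fallbackContentLength: number;
  durationMs: number;
  success: boolean;
  errorType: FallbackErrorType;
  provider: "playwright";
}

// Hypothetical input shape: what the orchestration code knows after one run.
interface FallbackOutcome {
  triggered: boolean;
  primaryText: string;
  fallbackText?: string;
  durationMs?: number;
  errorType?: FallbackErrorType;
}

function buildFallbackEvent(outcome: FallbackOutcome): PlaywrightFallbackEvent {
  // Lengths are measured with String.length (UTF-16 code units); the texts are not copied.
  const fallbackContentLength = outcome.fallbackText?.length ?? 0;
  const errorType = outcome.errorType ?? null;
  return {
    event: "playwright_fallback",
    triggered: outcome.triggered,
    primaryContentLength: outcome.primaryText.length,
    fallbackContentLength,
    durationMs: outcome.durationMs ?? 0,
    // Usable content means: the fallback ran, did not fail, and returned text.
    success: outcome.triggered && errorType === null && fallbackContentLength > 0,
    errorType,
    provider: "playwright",
  };
}
```

Keeping the builder as the only place where an event is assembled gives the privacy review a single function to inspect.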
+
+---
+
+## Implementation Sketch
+
+```
+src/lib/ingestion/
+├── extractor.ts            (existing — primary extractor, no changes required)
+├── extractor-playwright.ts (new — PlaywrightExtractor, same interface)
+└── service.ts              (existing — add fallback orchestration logic)
+
+src/cli/commands/add.ts     (existing — passes through unchanged)
+```
+
+Suggested orchestration in `service.ts`:
+
+```typescript
+const primaryResult = await primaryExtractor.extract(url);
+const primaryContentLength = primaryResult.text.length;
+if (!config.playwrightFallback || primaryContentLength >= config.minContentLength) {
+  // Fallback disabled, or primary content is long enough: no browser is launched.
+  emitTelemetry({ triggered: false, primaryContentLength });
+  return primaryResult;
+}
+const startedAt = Date.now();
+try {
+  const fallbackResult = await playwrightExtractor.extract(url);
+  const fallbackContentLength = fallbackResult.text.length;
+  const durationMs = Date.now() - startedAt;
+  emitTelemetry({ triggered: true, primaryContentLength, fallbackContentLength, durationMs });
+  return fallbackContentLength > 0 ? fallbackResult : primaryResult;
+} catch (error) {
+  // Graceful degradation: warn and keep whatever the primary extractor returned.
+  logger.warn('Playwright fallback failed; using primary extractor output', error);
+  emitTelemetry({ triggered: true, primaryContentLength, fallbackContentLength: 0 });
+  return primaryResult;
+}
+```
+
+The above is illustrative: `logger` and `emitTelemetry` stand in for OpenBrain's own logging and
+telemetry helpers, and exact method signatures must match the extractor interface defined in
+`src/lib/ingestion/extractor.ts`.
+
+### Dependency management
+Add `playwright` (or `@playwright/test`) as an optional peer dependency so users who do not
+enable the fallback do not need to install it. Guard the `require`/`import` behind a runtime
+check to avoid import errors when the package is absent.
+
+---
+
+## Related Work and Code Pointers
+
+Implementers should review the following before starting:
+
+| Reference | Relevance |
+|---|---|
+| OB-0MN9HWGAL001452N — Ingest CLI: file and URL ingestion | Primary ingestion flow. Playwright output must be compatible with this pipeline. |
+| OB-0MNFXR3E4005TGYX — Fix YouTube ingestion for ob add | Prior provider-specific fallback: tests, diagnostics, and error handling patterns to reuse. |
+| OB-0MN9CZ48N0053L9Q — Create a full PRD for OpenBrain | Product-level guidance: local-first preferences and documented fallback policies. |
+| `src/cli/commands/add.ts` | CLI entry point for `ob add`; Playwright output must be usable here. |
+| `src/lib/ingestion/service.ts` | Ingestion orchestration; fallback logic belongs here. |
+| `src/lib/ingestion/extractor.ts` | Extractor interface that PlaywrightExtractor must implement. |
+
+---
+
+## Constraints
+
+- **Scope boundary:** Implementation belongs in the OpenBrain repo. SourceBase (this repository)
+  is the Discord bot integration layer and produced this document only.
+- **Opt-in only:** Playwright must never run unless explicitly enabled; do not change default
+  behaviour of `ob add`.
+- **No secrets in telemetry:** Telemetry fields must not include URL content, page text, cookies,
+  auth tokens, or any user-identifiable data.
+- **Optional dependency:** Playwright is a heavy runtime dependency. Declare it as optional/peer
+  so existing installs are not forced to download browser binaries.
+- **Record/playback for public CI:** Real browser launches must be gated; fixture-based tests must
+  be the default.
+
+---
+
+## Open Questions
+
+1. Should the minimum-content-length threshold be per-domain (allow-list) or global? Global is
+   simpler; per-domain gives more control.
+2. Which Playwright browser engine should be the default (`chromium`, `firefox`, `webkit`)?
+   Recommend `chromium` for widest compatibility, but make it configurable.
+3. Should the fallback be retried on timeout, or fail immediately? Recommend fail-immediately on
+   first timeout to keep latency predictable.
+4. Is there an existing structured-log / telemetry abstraction in OpenBrain to hook into, or does
+   the engineer need to introduce one?
+
+---
+
+## Acceptance Checklist (for PM / Engineering reviewers)
+
+- [ ] Feature request document reviewed and approved
+- [ ] Child work items created in OpenBrain: PlaywrightExtractor, CI test harness, telemetry
+- [ ] Configuration flag name agreed (e.g. `ingestion.playwrightFallback`)
+- [ ] Extractor interface reviewed; PlaywrightExtractor signature confirmed
+- [ ] Record/playback fixture strategy confirmed or alternative agreed
+- [ ] Self-hosted runner label and CI gate condition documented
+- [ ] Optional-dependency approach confirmed (peer dep vs. dynamic import guard)
+- [ ] Telemetry field list reviewed; privacy sign-off obtained
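To make the configuration items on the checklist concrete, the opt-in flag, the length threshold, and the navigation timeout from the acceptance criteria could be resolved in one place. This is a sketch under stated assumptions, not OpenBrain's config loader: `resolveFallbackSettings`, `RawIngestionConfig`, and the `playwrightTimeoutMs` key are hypothetical, the `200` default is only the example value from this request (the real default is still to be determined), and letting the environment variable enable the fallback regardless of the config file is one possible precedence rule among several.

```typescript
// Hypothetical shape of the relevant part of the OpenBrain config file.
interface RawIngestionConfig {
  playwrightFallback?: boolean;
  minContentLength?: number;
  playwrightTimeoutMs?: number; // assumed key name for the navigation timeout
}

interface FallbackSettings {
  enabled: boolean;
  minContentLength: number;
  navigationTimeoutMs: number;
}

function resolveFallbackSettings(
  config: RawIngestionConfig,
  env: Record<string, string | undefined>,
): FallbackSettings {
  return {
    // Disabled unless the config flag or OB_PLAYWRIGHT_FALLBACK=1 explicitly opts in.
    enabled: config.playwrightFallback === true || env["OB_PLAYWRIGHT_FALLBACK"] === "1",
    // Placeholder default taken from the example in the trigger-condition criterion.
    minContentLength: config.minContentLength ?? 200,
    // 30 s default, as stated in the timeout criterion.
    navigationTimeoutMs: config.playwrightTimeoutMs ?? 30000,
  };
}
```

With an empty config and an empty environment the result has `enabled: false`, which keeps the default behaviour of `ob add` unchanged.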