3 changes: 2 additions & 1 deletion .worklog/config.yaml
@@ -1,5 +1,6 @@
projectName: Worklog
prefix: WL
autoExport: true
autoSync: false
githubRepo: TheWizardsCode/ContextHub
githubLabelPrefix: 'wl:'
githubImportCreateNew: true
229 changes: 229 additions & 0 deletions docs/feature-requests/openbrain-playwright-fallback-retrieval.md
@@ -0,0 +1,229 @@
# Feature Request: OpenBrain Playwright Fallback Retrieval

**Work item:** OB-0MNHT5HTC0070EL7 *(OpenBrain project; tracked in this repo as GitHub issue #5)*
**Stage:** intake_complete
**Prepared by:** SourceBase (Discord bot integration layer)
**Handoff target:** OpenBrain PM / Engineering

---

## Problem Statement

Some web pages render their primary content with client-side JavaScript. The current OpenBrain
retrieval path (fast HTML extraction) fails to return usable content for these pages, causing
`ob add <url>` to ingest an empty or near-empty document. A fallback that uses a headless browser
(Playwright) is needed so OpenBrain can ingest JavaScript-heavy pages reliably when the primary
extractor returns insufficient content.

---

## Users and User Stories

### Discord community operators
> As a community operator, when someone posts a JS-heavy article in our Discord I want the link
> ingested by OpenBrain so the knowledge is searchable — without needing to manually pre-render
> the page.

### Automation authors / operators running `ob add`
> As an automation author I want a configurable opt-in fallback so I can control the additional
> resource cost and CI behaviour that a headless browser introduces.

### OpenBrain engineers implementing the fallback
> As an engineer I need a clear set of technical acceptance criteria and test strategies so I can
> implement Playwright retrieval safely and verify it in CI without requiring a real browser on
> every run.

---

## Technical Acceptance Criteria

1. **PlaywrightExtractor class** — A new extractor (e.g. `src/lib/ingestion/extractor-playwright.ts`)
implements the same interface as the existing extractor so it can be swapped in without changes
to the ingestion pipeline (see `src/lib/ingestion/service.ts` and `src/lib/ingestion/extractor.ts`).

2. **Opt-in configuration flag** — Playwright fallback is disabled by default. It is enabled via an
explicit configuration flag (e.g. `ingestion.playwrightFallback: true` in the OpenBrain config
file or `OB_PLAYWRIGHT_FALLBACK=1` environment variable). Running `ob add` without the flag must
not launch a browser process.

3. **Trigger condition** — The fallback fires only when the primary extractor returns a document
whose extracted-text length is below a configurable threshold (e.g. `minContentLength: 200`
characters). The default value is to be determined during implementation.

4. **Content compatibility** — The HTML/text produced by PlaywrightExtractor must be parseable by
the same downstream ingestion pipeline that processes output from the existing extractor (entry
point: `src/cli/commands/add.ts`). No structural changes to the ingestion pipeline are required.

5. **Graceful degradation** — If Playwright is not installed or fails to launch, the ingestion run
logs a warning and completes with whatever content the primary extractor returned (empty or
partial), rather than throwing an unhandled error.

6. **No credential leakage** — The Playwright session must not persist cookies, local storage, or
auth tokens between runs. Each retrieval uses a fresh browser context.

7. **Timeout** — Browser navigation has an explicit timeout (default: 30 s, configurable). A
timeout is treated the same as a Playwright launch failure (warn + continue).

8. **Telemetry emitted** — On every fallback invocation the implementation emits a structured log
entry (see Telemetry section).
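
Criteria 2, 3, and 7 can be sketched as a small, pure configuration resolver. This is a minimal sketch rather than OpenBrain's actual config code: the `FallbackConfig` shape, the function names, and the rule that `OB_PLAYWRIGHT_FALLBACK=1` overrides the config file are all assumptions, and the `200` default is a placeholder until the real value is chosen.

```typescript
// Hypothetical config shape; field names follow the examples in the criteria above.
interface FallbackConfig {
  playwrightFallback: boolean;
  minContentLength: number;
  navigationTimeoutMs: number;
}

const DEFAULTS: FallbackConfig = {
  playwrightFallback: false, // criterion 2: disabled unless explicitly enabled
  minContentLength: 200, // placeholder; the real default is still to be decided
  navigationTimeoutMs: 30_000, // criterion 7: 30 s default
};

type Env = Record<string, string | undefined>;

// Merge file config over defaults, then let the environment variable enable
// the fallback (assumed precedence: env can turn it on, never off).
function resolveFallbackConfig(fileConfig: Partial<FallbackConfig>, env: Env): FallbackConfig {
  const merged: FallbackConfig = { ...DEFAULTS, ...fileConfig };
  if (env["OB_PLAYWRIGHT_FALLBACK"] === "1") {
    merged.playwrightFallback = true;
  }
  return merged;
}

// Criterion 3: fire only when enabled AND the primary text is below the threshold.
function shouldUseFallback(config: FallbackConfig, primaryText: string): boolean {
  return config.playwrightFallback && primaryText.length < config.minContentLength;
}
```

Keeping the decision in a pure function means the "must not launch a browser without the flag" rule can be unit-tested without Playwright installed.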

---

## CI / Testing Strategy

### Guiding principle
Playwright introduces a real browser runtime that is impractical on every public CI run. The
testing strategy separates fast unit tests (always run) from integration tests (opt-in or
gated).

### Record / playback fixtures (recommended default)
1. Record a set of HTTP interaction fixtures (e.g. using Playwright's network interception or a
companion HTTP mock server such as `msw` / `nock`) covering:
- A JS-heavy page that the primary extractor would return empty content for.
- A page that the primary extractor handles correctly (fallback must not fire).
- A page whose navigation exceeds the Playwright timeout.
- A page behind a redirect chain.
2. Store fixtures in `test/fixtures/playwright-fallback/`.
3. Unit / integration tests use the fixtures; real browser is never launched in public CI.

### Mock / stub option (minimal)
An alternative for projects that cannot store HTTP fixtures: stub `PlaywrightExtractor.extract(url)`
at the module boundary and assert on:
- The correct call sequencing (primary extractor first, then fallback).
- The telemetry payload emitted.
- Graceful-degradation paths (install error, timeout).
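
A stub-based test along these lines might look like the sketch below. It uses no test framework so that it stays self-contained. `Extractor`, `ingestWithFallback`, and `stubExtractor` are hypothetical names for illustration; the real interface is whatever `src/lib/ingestion/extractor.ts` defines.

```typescript
// Hypothetical minimal extractor interface, assumed for this sketch only.
interface ExtractResult { text: string }
interface Extractor { extract(url: string): Promise<ExtractResult> }

interface Deps {
  primary: Extractor;
  fallback: Extractor;
  minContentLength: number;
  emit: (event: Record<string, unknown>) => void;
}

// Simplified orchestration under test: primary first, fallback only if needed.
async function ingestWithFallback(deps: Deps, url: string): Promise<ExtractResult> {
  const primaryResult = await deps.primary.extract(url);
  if (primaryResult.text.length >= deps.minContentLength) {
    deps.emit({ triggered: false });
    return primaryResult;
  }
  try {
    const fallbackResult = await deps.fallback.extract(url);
    const success = fallbackResult.text.length > 0;
    deps.emit({ triggered: true, success });
    return success ? fallbackResult : primaryResult;
  } catch {
    // Graceful degradation: report the failure and keep the primary content.
    deps.emit({ triggered: true, success: false });
    return primaryResult;
  }
}

// A stub that records each call, then resolves with text or rejects with an error.
function stubExtractor(name: string, calls: string[], outcome: string | Error): Extractor {
  return {
    async extract(): Promise<ExtractResult> {
      calls.push(name);
      if (outcome instanceof Error) throw outcome;
      return { text: outcome };
    },
  };
}

// Scenario: the primary extractor returns nothing and the fallback times out.
async function runTimeoutScenario() {
  const calls: string[] = [];
  const events: Record<string, unknown>[] = [];
  const result = await ingestWithFallback(
    {
      primary: stubExtractor("primary", calls, ""),
      fallback: stubExtractor("fallback", calls, new Error("timeout")),
      minContentLength: 200,
      emit: (event) => { events.push(event); },
    },
    "https://example.test/spa",
  );
  return { calls, events, result };
}
```

The scenario returns the recorded call order, the emitted events, and the final result, which map onto the three assertion targets listed above.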

### Full integration test (opt-in)
Gate behind an environment variable `OB_PLAYWRIGHT_INTEGRATION=1`. When set:
- Actually launch a Chromium browser.
- Fetch one real JS-heavy URL (or a local dev server that serves a SPA).
- Assert that ingested content is non-empty and passes the minimum-length threshold.

Suggested CI gate: run only on `main` branch pushes or nightly schedules; never on pull request
CI from forks to avoid resource/cost issues.
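
One way to express this gate in test setup code is a small predicate over the environment, sketched below. Combining the opt-in variable with the branch and schedule rule in a single function is an assumption, as is the function name. `GITHUB_ACTIONS`, `GITHUB_EVENT_NAME`, and `GITHUB_REF` are the standard variables GitHub Actions sets on its runners.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical helper: decides whether browser-launching tests should run.
function shouldRunBrowserIntegration(env: Env): boolean {
  // The explicit opt-in is always required.
  if (env["OB_PLAYWRIGHT_INTEGRATION"] !== "1") return false;

  // Outside GitHub Actions (for example a local run), the opt-in alone is enough.
  if (env["GITHUB_ACTIONS"] !== "true") return true;

  // In CI, allow only nightly schedules and pushes to main; pull_request
  // events (including those from forks) never qualify.
  const isNightly = env["GITHUB_EVENT_NAME"] === "schedule";
  const isMainPush =
    env["GITHUB_EVENT_NAME"] === "push" && env["GITHUB_REF"] === "refs/heads/main";
  return isNightly || isMainPush;
}
```

A test file could call this once with the process environment and skip the integration suite when it returns `false`.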

### Self-hosted runner consideration
Full integration tests should target a self-hosted runner with Chromium pre-installed, or use the
`microsoft/playwright` Docker image. Document the runner label requirement in the GitHub Actions
workflow file.

### Existing test reference
See the YouTube ingestion fallback (OB-0MNFXR3E4005TGYX) for a prior example of provider-specific
fallback tests and diagnostics.

---

## Telemetry / Diagnostics Requirements

Every Playwright fallback invocation must emit a structured log entry. Telemetry must be
non-sensitive: record only metadata, never user content, URLs, or secrets.

Suggested fields:

| Field | Type | Description |
|---|---|---|
| `event` | string | Fixed value `"playwright_fallback"` |
| `triggered` | boolean | Whether the fallback actually ran (false = trigger condition not met, so Playwright was not launched) |
| `primaryContentLength` | number | Character count returned by primary extractor |
| `fallbackContentLength` | number | Character count returned by PlaywrightExtractor (0 if not run or failed) |
| `durationMs` | number | Wall-clock time of the Playwright fetch in milliseconds |
| `success` | boolean | Whether PlaywrightExtractor returned usable content |
| `errorType` | string \| null | One of: `"launch_failed"`, `"timeout"`, `"navigation_error"`, `null` |
| `provider` | string | Always `"playwright"` for this fallback |

Log destination: existing OpenBrain diagnostics / structured logger. Keep telemetry local by
default; shipping it to any external service must be opt-in.
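
The field table above could be captured as a TypeScript type so that the compiler rejects extra, potentially sensitive fields. The sketch below is illustrative: `PlaywrightFallbackEvent` and `buildFallbackEvent` are hypothetical names, and the rule for deriving `success` is an assumption.

```typescript
type FallbackErrorType = "launch_failed" | "timeout" | "navigation_error" | null;

// Mirrors the suggested field table; there is deliberately no url or text field.
interface PlaywrightFallbackEvent {
  event: "playwright_fallback";
  triggered: boolean;
  primaryContentLength: number;
  fallbackContentLength: number;
  durationMs: number;
  success: boolean;
  errorType: FallbackErrorType;
  provider: "playwright";
}

// Hypothetical builder: accepts only lengths, timings, and an error category,
// so page content and URLs cannot reach the log entry by accident.
function buildFallbackEvent(metrics: {
  triggered: boolean;
  primaryContentLength: number;
  fallbackContentLength?: number;
  durationMs?: number;
  errorType?: FallbackErrorType;
}): PlaywrightFallbackEvent {
  const errorType = metrics.errorType ?? null;
  // When the fallback did not run, its length and duration are reported as 0.
  const fallbackContentLength = metrics.triggered ? metrics.fallbackContentLength ?? 0 : 0;
  const durationMs = metrics.triggered ? metrics.durationMs ?? 0 : 0;
  return {
    event: "playwright_fallback",
    triggered: metrics.triggered,
    primaryContentLength: metrics.primaryContentLength,
    fallbackContentLength,
    durationMs,
    // Assumed rule: success means the fallback ran, did not error, and returned text.
    success: metrics.triggered && errorType === null && fallbackContentLength > 0,
    errorType,
    provider: "playwright",
  };
}
```

A caller would pass the result to whatever structured logger OpenBrain already provides, for example `logger.info(buildFallbackEvent({ triggered: false, primaryContentLength: 1500 }))`, where `logger` stands in for the existing diagnostics API.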

---

## Implementation Sketch

```
src/lib/ingestion/
├── extractor.ts (existing — primary extractor, no changes required)
├── extractor-playwright.ts (new — PlaywrightExtractor, same interface)
└── service.ts (existing — add fallback orchestration logic)

src/cli/commands/add.ts (existing — passes through unchanged)
```

Suggested orchestration in `service.ts`:

```typescript
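// Illustrative fragment from inside the ingestion function in service.ts.
const primaryResult = await primaryExtractor.extract(url);
const primaryContentLength = primaryResult.text.length;

if (config.playwrightFallback && primaryContentLength < config.minContentLength) {
  const startedAt = Date.now();
  // Graceful degradation: a launch failure or timeout yields null instead of throwing.
  const fallbackResult = await playwrightExtractor.extract(url).catch(() => null);
  const fallbackContentLength = fallbackResult?.text.length ?? 0;
  emitTelemetry({
    triggered: true,
    primaryContentLength,
    fallbackContentLength,
    durationMs: Date.now() - startedAt,
    success: fallbackContentLength > 0,
  });
  // Prefer the fallback only when it actually produced content.
  return fallbackResult && fallbackContentLength > 0 ? fallbackResult : primaryResult;
}
emitTelemetry({ triggered: false, primaryContentLength });
return primaryResult;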
```

The above is illustrative. Exact method signatures must match the extractor interface defined in
`src/lib/ingestion/extractor.ts`.

### Dependency management
Add `playwright` (or `@playwright/test`) as an optional peer dependency so users who do not
enable the fallback do not need to install it. Guard the `require`/`import` behind a runtime
check to avoid import errors when the package is absent.
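
The runtime guard, together with the fresh-context and timeout criteria, might be sketched as follows. This is an assumption-laden sketch, not the required implementation: `loadOptionalModule`, `fetchWithPlaywright`, and the result shape are hypothetical, and using `chromium` with `waitUntil: "networkidle"` is only one possible choice. The Playwright calls themselves (`chromium.launch`, `browser.newContext`, `context.newPage`, `page.goto`, `page.innerText`, `errors.TimeoutError`) are part of its public API.

```typescript
type FallbackErrorType = "launch_failed" | "timeout" | "navigation_error" | null;
interface PlaywrightFetchResult { text: string; errorType: FallbackErrorType }

// Import by variable name so the package is not required at build time;
// resolves to null when the optional dependency is not installed.
async function loadOptionalModule(name: string): Promise<any> {
  try {
    return await import(name);
  } catch {
    return null;
  }
}

// moduleName is a parameter only so the missing-package path is easy to test.
async function fetchWithPlaywright(
  url: string,
  timeoutMs = 30_000,
  moduleName = "playwright",
): Promise<PlaywrightFetchResult> {
  const playwright = await loadOptionalModule(moduleName);
  if (!playwright) return { text: "", errorType: "launch_failed" };

  let browser: any;
  try {
    browser = await playwright.chromium.launch();
  } catch {
    return { text: "", errorType: "launch_failed" };
  }

  try {
    // A new context starts with no cookies or storage and is discarded on close.
    const context = await browser.newContext();
    const page = await context.newPage();
    await page.goto(url, { timeout: timeoutMs, waitUntil: "networkidle" });
    return { text: await page.innerText("body"), errorType: null };
  } catch (err) {
    const isTimeout = err instanceof playwright.errors.TimeoutError;
    return { text: "", errorType: isTimeout ? "timeout" : "navigation_error" };
  } finally {
    await browser.close();
  }
}
```

Every failure path returns a value instead of throwing, so the caller can log a warning and continue with the primary extractor's content.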

---

## Related Work and Code Pointers

Implementers should review the following before starting:

| Reference | Relevance |
|---|---|
| OB-0MN9HWGAL001452N — Ingest CLI: file and URL ingestion | Primary ingestion flow. Playwright output must be compatible with this pipeline. |
| OB-0MNFXR3E4005TGYX — Fix YouTube ingestion for ob add | Prior provider-specific fallback: tests, diagnostics, and error handling patterns to reuse. |
| OB-0MN9CZ48N0053L9Q — Create a full PRD for OpenBrain | Product-level guidance: local-first preferences and documented fallback policies. |
| `src/cli/commands/add.ts` | CLI entry point for `ob add`; Playwright output must be usable here. |
| `src/lib/ingestion/service.ts` | Ingestion orchestration; fallback logic belongs here. |
| `src/lib/ingestion/extractor.ts` | Extractor interface that PlaywrightExtractor must implement. |

---

## Constraints

- **Scope boundary:** Implementation belongs in the OpenBrain repo. SourceBase (this repository)
is the Discord bot integration layer and produced this document only.
- **Opt-in only:** Playwright must never run unless explicitly enabled; do not change default
behaviour of `ob add`.
- **No secrets in telemetry:** Telemetry fields must not include URLs, page text, cookies,
auth tokens, or any user-identifiable data.
- **Optional dependency:** Playwright is a heavy runtime dependency. Declare it as optional/peer
so existing installs are not forced to download browser binaries.
- **Record/playback for public CI:** Real browser launches must be gated; fixture-based tests must
be the default.

---

## Open Questions

1. Should the minimum-content-length threshold be per-domain (allow-list) or global? Global is
simpler; per-domain gives more control.
2. Which Playwright browser engine should be the default (`chromium`, `firefox`, `webkit`)?
Recommend `chromium` for widest compatibility, but make it configurable.
3. Should the fallback be retried on timeout, or fail immediately? Recommend fail-immediately on
first timeout to keep latency predictable.
4. Is there an existing structured-log / telemetry abstraction in OpenBrain to hook into, or does
the engineer need to introduce one?

---

## Acceptance Checklist (for PM / Engineering reviewers)

- [ ] Feature request document reviewed and approved
- [ ] Child work items created in OpenBrain: PlaywrightExtractor, CI test harness, telemetry
- [ ] Configuration flag name agreed (e.g. `ingestion.playwrightFallback`)
- [ ] Extractor interface reviewed; PlaywrightExtractor signature confirmed
- [ ] Record/playback fixture strategy confirmed or alternative agreed
- [ ] Self-hosted runner label and CI gate condition documented
- [ ] Optional-dependency approach confirmed (peer dep vs. dynamic import guard)
- [ ] Telemetry field list reviewed; privacy sign-off obtained