Feature/distill token optimization#3181
Conversation
a3cff79 to
98c0dd3
Compare
|
@claude review |
|
Claude finished @cliffhall's task in 1m 33s —— View job PR Review: Feature/distill token optimization
SummaryThe feature idea — reducing token usage from fetched web content — is a reasonable one, but this implementation has several significant technical problems that should be addressed before merging. Issues1. Redundant / counterproductive architecture (Major)The The 99.4% token reduction on MCP Docs (2,154 → 13 tokens) is a red flag here — 13 tokens is nearly empty output, suggesting core content is being stripped alongside boilerplate. The 2. ReDoS risk in the ad-pattern regex (Major / Security)r'<[^>]+(class|id)=["\'][^"\']*\b(ad|ads|...)\b[^"\']*["\'][^>]*>[\s\S]*?</[^>]+>'
3. Nested-tag regex is inherently broken (Major)The per-tag regex: html = re.sub(rf'<{tag}[^>]*>[\s\S]*?</{tag}>', '', html, flags=re.IGNORECASE)Fails silently on:
HTML cannot be reliably parsed with regex. This should use 4. No tests (Major)The repository already has a test suite for the fetch server ( 5. Undefined
|
ad34f31 to
cebaf79
Compare
Add distill parameter to aggressively clean HTML before processing: - Remove scripts, styles, navigation, headers, footers - Remove ads, sidebars, popups, cookie banners - Remove social widgets and non-content elements - Normalize whitespace Typical token reduction: 60-85% This is an opt-in feature (distill=false by default) to maintain backward compatibility.
a1ccb64 to
39e37b9
Compare
Addressed all review feedbackThanks for the thorough review! I've reworked the implementation to address every issue raised: Issue 1: Redundant / counterproductive architecture (Major) ✅Fixed.
This avoids interfering with Readability's content-detection signals and fixes the 13-token empty output problem. Issue 2: ReDoS risk in ad-pattern regex (Major / Security) ✅Fixed. Replaced all regex-based HTML parsing with BeautifulSoup ( Issue 3: Nested-tag regex is inherently broken (Major) ✅Fixed. BeautifulSoup handles nested tags, self-closing tags, and attributes containing Issue 4: No tests (Major) ✅Added. 21 pytest tests in
Issue 5: Undefined raw=True + distill=True interaction (Minor) ✅Documented. Added "Has no effect when
Issue 6: Performance — multiple regex passes (Minor) ✅Fixed. Single |
Description
Add optional
distillparameter to the fetch server that aggressively cleans HTML content to minimize token usage. When enabled, removes scripts, styles, navigation, headers, footers, ads, and other non-essential elements before conversion to markdown. Achieves 72.8% average token reduction across real-world tests.Server Details
distillparameter tofetchtool)Motivation and Context
LLM token costs are a significant operational expense. Current
web-fetchreturns full HTML including navigation menus, ads, scripts, and UI clutter - wasting tokens on non-content elements.Test Results:
How Has This Been Tested?
distillparameter work unchangedBreaking Changes
None. The
distillparameter defaults toFalse, maintaining full backward compatibility.Types of changes
Checklist
Additional context
No new dependencies required - uses existing
readabilipyfor content extraction.