M1-8: Backfill & Validation (re-crawl top 20 worst sources) #178

@jonesrussell

Context

After deploying M1 components, we need to validate the improvements against real production data.

Deliverables

  • Re-crawl top 20 worst-performing sources with new pipeline
  • Compare word_count before/after
  • Validate template detection accuracy
  • Tune heuristics based on results
  • Document results in milestone completion report
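The before/after comparison could be sketched roughly as below. This is only an illustration: `CrawlSnapshot`, its fields, and `compare_word_counts` are hypothetical names, and it assumes each document can be matched across the two crawls by URL. The real pipeline's storage and schema are not shown in this issue.

```python
from dataclasses import dataclass


@dataclass
class CrawlSnapshot:
    # Hypothetical record shape; the real raw_content schema may differ.
    source: str
    url: str
    word_count: int


def compare_word_counts(before, after):
    """Summarise word_count changes per source for URLs present in both crawls."""
    before_by_url = {snap.url: snap for snap in before}
    summary = {}
    for snap in after:
        old = before_by_url.get(snap.url)
        if old is None:
            # URL was not in the earlier crawl, so there is nothing to compare.
            continue
        stats = summary.setdefault(
            snap.source, {"docs": 0, "improved": 0, "delta": 0}
        )
        stats["docs"] += 1
        stats["delta"] += snap.word_count - old.word_count
        if snap.word_count > old.word_count:
            stats["improved"] += 1
    return summary
```

Sorting the per-source summaries by `delta` would show which of the 20 sources gained the most from the new pipeline and which still need heuristic tuning.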

Depends On

Tasks M1-1 through M1-5 deployed to production.

Success Criteria

  • Word count > 0 for >= 60% of raw_content docs (up from current 25%)
  • Postmedia chain: word count > 200 for >= 80% of article pages
  • Template detection accuracy >= 90% for known CMS types
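One possible way to check these thresholds mechanically is sketched below. The function names and input shapes are assumptions made for illustration: word counts are passed as plain lists of integers, and template results as `(expected_cms, detected_cms)` pairs from a hand-labelled sample. Only the thresholds themselves come from the criteria above.

```python
def fraction(items, predicate):
    """Share of items satisfying predicate; 0.0 for an empty input."""
    items = list(items)
    if not items:
        return 0.0
    return sum(1 for item in items if predicate(item)) / len(items)


def check_success_criteria(raw_word_counts, postmedia_word_counts, template_results):
    """Return {criterion: (measured_fraction, passed)} for the three criteria.

    template_results is a list of (expected_cms, detected_cms) pairs
    (hypothetical shape, assumed to come from a labelled sample).
    """
    nonzero = fraction(raw_word_counts, lambda wc: wc > 0)
    postmedia = fraction(postmedia_word_counts, lambda wc: wc > 200)
    accuracy = fraction(template_results, lambda pair: pair[0] == pair[1])
    return {
        "word_count_nonzero": (nonzero, nonzero >= 0.60),
        "postmedia_over_200": (postmedia, postmedia >= 0.80),
        "template_accuracy": (accuracy, accuracy >= 0.90),
    }
```

The measured fractions are returned alongside the pass/fail flags so they can be copied directly into the milestone completion report.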

Size: Medium (2-3 hours)
