
Limit full text scraper to download only new or updated articles #3631

Merged
Grotax merged 1 commit into nextcloud:master from wofferl:feat_limit_scraper
Mar 22, 2026

Conversation

Collaborator

@wofferl wofferl commented Mar 18, 2026

Summary

This is the second of a series of three pull requests to improve the full-text download feature.

The current implementation is unfair to content providers and gives the app a bad reputation as an aggressive crawler.

Although the feed itself is only downloaded when necessary (depending on the settings), all referenced web pages are then re-downloaded on every update.

To prevent this, I have implemented a mechanism that ensures only new or updated content is downloaded, which also saves memory during the fetch.

This involves creating a list of known GUID hashes and their corresponding publication dates, which is used for comparison during the update job.
Since some feeds do not provide publication dates for their items (the fetch time is stored instead in that case), those items are downloaded only once and never treated as updated.
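The skip decision described above can be sketched as follows. This is an illustrative Python sketch of the comparison logic, not the app's actual PHP implementation; the function name and data shapes are assumptions.

```python
def filter_items_to_scrape(fetched_items, known_items):
    """Decide which fetched items need a full-text download.

    fetched_items: list of (guid_hash, pub_date) tuples from the feed,
                   where pub_date is an epoch timestamp or None.
    known_items:   dict mapping guid_hash -> stored pub_date.
    Returns the guid hashes that are new or updated.
    """
    to_scrape = []
    for guid_hash, pub_date in fetched_items:
        if pub_date is None:
            # The feed provides no publication date: the fetch time was
            # stored on first sight, so a known item is never re-scraped.
            if guid_hash in known_items:
                continue
        elif guid_hash in known_items and pub_date <= known_items[guid_hash]:
            continue  # already stored and not updated since
        to_scrape.append(guid_hash)
    return to_scrape
```

Only the items returned here would then be passed on to the full-text scraper; everything else is counted as skipped.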

Here is some example debug output showing one hourly fetch during my tests; the heise feed alone would currently fetch 160 web pages per hour.

heise online News added: 3, skipped: 157, error: 0
Items: 3 Memory used: 4 MB

tagesschau.de - Die Nachrichten der ARD added: 3, skipped: 37, error: 0
Items: 3 Memory used: 1 MB

taz.de - Artikel aus der Onlineausgabe added: 10, skipped: 10, error: 0
Items: 10 Memory used: 1 MB

Another way to avoid downloading unnecessary articles is coming with the third PR, which I am currently still testing. It allows individual articles to be updated on demand from the frontend via the backend controller, with the usual advantages (readability, sanitizing, etc.).

@wofferl wofferl added the API Impact API/Backend code label Mar 18, 2026
@wofferl wofferl requested a review from Copilot March 18, 2026 18:12
Contributor

Copilot AI left a comment


Pull request overview

Implements a mechanism to prevent the full-text scraper from re-downloading articles that are already known (or unchanged), by passing a GUID-hash→date lookup into the feed fetch path.

Changes:

  • Extend IFeedFetcher::fetch() call chain to accept a known-item GUID hash list.
  • Add DB mapper method to fetch existing item GUID hashes + stored dates per feed, and use it to skip scraping unchanged items.
  • Update/add unit tests and changelog entry for the new behavior.
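The extended call chain in the bullets above can be sketched like this. All names here (the mapper method, fetcher signature, and feed attributes) are illustrative stand-ins for the app's PHP API, not its actual code.

```python
def update_feed(feed, mapper, fetcher):
    """Sketch of the update path: build the known-item lookup once per
    feed and hand it to the fetcher so unchanged articles are skipped."""
    # e.g. {"d41d8cd9...": 1710780000, ...} built from the stored
    # guid_hash / pub_date columns of the feed's existing items
    guid_hash_list = mapper.read_item_guid_hashes_and_dates(feed.id)
    # fetch() gained this extra argument; passing an empty mapping
    # would preserve the old behavior of scraping every referenced page.
    return fetcher.fetch(
        feed.url,
        full_text=feed.full_text_enabled,
        guid_hash_list=guid_hash_list,
    )
```

Building the lookup once per feed keeps the cost to a single query, instead of checking each item against the database individually.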

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/Unit/Service/FeedServiceTest.php Updates expectations for the new fetch() signature.
tests/Unit/Fetcher/FeedFetcherTest.php Updates existing tests for the new parameter and adds new full-text skip/scrape test cases.
lib/Service/FeedServiceV2.php Builds and passes known GUID-hash list to fetcher during updates; updates create() calls for new signature.
lib/Fetcher/IFeedFetcher.php Extends interface signature/docs with $guidHashList.
lib/Fetcher/Fetcher.php Propagates the new argument through the fetcher dispatcher.
lib/Fetcher/FeedFetcher.php Adds skip logic for full-text scraping based on known GUID hashes and timestamps.
lib/Db/FeedMapperV2.php Adds query to fetch guid_hash and pub_date for a feed’s existing items.
CHANGELOG.md Adds unreleased changelog entry.
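The new mapper query described for lib/Db/FeedMapperV2.php boils down to selecting two columns per feed. A minimal sketch using sqlite3, with assumed table and column names (the app itself uses Nextcloud's query builder in PHP):

```python
import sqlite3

def read_guid_hashes_and_dates(conn, feed_id):
    """Return one entry per stored item of the feed:
    guid_hash mapped to its stored pub_date."""
    rows = conn.execute(
        "SELECT guid_hash, pub_date FROM items WHERE feed_id = ?",
        (feed_id,),
    )
    return {guid_hash: pub_date for guid_hash, pub_date in rows}
```

Fetching only these two small columns, rather than full item rows, is what keeps the lookup cheap enough to build on every update.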


Comment thread lib/Fetcher/FeedFetcher.php Outdated
Comment thread lib/Service/FeedServiceV2.php
Comment thread lib/Fetcher/FeedFetcher.php Outdated
Comment thread tests/Unit/Fetcher/FeedFetcherTest.php
Comment thread lib/Db/FeedMapperV2.php
Signed-off-by: Wolfgang <github@linux-dude.de>
@wofferl wofferl force-pushed the feat_limit_scraper branch from 8a17f90 to 0718360 Compare March 19, 2026 17:58
@wofferl wofferl marked this pull request as ready for review March 19, 2026 18:17
@Grotax Grotax merged commit cc54f74 into nextcloud:master Mar 22, 2026
31 checks passed
Grotax added a commit that referenced this pull request Mar 22, 2026
Changed
- Refactor full text scraper to use guzzle http client and its admin settings `Maximum redirects` and `Feed fetcher timeout` (#3630)
- Limit full text scraper to download only new or updated articles (#3631)

Fixed
- Some feeds are no longer being updated because the job is terminating due to incorrect encoding handling in the full text scraper (#3630)

Signed-off-by: Benjamin Brahmer <info@b-brahmer.de>
@Grotax Grotax mentioned this pull request Mar 22, 2026
@wofferl wofferl deleted the feat_limit_scraper branch April 30, 2026 22:09

Labels

3. to review, API Impact API/Backend code, enhancement

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Should not fetch every single article in feed every hour

3 participants