
Limit full text scraper to download only new or updated articles #3631

Merged
Grotax merged 1 commit into nextcloud:master from wofferl:feat_limit_scraper
Mar 22, 2026

Conversation

Collaborator

@wofferl wofferl commented Mar 18, 2026

Summary

This is the second of a series of three pull requests to improve the full-text download feature.

The current implementation is unfair to content providers and gives the app a bad reputation as an aggressive crawler.

Although the feed itself is only downloaded when necessary (depending on the settings), all referenced web pages are then re-downloaded on every update.

To prevent this, I have implemented a mechanism that ensures only new or updated content is downloaded, which also saves memory during the fetch.

This involves creating a list of known GUID hashes and their corresponding publication dates, which is used for comparison during the update job.
Since some feeds do not provide publication dates for their items (the fetch time is stored instead in that case), those items are downloaded only once and never treated as updated.
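The skip decision described above can be sketched as follows. This is an illustrative Python sketch of the comparison logic, not the app's actual PHP implementation; the function name and data shapes are assumptions.

```python
def filter_items_to_scrape(fetched_items, known_items):
    """Decide which fetched items need a full-text download.

    fetched_items: list of (guid_hash, pub_date) tuples from the feed,
                   where pub_date is an epoch timestamp or None.
    known_items:   dict mapping guid_hash -> stored pub_date.
    Returns the guid hashes that are new or updated.
    """
    to_scrape = []
    for guid_hash, pub_date in fetched_items:
        if pub_date is None:
            # The feed provides no publication date: the fetch time was
            # stored on first sight, so a known item is never re-scraped.
            if guid_hash in known_items:
                continue
        elif guid_hash in known_items and pub_date <= known_items[guid_hash]:
            continue  # already stored and not updated since
        to_scrape.append(guid_hash)
    return to_scrape
```

Only the items returned here would then be passed on to the full-text scraper; everything else is counted as skipped.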

Here is some example debug output showing one hourly fetch during my tests; the heise feed alone would currently fetch 160 web pages per hour.

heise online News added: 3, skipped: 157, error: 0
Items: 3 Memory used: 4 MB

tagesschau.de - Die Nachrichten der ARD added: 3, skipped: 37, error: 0
Items: 3 Memory used: 1 MB

taz.de - Artikel aus der Onlineausgabe added: 10, skipped: 10, error: 0
Items: 10 Memory used: 1 MB

Another way to avoid downloading unnecessary articles is coming with the third PR, which I am currently still testing. It allows individual articles to be updated on demand from the frontend via the backend controller, with the usual advantages (readability, sanitizing, etc.).

@wofferl wofferl added the API Impact API/Backend code label Mar 18, 2026
@wofferl wofferl requested a review from Copilot March 18, 2026 18:12
Contributor

Copilot AI left a comment


Pull request overview

Implements a mechanism to prevent the full-text scraper from re-downloading articles that are already known (or unchanged), by passing a GUID-hash→date lookup into the feed fetch path.

Changes:

  • Extend IFeedFetcher::fetch() call chain to accept a known-item GUID hash list.
  • Add DB mapper method to fetch existing item GUID hashes + stored dates per feed, and use it to skip scraping unchanged items.
  • Update/add unit tests and changelog entry for the new behavior.
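The extended call chain in the bullets above can be sketched like this. All names here (the mapper method, fetcher signature, and feed attributes) are illustrative stand-ins for the app's PHP API, not its actual code.

```python
def update_feed(feed, mapper, fetcher):
    """Sketch of the update path: build the known-item lookup once per
    feed and hand it to the fetcher so unchanged articles are skipped."""
    # e.g. {"d41d8cd9...": 1710780000, ...} built from the stored
    # guid_hash / pub_date columns of the feed's existing items
    guid_hash_list = mapper.read_item_guid_hashes_and_dates(feed.id)
    # fetch() gained this extra argument; passing an empty mapping
    # would preserve the old behavior of scraping every referenced page.
    return fetcher.fetch(
        feed.url,
        full_text=feed.full_text_enabled,
        guid_hash_list=guid_hash_list,
    )
```

Building the lookup once per feed keeps the cost to a single query, instead of checking each item against the database individually.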

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/Unit/Service/FeedServiceTest.php Updates expectations for the new fetch() signature.
tests/Unit/Fetcher/FeedFetcherTest.php Updates existing tests for the new parameter and adds new full-text skip/scrape test cases.
lib/Service/FeedServiceV2.php Builds and passes known GUID-hash list to fetcher during updates; updates create() calls for new signature.
lib/Fetcher/IFeedFetcher.php Extends interface signature/docs with $guidHashList.
lib/Fetcher/Fetcher.php Propagates the new argument through the fetcher dispatcher.
lib/Fetcher/FeedFetcher.php Adds skip logic for full-text scraping based on known GUID hashes and timestamps.
lib/Db/FeedMapperV2.php Adds query to fetch guid_hash and pub_date for a feed’s existing items.
CHANGELOG.md Adds unreleased changelog entry.
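The new mapper query described for lib/Db/FeedMapperV2.php boils down to selecting two columns per feed. A minimal sketch using sqlite3, with assumed table and column names (the app itself uses Nextcloud's query builder in PHP):

```python
import sqlite3

def read_guid_hashes_and_dates(conn, feed_id):
    """Return one entry per stored item of the feed:
    guid_hash mapped to its stored pub_date."""
    rows = conn.execute(
        "SELECT guid_hash, pub_date FROM items WHERE feed_id = ?",
        (feed_id,),
    )
    return {guid_hash: pub_date for guid_hash, pub_date in rows}
```

Fetching only these two small columns, rather than full item rows, is what keeps the lookup cheap enough to build on every update.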


Comment thread lib/Fetcher/FeedFetcher.php Outdated
Comment thread lib/Service/FeedServiceV2.php
Comment thread lib/Fetcher/FeedFetcher.php Outdated
Comment thread tests/Unit/Fetcher/FeedFetcherTest.php
Comment thread lib/Db/FeedMapperV2.php
Signed-off-by: Wolfgang <github@linux-dude.de>
@wofferl wofferl force-pushed the feat_limit_scraper branch from 8a17f90 to 0718360 Compare March 19, 2026 17:58
@wofferl wofferl marked this pull request as ready for review March 19, 2026 18:17
@Grotax Grotax merged commit cc54f74 into nextcloud:master Mar 22, 2026
31 checks passed
Grotax added a commit that referenced this pull request Mar 22, 2026
Changed
- Refactor full text scraper to use guzzle http client and its admin settings `Maximum redirects` and `Feed fetcher timeout` (#3630)
- Limit full text scraper to download only new or updated articles (#3631)

Fixed
- Some feeds are no longer being updated because the job is terminating due to incorrect encoding handling in the full text scraper (#3630)

Signed-off-by: Benjamin Brahmer <info@b-brahmer.de>
@Grotax Grotax mentioned this pull request Mar 22, 2026
@wofferl wofferl deleted the feat_limit_scraper branch April 30, 2026 22:09

Labels

3. to review, API Impact API/Backend code, enhancement

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Should not fetch every single article in feed every hour

3 participants