Limit full text scraper to download only new or updated articles#3631
Merged
Conversation
Pull request overview
Implements a mechanism to prevent the full-text scraper from re-downloading articles that are already known (or unchanged), by passing a GUID-hash→date lookup into the feed fetch path.
Changes:
- Extend the `IFeedFetcher::fetch()` call chain to accept a known-item GUID hash list.
- Add a DB mapper method to fetch existing item GUID hashes and stored dates per feed, and use it to skip scraping unchanged items.
- Update/add unit tests and changelog entry for the new behavior.
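The call chain described above can be pictured with a short sketch. This is Python pseudocode, not the project's PHP, and the names (`update_feed`, `fake_fetch`) are hypothetical stand-ins for the service → dispatcher → fetcher chain: the service builds the known-item map once and hands it down, and the fetcher keeps only items that are new or have a newer date.

```python
from typing import Callable

def update_feed(feed_id: int,
                load_known: Callable[[int], dict[str, int]],
                fetch: Callable[[str, dict[str, int]], list[str]]) -> list[str]:
    """Service layer (hypothetical): build the known-item map, then pass it down."""
    known = load_known(feed_id)          # guid_hash -> stored pub_date
    return fetch(f"https://example.org/feed/{feed_id}", known)

def fake_fetch(url: str, known: dict[str, int]) -> list[str]:
    """Fetcher layer (hypothetical): only new or updated items survive."""
    incoming = {"abc": 100, "xyz": 200}  # guid_hash -> pub_date from the feed
    return [h for h, d in incoming.items()
            if h not in known or d > known[h]]

print(update_feed(1, lambda _fid: {"abc": 100}, fake_fetch))  # ['xyz']
```

With `abc` already known at the same date, only the unseen `xyz` is handed to the scraper.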
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tests/Unit/Service/FeedServiceTest.php | Updates expectations for the new fetch() signature. |
| tests/Unit/Fetcher/FeedFetcherTest.php | Updates existing tests for the new parameter and adds new full-text skip/scrape test cases. |
| lib/Service/FeedServiceV2.php | Builds and passes known GUID-hash list to fetcher during updates; updates create() calls for new signature. |
| lib/Fetcher/IFeedFetcher.php | Extends interface signature/docs with $guidHashList. |
| lib/Fetcher/Fetcher.php | Propagates the new argument through the fetcher dispatcher. |
| lib/Fetcher/FeedFetcher.php | Adds skip logic for full-text scraping based on known GUID hashes and timestamps. |
| lib/Db/FeedMapperV2.php | Adds query to fetch guid_hash and pub_date for a feed’s existing items. |
| CHANGELOG.md | Adds unreleased changelog entry. |
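The mapper addition in `lib/Db/FeedMapperV2.php` boils down to one lookup: select each existing item's GUID hash and stored date for a feed and return them as a map. A minimal sketch, using Python's `sqlite3` with a hypothetical table and column names rather than the app's real schema:

```python
import sqlite3

# In-memory stand-in for the items table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (feed_id INT, guid_hash TEXT, pub_date INT)")
conn.executemany("INSERT INTO items VALUES (?, ?, ?)",
                 [(1, "abc", 1700000000), (1, "def", 1700003600),
                  (2, "ghi", 1700000000)])

def known_item_dates(feed_id: int) -> dict[str, int]:
    """Map guid_hash -> stored pub_date for one feed, as the new mapper method does."""
    rows = conn.execute(
        "SELECT guid_hash, pub_date FROM items WHERE feed_id = ?", (feed_id,))
    return dict(rows.fetchall())

print(known_item_dates(1))  # both rows for feed 1, keyed by guid_hash
```

The update job runs this once per feed, so the skip decision costs a dictionary lookup per item rather than a query per item.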
Signed-off-by: Wolfgang <github@linux-dude.de>
Force-pushed from 8a17f90 to 0718360
Grotax approved these changes (Mar 22, 2026)
Grotax added a commit that referenced this pull request (Mar 22, 2026):

Changed
- Refactor full text scraper to use guzzle http client and its admin settings `Maximum redirects` and `Feed fetcher timeout` (#3630)
- Limit full text scraper to download only new or updated articles (#3631)

Fixed
- Some feeds are no longer being updated because the job is terminating due to incorrect encoding handling in the full text scraper (#3630)

Signed-off-by: Benjamin Brahmer <info@b-brahmer.de>
Summary
This is the second of a series of three pull requests to improve the full-text download feature.
The current implementation is unfair to content providers and gives the app a bad reputation as an aggressive crawler.
Although the feed itself is only downloaded when necessary (depending on the settings), every referenced web page is then downloaded again and again.
To prevent this, I have implemented a mechanism that ensures only new or updated content is downloaded, which also saves memory during the fetch.
This involves creating a list of known GUID hashes and their corresponding publication dates, which is used for comparison during the update job.
Some feeds do not provide publication dates for their items; in that case the current date is stored, so these items are never considered updated and are downloaded only once.
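The comparison described above can be condensed into a single predicate. This is a minimal Python sketch with hypothetical names (the actual logic lives in the PHP `FeedFetcher`): an item is scraped only if its GUID hash is unknown or its publication date is newer than the stored one.

```python
from datetime import datetime, timezone

def should_scrape(guid_hash: str, pub_date: datetime,
                  known_items: dict[str, datetime]) -> bool:
    """Scrape the linked page only if the item is new or its date moved forward."""
    known_date = known_items.get(guid_hash)
    if known_date is None:
        return True                # never seen before: download
    return pub_date > known_date   # re-download only when updated

known = {"abc123": datetime(2026, 3, 1, tzinfo=timezone.utc)}
print(should_scrape("abc123", datetime(2026, 3, 1, tzinfo=timezone.utc), known))  # False: unchanged
print(should_scrape("abc123", datetime(2026, 3, 2, tzinfo=timezone.utc), known))  # True: updated
print(should_scrape("def456", datetime(2026, 3, 1, tzinfo=timezone.utc), known))  # True: new item
```

Note how the dateless-feed behavior falls out of this rule: an item whose stored date equals its (synthetic) current-date stamp never compares as newer, so it is fetched exactly once.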
Here is some example debug output showing one hourly fetch during my tests; without this change, the heise feed would fetch 160 websites per hour.
Another way to avoid downloading unnecessary articles is coming with the third PR, which I am currently still testing. It allows individual articles to be updated on demand from the frontend via the backend controller, with the advantages that brings (readability, sanitization, etc.).
Checklist