refactor(data): Improve caching and trending repo fetching by rainxchzed · Pull Request #107 · OpenHub-Store/GitHub-Store

rainxchzed · 2025-12-29T06:21:47Z

This commit refactors the data layer for fetching and handling cached trending repositories, improving both the client-side parsing and the backend script that generates the data.

Key Changes:

Data Fetching & Parsing:

Introduced CachedGithubRepoSummary and CachedGithubOwner data classes to precisely match the structure of the pre-cached JSON files. This prevents parsing errors if the full GithubRepoSummary model contains fields not present in the cached data.
Added a toGithubRepoSummary() extension function to map the cached data model to the domain model used in the app.
Enhanced logging in CachedTrendingDataSource with more detailed messages for success, failures (404), timeouts, and serialization errors to improve debugging.
Removed the ContentNegotiation plugin from the dedicated HttpClient in CachedTrendingDataSource to handle JSON parsing manually, providing better error handling.
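
The split between a cache-shaped model and the domain model might look roughly like the sketch below. It is a Python analogue for illustration only, not the app's Kotlin code. The class names, the reduced field set, and `parse_cached_repo` are assumptions; the placeholder owner id of 0 and the derived profile URL follow what the review quotes further down.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class CachedOwner:
    # Mirrors only the keys assumed to exist in the cached JSON.
    login: str
    avatarUrl: str


@dataclass(frozen=True)
class CachedRepo:
    fullName: str
    stargazersCount: int
    owner: CachedOwner


@dataclass(frozen=True)
class DomainOwner:
    # The domain model carries fields the cache does not provide.
    id: int
    login: str
    avatar_url: str
    html_url: str


@dataclass(frozen=True)
class DomainRepo:
    full_name: str
    stars: int
    owner: DomainOwner


def parse_cached_repo(raw: dict) -> CachedRepo:
    """Build the cache-shaped model from one decoded JSON object."""
    owner = raw["owner"]
    return CachedRepo(
        fullName=raw["fullName"],
        stargazersCount=raw["stargazersCount"],
        owner=CachedOwner(login=owner["login"], avatarUrl=owner["avatarUrl"]),
    )


def to_domain(cached: CachedRepo) -> DomainRepo:
    """Map the cached model to the domain model, filling the missing fields."""
    return DomainRepo(
        full_name=cached.fullName,
        stars=cached.stargazersCount,
        owner=DomainOwner(
            id=0,  # placeholder: the cache has no owner id
            login=cached.owner.login,
            avatar_url=cached.owner.avatarUrl,
            html_url=f"https://github.com/{cached.owner.login}",
        ),
    )


payload = json.loads(
    '{"fullName": "octo/app", "stargazersCount": 42,'
    ' "owner": {"login": "octo", "avatarUrl": "https://example.com/a.png"}}'
)
repo = to_domain(parse_cached_repo(payload))
```

Because parsing targets the narrow cached model, a missing domain-only field never fails decoding; it is supplied in one place, the mapping function.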

Backend Script (fetch_trending.py):

Implemented a more robust, multi-attempt search strategy to find a sufficient number of relevant repositories.
The script now progressively broadens its search criteria (widening the date range, lowering the star requirement, and eventually dropping topic filters) across multiple attempts if not enough results are found initially.
The desired number of repositories to fetch per platform has been increased from 30 to 80 to provide a richer dataset.
The logic now tracks repositories that have already been checked (in a seen set) to avoid redundant API calls.
The final list of repositories is sorted by star count before being saved.
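
The strategy described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the contents of `fetch_trending.py`: the `PASSES` schedule, `collect_repos`, and the injected `search` and `has_installers` callables are hypothetical names, and the thresholds are invented for the example.

```python
from typing import Callable, Dict, List, Optional, Set

Repo = Dict[str, object]

# Hypothetical schedule: each pass widens the date window and lowers the
# star floor; the last pass also drops the topic filter.
PASSES = [
    {"days": 7, "min_stars": 500, "topic": "android"},
    {"days": 30, "min_stars": 100, "topic": "android"},
    {"days": 90, "min_stars": 50, "topic": "android"},
    {"days": 180, "min_stars": 10, "topic": None},
]


def collect_repos(
    search: Callable[[int, int, Optional[str]], List[Repo]],
    has_installers: Callable[[Repo], bool],
    desired_count: int = 80,
) -> List[Repo]:
    seen: Set[str] = set()
    results: List[Repo] = []
    for params in PASSES:
        if len(results) >= desired_count:
            break  # enough results, no need to broaden further
        for repo in search(params["days"], params["min_stars"], params["topic"]):
            full_name = repo["fullName"]
            if full_name in seen:
                continue  # already checked in an earlier, narrower pass
            seen.add(full_name)  # mark as checked whether or not it qualifies
            if has_installers(repo):
                results.append(repo)
    # Sort by star count before truncating to the desired size.
    results.sort(key=lambda r: r["stargazersCount"], reverse=True)
    return results[:desired_count]


# Tiny in-memory stand-in for the search API.
CATALOG = [
    {"fullName": "a/one", "stargazersCount": 900, "age_days": 3, "topic": "android"},
    {"fullName": "b/two", "stargazersCount": 120, "age_days": 20, "topic": "android"},
    {"fullName": "c/three", "stargazersCount": 15, "age_days": 100, "topic": "desktop"},
]


def fake_search(days, min_stars, topic):
    return [
        r for r in CATALOG
        if r["age_days"] <= days
        and r["stargazersCount"] >= min_stars
        and (topic is None or r["topic"] == topic)
    ]


checked = []


def fake_has_installers(repo):
    checked.append(repo["fullName"])
    return True


found = collect_repos(fake_search, fake_has_installers, desired_count=3)
```

In this run each repository reaches `fake_has_installers` exactly once, even though the broader passes return the earlier matches again.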

CI/CD (fetch-trending-repos.yml):

The cron schedule for the trending repositories job has been changed from every 6 hours to every 12 hours to reduce build frequency.
The Git commit-and-push logic is simplified to use git commit || echo "No changes to commit" to gracefully handle cases where no data has changed, removing the need for a separate check step.

Summary by CodeRabbit

New Features
- Implemented caching mechanism for trending repository data to improve performance and reliability.
- Enhanced data fetching with automatic retry logic and exponential backoff for resilient API interactions.
- Expanded data collection from 30 to 80 repositories per platform for more comprehensive results.
Bug Fixes
- Improved error handling for timeout, serialization, and server errors during repository fetching.
Chores
- Updated automated trending data update schedule from every 6 hours to every 12 hours.
- Updated dependency version for improved compatibility.



coderabbitai · 2025-12-29T06:21:59Z

Walkthrough

The pull request enhances the trending repository data collection system by restructuring the workflow schedule (6 hours to 12 hours), introducing a cached data model layer with explicit JSON parsing, implementing retry-enabled multi-pass search logic in the fetch script, and converting cached models to domain objects in the repository layer.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Workflow Automation `.github/workflows/fetch-trending-repos.yml` | Changed cron schedule from every 6 hours to every 12 hours. Removed explicit change detection and conditional branching; replaced with a streamlined `git pull --rebase`, stage, commit (with no-op fallback), and unconditional push flow. Added pre-commit pull to update local branch before staging. |
| Cached Data Models `composeApp/src/commonMain/kotlin/zed/rainxch/githubstore/feature/home/data/data_source/CachedTrendingDataSource.kt` | Introduced new public API `getCachedTrendingRepos()` and three data classes (`CachedRepoResponse`, `CachedGithubRepoSummary`, `CachedGithubOwner`) to represent cached repository metadata. Added extension function `toGithubRepoSummary()` for model conversion. Switched from ContentNegotiation to manual JSON parsing via `Json.decodeFromString()`. Enhanced logging and error handling with emoji markers and additional context. |
| Repository Mapping `composeApp/src/commonMain/kotlin/zed/rainxch/githubstore/feature/home/data/repository/HomeRepositoryImpl.kt` | Updated cached repository emission to apply `toGithubRepoSummary()` conversion, transforming cached models to domain entities before emitting in `PaginatedRepos`. |
| Trending Data Fetching `scripts/fetch_trending.py` | Rewrote fetch logic with multi-pass search strategy, configurable parameters (days window, stars threshold, topics), and exponential backoff retry handler (`make_request_with_retry`). Implemented deduplication via seen set and tightened candidate filtering (score >= 5). Increased default `desired_count` from 30 to 80. Expanded final result to include complete repository summary metadata (id, name, fullName, owner, stargazersCount, etc.). |
| Dependency Management `scripts/requirements.txt` | Downgraded requests package from 2.32.4 to 2.32.3. |

Sequence Diagram(s)

sequenceDiagram
    participant GH as GitHub Actions<br/>(Workflow)
    participant Fetch as fetch_trending.py<br/>(Multi-pass Fetch)
    participant GitAPI as GitHub API
    participant Cache as CachedTrendingDataSource
    participant JSON as JSON Parser
    participant Repo as HomeRepositoryImpl
    participant Domain as Domain Model

    rect rgb(220, 240, 255)
    Note over GH: Every 12 hours
    GH->>GH: git pull --rebase
    end

    rect rgb(230, 245, 230)
    Note over Fetch: Multi-pass Search<br/>(Dynamic params)
    loop Attempts: days, stars, topics
        Fetch->>Fetch: make_request_with_retry()
        Fetch->>GitAPI: Search Query<br/>(with exponential backoff)
        GitAPI-->>Fetch: Results
        Fetch->>Fetch: Score & Filter<br/>(score >= 5)
        Fetch->>Fetch: Dedup via seen set
        Fetch->>Fetch: Check installers<br/>(top 50)
    end
    Fetch->>Fetch: Build summary objects<br/>(complete metadata)
    Fetch->>Fetch: Sort by stargazers<br/>(stable order)
    end

    rect rgb(255, 240, 220)
    Note over Cache: Caching Layer
    Fetch->>Cache: Store CachedRepoResponse<br/>(cached data)
    Cache->>JSON: Parse response
    JSON-->>Cache: CachedRepoResponse objects
    end

    rect rgb(245, 220, 245)
    Note over Repo: Domain Mapping
    Cache->>Repo: getCachedTrendingRepos()
    Repo->>Domain: toGithubRepoSummary()<br/>(convert models)
    Domain-->>Repo: GithubRepoSummary list
    Repo-->>Repo: Emit PaginatedRepos
    end

    rect rgb(220, 240, 255)
    Note over GH: Finalize
    GH->>GH: Stage cached-data/
    GH->>GH: Commit (if changes)
    GH->>GH: Push to remote
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Trending repos automate #106: Both PRs modify the trending data fetch workflow, Python fetch logic, and introduce or extend cached data models and repository mapping in Kotlin; this PR builds upon and extends the architecture introduced in #106.

Poem

🐰 A rabbit's tribute to caching and retries...

With twelve-hour hops through data streams so wide,
We cache the trends and let conversions guide—
Retry with backoff, deduplicate with care,
Parse JSON swiftly, domain models to share! 🌟

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main changes: refactoring data layer with improvements to caching and trending repo fetching, which directly aligns with the changeset scope. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 85.71% which is sufficient. The required threshold is 80.00%. |


coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

composeApp/src/commonMain/kotlin/zed/rainxch/githubstore/feature/home/data/data_source/CachedTrendingDataSource.kt (1)

52-109: The close() method on CachedTrendingDataSource is never invoked in the codebase.

The class has a close() method (lines 114-116) that properly closes the HttpClient, but no consumer of CachedTrendingDataSource calls it. Since the class is registered as a singleton in Koin's DI container (SharedModules.kt), the HttpClient and its connection pools remain open for the entire application lifetime, leaking resources until process termination. Either implement Closeable/AutoCloseable with proper DI scope cleanup, or ensure the repository or application lifecycle handler explicitly calls cachedDataSource.close() on shutdown.

🧹 Nitpick comments (3)

scripts/fetch_trending.py (2)
312-318: Redundant seen.add() call on line 313.

Line 318 adds full_name to seen unconditionally after both branches, making line 313 redundant. The set will handle duplicates, but it's cleaner to have a single add.
Proposed fix
                         results.append(summary)
-                        seen.add(full_name)
                         print(f"✓ Found ({len(results)}/{desired_count}) {full_name}")
                     else:
                         print(f"✗ No installers {full_name}")

                     seen.add(full_name)  # Add to seen even if no installers to avoid rechecking
323-325: Consider narrowing the exception type.

While catching broad Exception here ensures resilience, you could narrow it to expected types like (json.JSONDecodeError, KeyError, TypeError) for better error specificity. This is optional since the current approach logs the error and continues gracefully.
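
As a rough illustration of the narrower handler, the sketch below catches only the three suggested exception types while building a summary from one raw item. The function name `summarize` and the input keys are assumptions made for the example, not taken from the script.

```python
import json
from typing import Optional


def summarize(raw: str) -> Optional[dict]:
    """Return a small summary dict, or None if the item is malformed."""
    try:
        data = json.loads(raw)
        return {
            "fullName": data["full_name"],
            "stargazersCount": int(data["stargazers_count"]),
        }
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        # Expected data problems are logged and skipped; anything else
        # (for example a programming error) still propagates.
        print(f"Skipping malformed item: {type(exc).__name__}: {exc}")
        return None


good = summarize('{"full_name": "a/b", "stargazers_count": 7}')
bad_json = summarize("not json")               # json.JSONDecodeError
missing_key = summarize('{"full_name": "a/b"}')  # KeyError
wrong_shape = summarize("[1, 2]")              # TypeError: list indexed by str
```

The trade-off is that an unexpected exception type now stops the run instead of being logged, which may or may not be wanted in a scheduled job.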
composeApp/src/commonMain/kotlin/zed/rainxch/githubstore/feature/home/data/data_source/CachedTrendingDataSource.kt (1)
164-174: Hardcoded id = 0 for owner may cause issues if used as a key.

If GithubUser.id is used in equality checks, as a map key, or for diffing in UI lists, all cached repo owners will collide. Consider using a deterministic placeholder (e.g., owner.login.hashCode().toLong()) or ensure downstream code doesn't rely on owner id for uniqueness.
🔎 Alternative using deterministic ID
         owner = GithubUser(
-            id = 0,
+            id = owner.login.hashCode().toLong(),
             login = owner.login,
             avatarUrl = owner.avatarUrl,
             htmlUrl = "https://github.com/${owner.login}"
         ),

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ecbdc5d and f4a4922.

📒 Files selected for processing (5)

.github/workflows/fetch-trending-repos.yml
composeApp/src/commonMain/kotlin/zed/rainxch/githubstore/feature/home/data/data_source/CachedTrendingDataSource.kt
composeApp/src/commonMain/kotlin/zed/rainxch/githubstore/feature/home/data/repository/HomeRepositoryImpl.kt
scripts/fetch_trending.py
scripts/requirements.txt

🧰 Additional context used

🪛 Ruff (0.14.10)

scripts/fetch_trending.py

221-221: Local variable min_count is assigned to but never used

Remove assignment to unused variable min_count

(F841)

323-323: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (11)

.github/workflows/fetch-trending-repos.yml (1)

39-42: Rebase failure could cause silent push issues.

If git pull --rebase encounters conflicts, the workflow will fail at that step. However, the current flow handles the common case well. Consider adding || exit 1 to make failures explicit, though this may be acceptable for the use case.

One minor note: if the commit step outputs "No changes to commit", the subsequent git push will still execute but is a harmless no-op (pushes current HEAD which is already up-to-date).

scripts/fetch_trending.py (3)

44-132: Retry logic is well-implemented.

The exponential backoff with rate limit handling, server error retries, and timeout handling covers the expected failure modes for GitHub API interactions robustly.
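
For readers unfamiliar with the pattern, a minimal sketch of retry with exponential backoff is shown here. It is not the script's `make_request_with_retry`; the name `call_with_retry`, the set of retryable status codes, and the injected `send` and `sleep` callables are assumptions chosen so the example runs without network access.

```python
import time
from typing import Callable, Optional, Tuple

# Assumed set of statuses worth retrying (rate limiting and server errors).
RETRYABLE_STATUS = {429, 500, 502, 503, 504}


def call_with_retry(
    send: Callable[[], Tuple[int, object]],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Tuple[int, object]]:
    """Call send() until it yields a non-retryable status or retries run out."""
    for attempt in range(max_retries + 1):
        try:
            status, body = send()
        except TimeoutError:
            status, body = None, None  # treat a timeout as retryable
        if status is not None and status not in RETRYABLE_STATUS:
            return status, body
        if attempt == max_retries:
            break
        sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    return None


# Scripted responses: a server error, a timeout, then success.
script = [(503, None), TimeoutError("slow"), (200, {"ok": True})]
delays = []


def fake_send():
    step = script.pop(0)
    if isinstance(step, Exception):
        raise step
    return step


result = call_with_retry(fake_send, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the backoff schedule observable: here `delays` records the waits instead of actually sleeping.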

223-234: Multi-pass search strategy is well-designed.

The progressive broadening (wider date range, lower star threshold, dropping topic filters) effectively increases candidate pool when initial results are insufficient.

327-329: Final sorting ensures consistent output.

Sorting by stargazersCount before truncating ensures the most starred repositories are retained regardless of discovery order.
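
The order of the two operations matters, as this small sketch suggests. The sample data and the helper name `top_by_stars` are made up for illustration.

```python
def top_by_stars(repos, limit):
    """Sort by star count (descending) first, then keep the top entries."""
    return sorted(repos, key=lambda r: r["stargazersCount"], reverse=True)[:limit]


# Discovery order does not match star order.
discovered = [
    {"fullName": "x/small", "stargazersCount": 12},
    {"fullName": "y/large", "stargazersCount": 4000},
    {"fullName": "z/medium", "stargazersCount": 300},
]

kept = top_by_stars(discovered, 2)

# Truncating first would keep whatever was discovered earliest instead.
truncated_first = sorted(
    discovered[:2], key=lambda r: r["stargazersCount"], reverse=True
)
```

With sort-then-truncate the result holds `y/large` and `z/medium`; with truncate-then-sort, `x/small` survives only because it was found first.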

composeApp/src/commonMain/kotlin/zed/rainxch/githubstore/feature/home/data/data_source/CachedTrendingDataSource.kt (5)

3-20: LGTM!

The imports are well-organized and align with the refactored HTTP handling approach (manual JSON parsing, explicit timeout/retry plugins).

30-44: LGTM!

The HttpClient configuration is well-tuned: reasonable timeouts, exponential backoff retry on server errors, and expectSuccess = false for manual response handling.

130-149: Same naming convention concern applies here.

Fields like fullName, defaultBranch, htmlUrl, stargazersCount, forksCount, releasesUrl, and updatedAt would typically be snake_case in raw GitHub API data. If the cached JSON uses snake_case, add @SerialName annotations (e.g., @SerialName("full_name") val fullName: String).

151-159: Field avatarUrl likely needs @SerialName("avatar_url").

Same snake_case concern applies. The verification script above will confirm the actual JSON structure.

119-128: No action needed. The cached JSON files use camelCase field names (e.g., lastUpdated, totalCount, fullName, defaultBranch, htmlUrl, stargazersCount, forksCount, updatedAt) that exactly match the property names in CachedRepoResponse and CachedGithubRepoSummary. The @SerialName annotations are not required, and the serialization works correctly with the current configuration.

Likely an incorrect or invalid review comment.

composeApp/src/commonMain/kotlin/zed/rainxch/githubstore/feature/home/data/repository/HomeRepositoryImpl.kt (2)

35-35: LGTM!

Import correctly added for the toGithubRepoSummary extension function.

61-65: LGTM!

The mapping correctly converts cached repository models to domain models before emission, ensuring type consistency throughout the data layer.

coderabbitai · 2025-12-29T06:26:40Z

+    seen: set = set()
+    attempt = 0
+    max_attempts = 4
+    min_count = 10  # Ensure at least this many if possible


⚠️ Potential issue | 🟡 Minor

Remove unused variable min_count.

This variable is assigned but never used. Per static analysis hint.

Proposed fix

- min_count = 10 # Ensure at least this many if possible


🧰 Tools

🪛 Ruff (0.14.10)

221-221: Local variable min_count is assigned to but never used

Remove assignment to unused variable min_count

(F841)

🤖 Prompt for AI Agents

In scripts/fetch_trending.py around line 221, the variable assignment "min_count = 10" is unused; remove the unused variable declaration (or if intended, apply it where the count of items is determined—e.g., enforce a minimum result count by using min_count in that filtering/selection logic). Update the code to either delete the line or integrate min_count into the selection logic so it is referenced and no longer triggers a static-analysis warning.

coderabbitai Bot reviewed Dec 29, 2025

chore: Bump requests from 2.32.3 to 2.32.4

81399e4

This commit updates the version of the `requests` library in the project's requirements.

rainxchzed merged commit 99569cb into main Dec 29, 2025
2 checks passed

This was referenced Feb 13, 2026

chore: Remove trending repository fetch workflow and scripts #237

Merged

feat(home): Auto-scroll to top on category switch #240

Merged

coderabbitai Bot mentioned this pull request Mar 20, 2026

feat: implement cross-platform repository fetching and merging #340

Merged

rainxchzed merged 2 commits into main from update-trending-automation
