Skip to content

refactor(data): Improve caching and trending repo fetching#107

Merged
rainxchzed merged 2 commits into
mainfrom
update-trending-automation
Dec 29, 2025
Merged

refactor(data): Improve caching and trending repo fetching#107
rainxchzed merged 2 commits into
mainfrom
update-trending-automation

Conversation

@rainxchzed
Copy link
Copy Markdown
Member

@rainxchzed rainxchzed commented Dec 29, 2025

This commit refactors the data layer for fetching and handling cached trending repositories, improving both the client-side parsing and the backend script that generates the data.

Key Changes:

Data Fetching & Parsing:

  • Introduced CachedGithubRepoSummary and CachedGithubOwner data classes to precisely match the structure of the pre-cached JSON files. This prevents parsing errors if the full GithubRepoSummary model contains fields not present in the cached data.
  • Added a toGithubRepoSummary() extension function to map the cached data model to the domain model used in the app.
  • Enhanced logging in CachedTrendingDataSource with more detailed messages for success, failures (404), timeouts, and serialization errors to improve debugging.
  • Removed the ContentNegotiation plugin from the dedicated HttpClient in CachedTrendingDataSource to handle JSON parsing manually, providing better error handling.

Backend Script (fetch_trending.py):

  • Implemented a more robust, multi-attempt search strategy to find a sufficient number of relevant repositories.
  • The script now progressively broadens its search criteria (widening the date range, lowering the star requirement, and eventually dropping topic filters) across multiple attempts if not enough results are found initially.
  • The desired number of repositories to fetch per platform has been increased from 30 to 80 to provide a richer dataset.
  • The logic now tracks repositories that have already been checked (in a seen set) to avoid redundant API calls.
  • The final list of repositories is sorted by star count before being saved.

CI/CD (fetch-trending-repos.yml):

  • The cron schedule for the trending repositories job has been changed from every 6 hours to every 12 hours to reduce build frequency.
  • The Git commit-and-push logic is simplified to use git commit || echo "No changes to commit" to gracefully handle cases where no data has changed, removing the need for a separate check step.

Summary by CodeRabbit

  • New Features

    • Implemented caching mechanism for trending repository data to improve performance and reliability.
    • Enhanced data fetching with automatic retry logic and exponential backoff for resilient API interactions.
    • Expanded data collection from 30 to 80 repositories per platform for more comprehensive results.
  • Bug Fixes

    • Improved error handling for timeout, serialization, and server errors during repository fetching.
  • Chores

    • Updated automated trending data update schedule from every 6 hours to every 12 hours.
    • Updated dependency version for improved compatibility.

✏️ Tip: You can customize this high-level summary in your review settings.

This commit refactors the data layer for fetching and handling cached trending repositories, improving both the client-side parsing and the backend script that generates the data.

### Key Changes:

**Data Fetching & Parsing:**
- Introduced `CachedGithubRepoSummary` and `CachedGithubOwner` data classes to precisely match the structure of the pre-cached JSON files. This prevents parsing errors if the full `GithubRepoSummary` model contains fields not present in the cached data.
- Added a `toGithubRepoSummary()` extension function to map the cached data model to the domain model used in the app.
- Enhanced logging in `CachedTrendingDataSource` with more detailed messages for success, failures (404), timeouts, and serialization errors to improve debugging.
- Removed the `ContentNegotiation` plugin from the dedicated `HttpClient` in `CachedTrendingDataSource` to handle JSON parsing manually, providing better error handling.

**Backend Script (`fetch_trending.py`):**
- Implemented a more robust, multi-attempt search strategy to find a sufficient number of relevant repositories.
- The script now progressively broadens its search criteria (widening the date range, lowering the star requirement, and eventually dropping topic filters) across multiple attempts if not enough results are found initially.
- The desired number of repositories to fetch per platform has been increased from 30 to 80 to provide a richer dataset.
- The logic now tracks repositories that have already been checked (in a `seen` set) to avoid redundant API calls.
- The final list of repositories is sorted by star count before being saved.

**CI/CD (`fetch-trending-repos.yml`):**
- The cron schedule for the trending repositories job has been changed from every 6 hours to every 12 hours to reduce build frequency.
- The Git commit-and-push logic is simplified to use `git commit || echo "No changes to commit"` to gracefully handle cases where no data has changed, removing the need for a separate check step.
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Dec 29, 2025

Walkthrough

The pull request enhances the trending repository data collection system by restructuring the workflow schedule (6 hours to 12 hours), introducing a cached data model layer with explicit JSON parsing, implementing retry-enabled multi-pass search logic in the fetch script, and converting cached models to domain objects in the repository layer.

Changes

Cohort / File(s) Summary
Workflow Automation
.github/workflows/fetch-trending-repos.yml
Changed cron schedule from every 6 hours to every 12 hours. Removed explicit change detection and conditional branching; replaced with a streamlined git pull--rebase, stage, commit (with no-op fallback), and unconditional push flow. Added pre-commit pull to update local branch before staging.
Cached Data Models
composeApp/src/commonMain/kotlin/zed/rainxch/githubstore/feature/home/data/data_source/CachedTrendingDataSource.kt
Introduced new public API getCachedTrendingRepos() and three data classes (CachedRepoResponse, CachedGithubRepoSummary, CachedGithubOwner) to represent cached repository metadata. Added extension function toGithubRepoSummary() for model conversion. Switched from ContentNegotiation to manual JSON parsing via Json.decodeFromString(). Enhanced logging and error handling with emoji markers and additional context.
Repository Mapping
composeApp/src/commonMain/kotlin/zed/rainxch/githubstore/feature/home/data/repository/HomeRepositoryImpl.kt
Updated cached repository emission to apply toGithubRepoSummary() conversion, transforming cached models to domain entities before emitting in PaginatedRepos.
Trending Data Fetching
scripts/fetch_trending.py
Rewrote fetch logic with multi-pass search strategy, configurable parameters (days window, stars threshold, topics), and exponential backoff retry handler (make_request_with_retry). Implemented deduplication via seen set and tightened candidate filtering (score >= 5). Increased default desired_count from 30 to 80. Expanded final result to include complete repository summary metadata (id, name, fullName, owner, stargazersCount, etc.).
Dependency Management
scripts/requirements.txt
Downgraded requests package from 2.32.4 to 2.32.3.

Sequence Diagram(s)

sequenceDiagram
    participant GH as GitHub Actions<br/>(Workflow)
    participant Fetch as fetch_trending.py<br/>(Multi-pass Fetch)
    participant GitAPI as GitHub API
    participant Cache as CachedTrendingDataSource
    participant JSON as JSON Parser
    participant Repo as HomeRepositoryImpl
    participant Domain as Domain Model

    rect rgb(220, 240, 255)
    Note over GH: Every 12 hours
    GH->>GH: git pull --rebase
    end

    rect rgb(230, 245, 230)
    Note over Fetch: Multi-pass Search<br/>(Dynamic params)
    loop Attempts: days, stars, topics
        Fetch->>Fetch: make_request_with_retry()
        Fetch->>GitAPI: Search Query<br/>(with exponential backoff)
        GitAPI-->>Fetch: Results
        Fetch->>Fetch: Score & Filter<br/>(score >= 5)
        Fetch->>Fetch: Dedup via seen set
        Fetch->>Fetch: Check installers<br/>(top 50)
    end
    Fetch->>Fetch: Build summary objects<br/>(complete metadata)
    Fetch->>Fetch: Sort by stargazers<br/>(stable order)
    end

    rect rgb(255, 240, 220)
    Note over Cache: Caching Layer
    Fetch->>Cache: Store CachedRepoResponse<br/>(cached data)
    Cache->>JSON: Parse response
    JSON-->>Cache: CachedRepoResponse objects
    end

    rect rgb(245, 220, 245)
    Note over Repo: Domain Mapping
    Cache->>Repo: getCachedTrendingRepos()
    Repo->>Domain: toGithubRepoSummary()<br/>(convert models)
    Domain-->>Repo: GithubRepoSummary list
    Repo-->>Repo: Emit PaginatedRepos
    end

    rect rgb(220, 240, 255)
    Note over GH: Finalize
    GH->>GH: Stage cached-data/
    GH->>GH: Commit (if changes)
    GH->>GH: Push to remote
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

  • Trending repos automate #106: Both PRs modify the trending data fetch workflow, Python fetch logic, and introduce or extend cached data models and repository mapping in Kotlin; this PR builds upon and extends the architecture introduced in #106.

Poem

🐰 A rabbit's tribute to caching and retries...

With twelve-hour hops through data streams so wide,
We cache the trends and let conversions guide—
Retry with backoff, deduplicate with care,
Parse JSON swiftly, domain models to share! 🌟

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main changes: refactoring data layer with improvements to caching and trending repo fetching, which directly aligns with the changeset scope.
Docstring Coverage ✅ Passed Docstring coverage is 85.71% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch update-trending-automation

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
composeApp/src/commonMain/kotlin/zed/rainxch/githubstore/feature/home/data/data_source/CachedTrendingDataSource.kt (1)

52-109: The close() method on CachedTrendingDataSource is never invoked in the codebase.

The class has a close() method (lines 114-116) that properly closes the HttpClient, but no consumer of CachedTrendingDataSource calls it. Since the class is registered as a singleton in Koin's DI container (SharedModules.kt), the HttpClient and its connection pools remain open for the entire application lifetime, leaking resources until process termination. Either implement Closeable/AutoCloseable with proper DI scope cleanup, or ensure the repository or application lifecycle handler explicitly calls cachedDataSource.close() on shutdown.

🧹 Nitpick comments (3)
scripts/fetch_trending.py (2)

312-318: Redundant seen.add() call on line 313.

Line 318 adds full_name to seen unconditionally after both branches, making line 313 redundant. The set will handle duplicates, but it's cleaner to have a single add.

Proposed fix
                         results.append(summary)
-                        seen.add(full_name)
                         print(f"✓ Found ({len(results)}/{desired_count}) {full_name}")
                     else:
                         print(f"✗ No installers {full_name}")

                     seen.add(full_name)  # Add to seen even if no installers to avoid rechecking

323-325: Consider narrowing the exception type.

While catching broad Exception here ensures resilience, you could narrow it to expected types like (json.JSONDecodeError, KeyError, TypeError) for better error specificity. This is optional since the current approach logs the error and continues gracefully.

composeApp/src/commonMain/kotlin/zed/rainxch/githubstore/feature/home/data/data_source/CachedTrendingDataSource.kt (1)

164-174: Hardcoded id = 0 for owner may cause issues if used as a key.

If GithubUser.id is used in equality checks, as a map key, or for diffing in UI lists, all cached repo owners will collide. Consider using a deterministic placeholder (e.g., owner.login.hashCode().toLong()) or ensure downstream code doesn't rely on owner id for uniqueness.

🔎 Alternative using deterministic ID
         owner = GithubUser(
-            id = 0,
+            id = owner.login.hashCode().toLong(),
             login = owner.login,
             avatarUrl = owner.avatarUrl,
             htmlUrl = "https://github.com/${owner.login}"
         ),
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ecbdc5d and f4a4922.

📒 Files selected for processing (5)
  • .github/workflows/fetch-trending-repos.yml
  • composeApp/src/commonMain/kotlin/zed/rainxch/githubstore/feature/home/data/data_source/CachedTrendingDataSource.kt
  • composeApp/src/commonMain/kotlin/zed/rainxch/githubstore/feature/home/data/repository/HomeRepositoryImpl.kt
  • scripts/fetch_trending.py
  • scripts/requirements.txt
🧰 Additional context used
🪛 Ruff (0.14.10)
scripts/fetch_trending.py

221-221: Local variable min_count is assigned to but never used

Remove assignment to unused variable min_count

(F841)


323-323: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (11)
.github/workflows/fetch-trending-repos.yml (1)

39-42: Rebase failure could cause silent push issues.

If git pull --rebase encounters conflicts, the workflow will fail at that step. However, the current flow handles the common case well. Consider adding || exit 1 to make failures explicit, though this may be acceptable for the use case.

One minor note: if the commit step outputs "No changes to commit", the subsequent git push will still execute but is a harmless no-op (pushes current HEAD which is already up-to-date).

scripts/fetch_trending.py (3)

44-132: Retry logic is well-implemented.

The exponential backoff with rate limit handling, server error retries, and timeout handling covers the expected failure modes for GitHub API interactions robustly.


223-234: Multi-pass search strategy is well-designed.

The progressive broadening (wider date range, lower star threshold, dropping topic filters) effectively increases candidate pool when initial results are insufficient.


327-329: Final sorting ensures consistent output.

Sorting by stargazersCount before truncating ensures the most starred repositories are retained regardless of discovery order.

composeApp/src/commonMain/kotlin/zed/rainxch/githubstore/feature/home/data/data_source/CachedTrendingDataSource.kt (5)

3-20: LGTM!

The imports are well-organized and align with the refactored HTTP handling approach (manual JSON parsing, explicit timeout/retry plugins).


30-44: LGTM!

The HttpClient configuration is well-tuned: reasonable timeouts, exponential backoff retry on server errors, and expectSuccess = false for manual response handling.


130-149: Same naming convention concern applies here.

Fields like fullName, defaultBranch, htmlUrl, stargazersCount, forksCount, releasesUrl, and updatedAt would typically be snake_case in raw GitHub API data. If the cached JSON uses snake_case, add @SerialName annotations (e.g., @SerialName("full_name") val fullName: String).


151-159: Field avatarUrl likely needs @SerialName("avatar_url").

Same snake_case concern applies. The verification script above will confirm the actual JSON structure.


119-128: No action needed. The cached JSON files use camelCase field names (e.g., lastUpdated, totalCount, fullName, defaultBranch, htmlUrl, stargazersCount, forksCount, updatedAt) that exactly match the property names in CachedRepoResponse and CachedGithubRepoSummary. The @SerialName annotations are not required, and the serialization works correctly with the current configuration.

Likely an incorrect or invalid review comment.

composeApp/src/commonMain/kotlin/zed/rainxch/githubstore/feature/home/data/repository/HomeRepositoryImpl.kt (2)

35-35: LGTM!

Import correctly added for the toGithubRepoSummary extension function.


61-65: LGTM!

The mapping correctly converts cached repository models to domain models before emission, ensuring type consistency throughout the data layer.

Comment thread scripts/fetch_trending.py
seen: set = set()
attempt = 0
max_attempts = 4
min_count = 10 # Ensure at least this many if possible
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Remove unused variable min_count.

This variable is assigned but never used. Per static analysis hint.

Proposed fix
-    min_count = 10  # Ensure at least this many if possible
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
min_count = 10 # Ensure at least this many if possible
🧰 Tools
🪛 Ruff (0.14.10)

221-221: Local variable min_count is assigned to but never used

Remove assignment to unused variable min_count

(F841)

🤖 Prompt for AI Agents
In scripts/fetch_trending.py around line 221, the variable assignment "min_count
= 10" is unused; remove the unused variable declaration (or if intended, apply
it where the count of items is determined—e.g., enforce a minimum result count
by using min_count in that filtering/selection logic). Update the code to either
delete the line or integrate min_count into the selection logic so it is
referenced and no longer triggers a static-analysis warning.

Comment thread scripts/requirements.txt Outdated
This commit updates the version of the `requests` library in the project's requirements.
@rainxchzed rainxchzed merged commit 99569cb into main Dec 29, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant