refactor(data): Improve caching and trending repo fetching#107
Conversation
This commit refactors the data layer for fetching and handling cached trending repositories, improving both the client-side parsing and the backend script that generates the data. ### Key Changes: **Data Fetching & Parsing:** - Introduced `CachedGithubRepoSummary` and `CachedGithubOwner` data classes to precisely match the structure of the pre-cached JSON files. This prevents parsing errors if the full `GithubRepoSummary` model contains fields not present in the cached data. - Added a `toGithubRepoSummary()` extension function to map the cached data model to the domain model used in the app. - Enhanced logging in `CachedTrendingDataSource` with more detailed messages for success, failures (404), timeouts, and serialization errors to improve debugging. - Removed the `ContentNegotiation` plugin from the dedicated `HttpClient` in `CachedTrendingDataSource` to handle JSON parsing manually, providing better error handling. **Backend Script (`fetch_trending.py`):** - Implemented a more robust, multi-attempt search strategy to find a sufficient number of relevant repositories. - The script now progressively broadens its search criteria (widening the date range, lowering the star requirement, and eventually dropping topic filters) across multiple attempts if not enough results are found initially. - The desired number of repositories to fetch per platform has been increased from 30 to 80 to provide a richer dataset. - The logic now tracks repositories that have already been checked (in a `seen` set) to avoid redundant API calls. - The final list of repositories is sorted by star count before being saved. **CI/CD (`fetch-trending-repos.yml`):** - The cron schedule for the trending repositories job has been changed from every 6 hours to every 12 hours to reduce build frequency. - The Git commit-and-push logic is simplified to use `git commit || echo "No changes to commit"` to gracefully handle cases where no data has changed, removing the need for a separate check step.
WalkthroughThe pull request enhances the trending repository data collection system by restructuring the workflow schedule (6 hours to 12 hours), introducing a cached data model layer with explicit JSON parsing, implementing retry-enabled multi-pass search logic in the fetch script, and converting cached models to domain objects in the repository layer. Changes
Sequence Diagram(s)sequenceDiagram
participant GH as GitHub Actions<br/>(Workflow)
participant Fetch as fetch_trending.py<br/>(Multi-pass Fetch)
participant GitAPI as GitHub API
participant Cache as CachedTrendingDataSource
participant JSON as JSON Parser
participant Repo as HomeRepositoryImpl
participant Domain as Domain Model
rect rgb(220, 240, 255)
Note over GH: Every 12 hours
GH->>GH: git pull --rebase
end
rect rgb(230, 245, 230)
Note over Fetch: Multi-pass Search<br/>(Dynamic params)
loop Attempts: days, stars, topics
Fetch->>Fetch: make_request_with_retry()
Fetch->>GitAPI: Search Query<br/>(with exponential backoff)
GitAPI-->>Fetch: Results
Fetch->>Fetch: Score & Filter<br/>(score >= 5)
Fetch->>Fetch: Dedup via seen set
Fetch->>Fetch: Check installers<br/>(top 50)
end
Fetch->>Fetch: Build summary objects<br/>(complete metadata)
Fetch->>Fetch: Sort by stargazers<br/>(stable order)
end
rect rgb(255, 240, 220)
Note over Cache: Caching Layer
Fetch->>Cache: Store CachedRepoResponse<br/>(cached data)
Cache->>JSON: Parse response
JSON-->>Cache: CachedRepoResponse objects
end
rect rgb(245, 220, 245)
Note over Repo: Domain Mapping
Cache->>Repo: getCachedTrendingRepos()
Repo->>Domain: toGithubRepoSummary()<br/>(convert models)
Domain-->>Repo: GithubRepoSummary list
Repo-->>Repo: Emit PaginatedRepos
end
rect rgb(220, 240, 255)
Note over GH: Finalize
GH->>GH: Stage cached-data/
GH->>GH: Commit (if changes)
GH->>GH: Push to remote
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related PRs
Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
composeApp/src/commonMain/kotlin/zed/rainxch/githubstore/feature/home/data/data_source/CachedTrendingDataSource.kt (1)
52-109: Theclose()method onCachedTrendingDataSourceis never invoked in the codebase.The class has a
close()method (lines 114-116) that properly closes the HttpClient, but no consumer ofCachedTrendingDataSourcecalls it. Since the class is registered as a singleton in Koin's DI container (SharedModules.kt), the HttpClient and its connection pools remain open for the entire application lifetime, leaking resources until process termination. Either implementCloseable/AutoCloseablewith proper DI scope cleanup, or ensure the repository or application lifecycle handler explicitly callscachedDataSource.close()on shutdown.
🧹 Nitpick comments (3)
scripts/fetch_trending.py (2)
312-318: Redundantseen.add()call on line 313.Line 318 adds
full_nametoseenunconditionally after both branches, making line 313 redundant. The set will handle duplicates, but it's cleaner to have a single add.Proposed fix
results.append(summary) - seen.add(full_name) print(f"✓ Found ({len(results)}/{desired_count}) {full_name}") else: print(f"✗ No installers {full_name}") seen.add(full_name) # Add to seen even if no installers to avoid rechecking
323-325: Consider narrowing the exception type.While catching broad
Exceptionhere ensures resilience, you could narrow it to expected types like(json.JSONDecodeError, KeyError, TypeError)for better error specificity. This is optional since the current approach logs the error and continues gracefully.composeApp/src/commonMain/kotlin/zed/rainxch/githubstore/feature/home/data/data_source/CachedTrendingDataSource.kt (1)
164-174: Hardcodedid = 0for owner may cause issues if used as a key.If
GithubUser.idis used in equality checks, as a map key, or for diffing in UI lists, all cached repo owners will collide. Consider using a deterministic placeholder (e.g.,owner.login.hashCode().toLong()) or ensure downstream code doesn't rely on owneridfor uniqueness.🔎 Alternative using deterministic ID
owner = GithubUser( - id = 0, + id = owner.login.hashCode().toLong(), login = owner.login, avatarUrl = owner.avatarUrl, htmlUrl = "https://github.com/${owner.login}" ),
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
.github/workflows/fetch-trending-repos.ymlcomposeApp/src/commonMain/kotlin/zed/rainxch/githubstore/feature/home/data/data_source/CachedTrendingDataSource.ktcomposeApp/src/commonMain/kotlin/zed/rainxch/githubstore/feature/home/data/repository/HomeRepositoryImpl.ktscripts/fetch_trending.pyscripts/requirements.txt
🧰 Additional context used
🪛 Ruff (0.14.10)
scripts/fetch_trending.py
221-221: Local variable min_count is assigned to but never used
Remove assignment to unused variable min_count
(F841)
323-323: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (11)
.github/workflows/fetch-trending-repos.yml (1)
39-42: Rebase failure could cause silent push issues.If
git pull --rebaseencounters conflicts, the workflow will fail at that step. However, the current flow handles the common case well. Consider adding|| exit 1to make failures explicit, though this may be acceptable for the use case.One minor note: if the commit step outputs "No changes to commit", the subsequent
git pushwill still execute but is a harmless no-op (pushes current HEAD which is already up-to-date).scripts/fetch_trending.py (3)
44-132: Retry logic is well-implemented.The exponential backoff with rate limit handling, server error retries, and timeout handling covers the expected failure modes for GitHub API interactions robustly.
223-234: Multi-pass search strategy is well-designed.The progressive broadening (wider date range, lower star threshold, dropping topic filters) effectively increases candidate pool when initial results are insufficient.
327-329: Final sorting ensures consistent output.Sorting by
stargazersCountbefore truncating ensures the most starred repositories are retained regardless of discovery order.composeApp/src/commonMain/kotlin/zed/rainxch/githubstore/feature/home/data/data_source/CachedTrendingDataSource.kt (5)
3-20: LGTM!The imports are well-organized and align with the refactored HTTP handling approach (manual JSON parsing, explicit timeout/retry plugins).
30-44: LGTM!The HttpClient configuration is well-tuned: reasonable timeouts, exponential backoff retry on server errors, and
expectSuccess = falsefor manual response handling.
130-149: Same naming convention concern applies here.Fields like
fullName,defaultBranch,htmlUrl,stargazersCount,forksCount,releasesUrl, andupdatedAtwould typically be snake_case in raw GitHub API data. If the cached JSON uses snake_case, add@SerialNameannotations (e.g.,@SerialName("full_name") val fullName: String).
151-159: FieldavatarUrllikely needs@SerialName("avatar_url").Same snake_case concern applies. The verification script above will confirm the actual JSON structure.
119-128: No action needed. The cached JSON files use camelCase field names (e.g.,lastUpdated,totalCount,fullName,defaultBranch,htmlUrl,stargazersCount,forksCount,updatedAt) that exactly match the property names inCachedRepoResponseandCachedGithubRepoSummary. The@SerialNameannotations are not required, and the serialization works correctly with the current configuration.Likely an incorrect or invalid review comment.
composeApp/src/commonMain/kotlin/zed/rainxch/githubstore/feature/home/data/repository/HomeRepositoryImpl.kt (2)
35-35: LGTM!Import correctly added for the
toGithubRepoSummaryextension function.
61-65: LGTM!The mapping correctly converts cached repository models to domain models before emission, ensuring type consistency throughout the data layer.
| seen: set = set() | ||
| attempt = 0 | ||
| max_attempts = 4 | ||
| min_count = 10 # Ensure at least this many if possible |
There was a problem hiding this comment.
Remove unused variable min_count.
This variable is assigned but never used. Per static analysis hint.
Proposed fix
- min_count = 10 # Ensure at least this many if possible📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| min_count = 10 # Ensure at least this many if possible |
🧰 Tools
🪛 Ruff (0.14.10)
221-221: Local variable min_count is assigned to but never used
Remove assignment to unused variable min_count
(F841)
🤖 Prompt for AI Agents
In scripts/fetch_trending.py around line 221, the variable assignment "min_count
= 10" is unused; remove the unused variable declaration (or if intended, apply
it where the count of items is determined—e.g., enforce a minimum result count
by using min_count in that filtering/selection logic). Update the code to either
delete the line or integrate min_count into the selection logic so it is
referenced and no longer triggers a static-analysis warning.
This commit updates the version of the `requests` library in the project's requirements.
This commit refactors the data layer for fetching and handling cached trending repositories, improving both the client-side parsing and the backend script that generates the data.
Key Changes:
Data Fetching & Parsing:
CachedGithubRepoSummaryandCachedGithubOwnerdata classes to precisely match the structure of the pre-cached JSON files. This prevents parsing errors if the fullGithubRepoSummarymodel contains fields not present in the cached data.toGithubRepoSummary()extension function to map the cached data model to the domain model used in the app.CachedTrendingDataSourcewith more detailed messages for success, failures (404), timeouts, and serialization errors to improve debugging.ContentNegotiationplugin from the dedicatedHttpClientinCachedTrendingDataSourceto handle JSON parsing manually, providing better error handling.Backend Script (
fetch_trending.py):seenset) to avoid redundant API calls.CI/CD (
fetch-trending-repos.yml):git commit || echo "No changes to commit"to gracefully handle cases where no data has changed, removing the need for a separate check step.Summary by CodeRabbit
New Features
Bug Fixes
Chores
✏️ Tip: You can customize this high-level summary in your review settings.