fix: smart archive fallback for HTTP 300 + fix socket leak + pandas FutureWarnings#910
Open
priya-gitTest wants to merge 5 commits intoKnowledgeCaptureAndDiscovery:masterfrom
Conversation
….exit() When neither the primary archive URL nor the main.zip fallback returns HTTP 200, download_github_files() was calling sys.exit() which crashed the entire SOMEF process. Replace with logging.error() + return None, consistent with how download_readme() handles the same situation. Fixes KnowledgeCaptureAndDiscovery#909. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…mit_get ## Root cause of issue KnowledgeCaptureAndDiscovery#909 (balaje/icefem) GitHub's short-form archive URL `/archive/{ref}.zip` returns HTTP 300 (Multiple Choices) when the ref name is ambiguous — i.e. when a branch and a tag share the same name (e.g. default branch `v2.0` and a tag also named `v2.0`). The previous code only handled HTTP 404 with a single `main.zip` fallback, so a 300 response bypassed the fallback and hit `sys.exit()`. ## Changes ### download_github_files — smarter fallback chain Instead of one hard-coded `main.zip` fallback, we now try four candidate URLs in order until one returns HTTP 200: 1. /archive/{ref}.zip — short form, works for unambiguous refs 2. /archive/refs/heads/{ref}.zip — explicit branch (resolves HTTP 300) 3. /archive/refs/tags/{ref}.zip — explicit tag (resolves HTTP 300) 4. /archive/main.zip — legacy fallback for renamed branches Any non-200 status (300, 404, …) moves to the next candidate. If all four fail, we log an error and return None instead of calling sys.exit(). ### rate_limit_get — fix streaming socket leak The size-check before download was using `requests.get(..., stream=True)` to fetch just the Content-Length header. A streaming GET opens a full TCP connection and begins transferring the response body; if the stream is never read to completion and never explicitly closed, the underlying socket is held open (leaked). Replaced with `requests.head(...)` + `response.close()` — HEAD retrieves only headers without a body, which is exactly what is needed here. Also removed a stray `print()` debug statement. Fixes KnowledgeCaptureAndDiscovery#909 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ve fallback and socket leak fix TestIssue909ArchiveFallback: - HTTP 300 on short URL falls back to refs/heads/ (the balaje/icefem case) - HTTP 300 → refs/heads/ 404 → falls back to refs/tags/ - HTTP 404 on all ref URLs reaches the legacy main.zip fallback - All fallbacks failing returns None instead of calling sys.exit() - Size-limit None response stops the fallback loop immediately TestRateLimitGetHeadRequest: - Verifies requests.head() is now used (not streaming GET) for size check - Verifies HEAD response is explicitly closed even on size-exceeded early return Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…FutureWarnings
pandas 3.0 will remove support for inplace= on chained operations.
Three active call sites were still using the deprecated pattern:
df.dropna(subset=['Content'], inplace=True)
→ df = df.dropna(subset=['Content'])
df.drop(columns=['ParentGroup'], inplace=True)
→ df = df.drop(columns=['ParentGroup'])
valid.rename(columns={...}, inplace=True)
→ valid = valid.rename(columns={...})
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add None check in get_all_paginated_results before accessing response.status_code - Add None check for languages_raw before calling .json() in load_online_repository_metadata - Add timeout to requests.head() in rate_limit_get to prevent indefinite hangs - Remove leftover `import pdb` from extract_software_type.py - Replace debug print() calls in check_static_websites with logging.warning() These fixes address crashes when rate_limit_get returns None (e.g. size limit exceeded or network error) and eliminate stdout pollution from debug prints that could corrupt JSON output when SOMEF is called programmatically. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Collaborator
|
I have reviewed and some of the tests with mock responses are very appropriate. Other stuff was being incorporated in our other PR. Thanks a lot!! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #909
Root cause
https://github.com/balaje/icefem/archive/v2.0.zipreturns HTTP 300 (Multiple Choices) becausev2.0exists as both a branch and a tag. The previous code only handled 404, so 300 bypassed the fallback and hitsys.exit(), crashing the entire process.Changes
download_github_files— smart fallback chainFour candidate URLs tried in order until one returns HTTP 200:
/archive/{ref}.zip— short form (fails with 300 for ambiguous refs)/archive/refs/heads/{ref}.zip— unambiguous branch (fixes HTTP 300)/archive/refs/tags/{ref}.zip— unambiguous tag (fixes HTTP 300)/archive/main.zip— legacy branch-rename fallbackAny non-200 status (300, 404, …) moves to the next candidate. All failures return
Noneinstead of callingsys.exit().rate_limit_get— fix streaming socket leakThe size-check before download was using
requests.get(..., stream=True)to fetch only theContent-Lengthheader. A streaming GET holds the TCP socket open until the body is fully read or the response is explicitly closed — neither happened, leaking one socket per archive inspected. Replaced withrequests.head()+response.close().header_analysis.py— pandas FutureWarningsThree active
inplace=Truecalls replaced with reassignment to silence pandas 3.0 deprecation warnings that appear in every SOMEF run.Tests added
TestIssue909ArchiveFallback(5 tests): HTTP 300→refs/heads, 300→refs/tags, 404→main fallback, all-fail returns None, size-limit stops loopTestRateLimitGetHeadRequest(2 tests): HEAD used instead of streaming GET, HEAD response closed on size exceeded