Skip to content

fix: smart archive fallback for HTTP 300 + fix socket leak + pandas FutureWarnings#910

Open
priya-gitTest wants to merge 5 commits intoKnowledgeCaptureAndDiscovery:masterfrom
priya-gitTest:fix/909-graceful-archive-download-failure
Open

fix: smart archive fallback for HTTP 300 + fix socket leak + pandas FutureWarnings#910
priya-gitTest wants to merge 5 commits intoKnowledgeCaptureAndDiscovery:masterfrom
priya-gitTest:fix/909-graceful-archive-download-failure

Conversation

@priya-gitTest
Copy link

Fixes #909

Root cause

https://github.com/balaje/icefem/archive/v2.0.zip returns HTTP 300 (Multiple Choices) because v2.0 exists as both a branch and a tag. The previous code only handled 404, so 300 bypassed the fallback and hit sys.exit(), crashing the entire process.

Changes

download_github_files — smart fallback chain

Four candidate URLs tried in order until one returns HTTP 200:

  1. /archive/{ref}.zip — short form (fails with 300 for ambiguous refs)
  2. /archive/refs/heads/{ref}.zip — unambiguous branch (fixes HTTP 300)
  3. /archive/refs/tags/{ref}.zip — unambiguous tag (fixes HTTP 300)
  4. /archive/main.zip — legacy branch-rename fallback

Any non-200 status (300, 404, …) moves to the next candidate. All failures return None instead of calling sys.exit().

rate_limit_get — fix streaming socket leak

The size-check before download was using requests.get(..., stream=True) to fetch only the Content-Length header. A streaming GET holds the TCP socket open until the body is fully read or the response is explicitly closed — neither happened, leaking one socket per archive inspected. Replaced with requests.head() + response.close().

header_analysis.py — pandas FutureWarnings

Three active inplace=True calls replaced with reassignment to silence pandas 3.0 deprecation warnings that appear in every SOMEF run.

Tests added

  • TestIssue909ArchiveFallback (5 tests): HTTP 300→refs/heads, 300→refs/tags, 404→main fallback, all-fail returns None, size-limit stops loop
  • TestRateLimitGetHeadRequest (2 tests): HEAD used instead of streaming GET, HEAD response closed on size exceeded

priya-gitTest and others added 5 commits March 5, 2026 20:48
….exit()

When neither the primary archive URL nor the main.zip fallback returns HTTP 200,
download_github_files() was calling sys.exit() which crashed the entire SOMEF process.
Replace with logging.error() + return None, consistent with how download_readme() handles
the same situation. Fixes KnowledgeCaptureAndDiscovery#909.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…mit_get

## Root cause of issue KnowledgeCaptureAndDiscovery#909 (balaje/icefem)

GitHub's short-form archive URL `/archive/{ref}.zip` returns HTTP 300
(Multiple Choices) when the ref name is ambiguous — i.e. when a branch and
a tag share the same name (e.g. default branch `v2.0` and a tag also named
`v2.0`).  The previous code only handled HTTP 404 with a single `main.zip`
fallback, so a 300 response bypassed the fallback and hit `sys.exit()`.

## Changes

### download_github_files — smarter fallback chain
Instead of one hard-coded `main.zip` fallback, we now try four candidate
URLs in order until one returns HTTP 200:
  1. /archive/{ref}.zip              — short form, works for unambiguous refs
  2. /archive/refs/heads/{ref}.zip   — explicit branch (resolves HTTP 300)
  3. /archive/refs/tags/{ref}.zip    — explicit tag    (resolves HTTP 300)
  4. /archive/main.zip               — legacy fallback for renamed branches

Any non-200 status (300, 404, …) moves to the next candidate.  If all four
fail, we log an error and return None instead of calling sys.exit().

### rate_limit_get — fix streaming socket leak
The size-check before download was using `requests.get(..., stream=True)` to
fetch just the Content-Length header.  A streaming GET opens a full TCP
connection and begins transferring the response body; if the stream is never
read to completion and never explicitly closed, the underlying socket is held
open (leaked).  Replaced with `requests.head(...)` + `response.close()` —
HEAD retrieves only headers without a body, which is exactly what is needed
here.  Also removed a stray `print()` debug statement.

Fixes KnowledgeCaptureAndDiscovery#909

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ve fallback and socket leak fix

TestIssue909ArchiveFallback:
- HTTP 300 on short URL falls back to refs/heads/ (the balaje/icefem case)
- HTTP 300 → refs/heads/ 404 → falls back to refs/tags/
- HTTP 404 on all ref URLs reaches the legacy main.zip fallback
- All fallbacks failing returns None instead of calling sys.exit()
- Size-limit None response stops the fallback loop immediately

TestRateLimitGetHeadRequest:
- Verifies requests.head() is now used (not streaming GET) for size check
- Verifies HEAD response is explicitly closed even on size-exceeded early return

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…FutureWarnings

pandas 3.0 will remove support for inplace= on chained operations.
Three active call sites were still using the deprecated pattern:

  df.dropna(subset=['Content'], inplace=True)
      → df = df.dropna(subset=['Content'])

  df.drop(columns=['ParentGroup'], inplace=True)
      → df = df.drop(columns=['ParentGroup'])

  valid.rename(columns={...}, inplace=True)
      → valid = valid.rename(columns={...})

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add None check in get_all_paginated_results before accessing response.status_code
- Add None check for languages_raw before calling .json() in load_online_repository_metadata
- Add timeout to requests.head() in rate_limit_get to prevent indefinite hangs
- Remove leftover `import pdb` from extract_software_type.py
- Replace debug print() calls in check_static_websites with logging.warning()

These fixes address crashes when rate_limit_get returns None (e.g. size limit
exceeded or network error) and eliminate stdout pollution from debug prints
that could corrupt JSON output when SOMEF is called programmatically.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@dgarijo
Copy link
Collaborator

dgarijo commented Mar 6, 2026

I have reviewed and some of the tests with mock responses are very appropriate. Other stuff was being incorporated in our other PR. Thanks a lot!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Prpoblematic github link when unable to access archive of repo

2 participants