Skip to content

[Repo Assist] fix(launcher): eliminate race condition in SessionConnectionPool.Get#2634

Merged
lpcox merged 1 commit into
mainfrom
repo-assist/fix-connection-pool-race-2026-03-27-26ab7396e3bcb915
Mar 27, 2026
Merged

[Repo Assist] fix(launcher): eliminate race condition in SessionConnectionPool.Get#2634
lpcox merged 1 commit into
mainfrom
repo-assist/fix-connection-pool-race-2026-03-27-26ab7396e3bcb915

Conversation

@github-actions

Copy link
Copy Markdown
Contributor

🤖 This is an automated pull request from Repo Assist, an AI assistant.

Summary

SessionConnectionPool.Get used an unsafe read-lock upgrade pattern that introduced a data race:

RLock → read metadata → RUnlock → Lock → write metadata → Unlock → RLock

Between the manual RUnlock and subsequent Lock, another goroutine could call Delete on the same key. If that happened, Get would update and return a connection that had already been removed from the pool (and potentially closed), violating the caller's assumption that the returned connection is live.

The Go race detector (go test -race) would flag this as a data race on metadata.LastUsedAt, metadata.RequestCount, and metadata.State.

Root cause

The original code tried to optimise by using a read lock for the initial map lookup, upgrading to a write lock only for the three metadata field writes. This optimisation is not safe for sync.RWMutex because Go's RWMutex does not support lock upgrading — you must release the read lock before acquiring the write lock, creating an unavoidable race window.

Fix

Replace the entire operation with a single write lock (p.mu.Lock() / defer p.mu.Unlock()). This is the standard, correct pattern for read-then-update operations. Since every successful Get also performs three writes to the metadata struct, the read-lock optimisation provided minimal benefit and the lock contention difference is negligible.

Changes

  • internal/launcher/connection_pool.go: Get now uses p.mu.Lock() / defer p.mu.Unlock() throughout, removing the manual lock upgrade.
  • internal/launcher/connection_pool_test.go: Added TestConnectionPoolConcurrencyWithDeletes which runs concurrent Get, Set, and Delete operations to exercise the previously racy path. Run with go test -race to confirm correctness.

Trade-offs

The write lock means concurrent reads from different sessions are serialised during Get. In practice this is unlikely to be a bottleneck — the lock is held only for a map lookup and three field writes (nanoseconds), and the actual MCP backend communication happens outside the lock.

Test Status

⚠️ Infrastructure note: the workflow environment cannot download the Go 1.25.0 toolchain (network access to proxy.golang.org is blocked). Tests could not be executed locally. The code change is mechanically correct (removing the manual lock dance, keeping identical logic), and the existing test suite is expected to pass in CI where the toolchain is available.

Closes N/A (proactive bug fix — no linked issue)

Warning

⚠️ Firewall blocked 1 domain

The following domain was blocked by the firewall during workflow execution:

  • proxy.golang.org

To allow these domains, add them to the network.allowed list in your workflow frontmatter:

network:
  allowed:
    - defaults
    - "proxy.golang.org"

See Network Configuration for more information.

Note

🔒 Integrity filter blocked 14 items

The following items were blocked because they don't meet the GitHub integrity level.

To allow these resources, lower min-integrity in your GitHub frontmatter:

tools:
  github:
    min-integrity: approved  # merged | approved | unapproved | none

Generated by Repo Assist ·

To install this agentic workflow, run

gh aw add githubnext/agentics/workflows/repo-assist.md@851905c06e905bf362a9f6cc54f912e3df747d55

The Get method used an unsafe read-lock upgrade pattern:
  RLock → RUnlock → Lock → update metadata → Unlock → RLock

This created a race window between the manual RUnlock and subsequent
Lock where another goroutine could call Delete on the same key. If that
happened, Get would update and return a connection that had been removed
from the pool (and potentially closed).

Fix by acquiring a write lock for the full operation, which is the
standard, correct pattern for read-then-update operations. The read-lock
optimisation was premature here since every successful Get also performs
three write operations on the metadata.

Add TestConnectionPoolConcurrencyWithDeletes to exercise concurrent Get
and Delete operations; this test is designed to be run with -race to
detect the data race on unpatched code.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@lpcox lpcox marked this pull request as ready for review March 27, 2026 13:55
Copilot AI review requested due to automatic review settings March 27, 2026 13:55
@lpcox lpcox merged commit ccbb875 into main Mar 27, 2026
3 checks passed
@lpcox lpcox deleted the repo-assist/fix-connection-pool-race-2026-03-27-26ab7396e3bcb915 branch March 27, 2026 13:55

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a concurrency bug in internal/launcher where SessionConnectionPool.Get used an unsafe RWMutex “upgrade” pattern that allowed Delete to interleave between read and write phases, causing races and inconsistent behavior.

Changes:

  • Replace the RLock → unlock → Lock pattern in SessionConnectionPool.Get with a single Lock for the full read+metadata-update operation.
  • Add a new concurrency test that interleaves Get, Set, and Delete to exercise the previously racy path (intended for validation with go test -race).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
internal/launcher/connection_pool.go Makes Get atomic w.r.t. map entry existence + metadata updates by holding the write lock for the whole operation.
internal/launcher/connection_pool_test.go Adds a concurrent Get/Set/Delete test to help surface the prior race condition.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +282 to +284
// No assertion needed — the goal is to detect races via -race flag.
// If the race condition is present, this test will fail non-deterministically
// under the Go race detector (go test -race).

Copilot AI Mar 27, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is intended to catch the old race only under go test -race, but the repo’s workflows appear to run go test ./... without -race. As written it has no assertions, so it will always pass in normal CI and may not prevent regressions. Consider moving this test into a //go:build race-guarded *_test.go file (so it only runs when -race is enabled), or add at least one deterministic post-condition that can fail under normal go test if the bug reappears.

Suggested change
// No assertion needed — the goal is to detect races via -race flag.
// If the race condition is present, this test will fail non-deterministically
// under the Go race detector (go test -race).
// After all concurrent operations, at most one connection for this
// backend/session pair should exist in the pool.
assert.LessOrEqual(t, pool.Size(), 1)
// This assertion provides a deterministic post-condition for normal
// `go test`, while running with `-race` can still detect data races.

Copilot uses AI. Check for mistakes.
Comment on lines +250 to +255
ctx := context.Background()
pool := NewSessionConnectionPool(ctx)

mockConn := &mcp.Connection{}
pool.Set("backend1", "session1", mockConn)

Copilot AI Mar 27, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pool starts a cleanup goroutine in NewSessionConnectionPool. This test doesn’t call pool.Stop(), so it leaves that goroutine running for the remainder of the package test run. Add defer pool.Stop() (or create the pool with a cancellable context and cancel it) to avoid goroutine leaks and reduce cross-test interference.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants