[Repo Assist] fix(launcher): eliminate race condition in SessionConnectionPool.Get#2634
Conversation
The Get method used an unsafe read-lock upgrade pattern: RLock → RUnlock → Lock → update metadata → Unlock → RLock This created a race window between the manual RUnlock and subsequent Lock where another goroutine could call Delete on the same key. If that happened, Get would update and return a connection that had been removed from the pool (and potentially closed). Fix by acquiring a write lock for the full operation, which is the standard, correct pattern for read-then-update operations. The read-lock optimisation was premature here since every successful Get also performs three write operations on the metadata. Add TestConnectionPoolConcurrencyWithDeletes to exercise concurrent Get and Delete operations; this test is designed to be run with -race to detect the data race on unpatched code. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Fixes a concurrency bug in internal/launcher where SessionConnectionPool.Get used an unsafe RWMutex “upgrade” pattern that allowed Delete to interleave between read and write phases, causing races and inconsistent behavior.
Changes:
- Replace the
RLock → unlock → Lockpattern inSessionConnectionPool.Getwith a singleLockfor the full read+metadata-update operation. - Add a new concurrency test that interleaves
Get,Set, andDeleteto exercise the previously racy path (intended for validation withgo test -race).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| internal/launcher/connection_pool.go | Makes Get atomic w.r.t. map entry existence + metadata updates by holding the write lock for the whole operation. |
| internal/launcher/connection_pool_test.go | Adds a concurrent Get/Set/Delete test to help surface the prior race condition. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // No assertion needed — the goal is to detect races via -race flag. | ||
| // If the race condition is present, this test will fail non-deterministically | ||
| // under the Go race detector (go test -race). |
There was a problem hiding this comment.
This test is intended to catch the old race only under go test -race, but the repo’s workflows appear to run go test ./... without -race. As written it has no assertions, so it will always pass in normal CI and may not prevent regressions. Consider moving this test into a //go:build race-guarded *_test.go file (so it only runs when -race is enabled), or add at least one deterministic post-condition that can fail under normal go test if the bug reappears.
| // No assertion needed — the goal is to detect races via -race flag. | |
| // If the race condition is present, this test will fail non-deterministically | |
| // under the Go race detector (go test -race). | |
| // After all concurrent operations, at most one connection for this | |
| // backend/session pair should exist in the pool. | |
| assert.LessOrEqual(t, pool.Size(), 1) | |
| // This assertion provides a deterministic post-condition for normal | |
| // `go test`, while running with `-race` can still detect data races. |
| ctx := context.Background() | ||
| pool := NewSessionConnectionPool(ctx) | ||
|
|
||
| mockConn := &mcp.Connection{} | ||
| pool.Set("backend1", "session1", mockConn) | ||
|
|
There was a problem hiding this comment.
The pool starts a cleanup goroutine in NewSessionConnectionPool. This test doesn’t call pool.Stop(), so it leaves that goroutine running for the remainder of the package test run. Add defer pool.Stop() (or create the pool with a cancellable context and cancel it) to avoid goroutine leaks and reduce cross-test interference.
🤖 This is an automated pull request from Repo Assist, an AI assistant.
Summary
SessionConnectionPool.Getused an unsafe read-lock upgrade pattern that introduced a data race:Between the manual
RUnlockand subsequentLock, another goroutine could callDeleteon the same key. If that happened,Getwould update and return a connection that had already been removed from the pool (and potentially closed), violating the caller's assumption that the returned connection is live.The Go race detector (
go test -race) would flag this as a data race onmetadata.LastUsedAt,metadata.RequestCount, andmetadata.State.Root cause
The original code tried to optimise by using a read lock for the initial map lookup, upgrading to a write lock only for the three metadata field writes. This optimisation is not safe for
sync.RWMutexbecause Go'sRWMutexdoes not support lock upgrading — you must release the read lock before acquiring the write lock, creating an unavoidable race window.Fix
Replace the entire operation with a single write lock (
p.mu.Lock()/defer p.mu.Unlock()). This is the standard, correct pattern for read-then-update operations. Since every successfulGetalso performs three writes to the metadata struct, the read-lock optimisation provided minimal benefit and the lock contention difference is negligible.Changes
internal/launcher/connection_pool.go:Getnow usesp.mu.Lock()/defer p.mu.Unlock()throughout, removing the manual lock upgrade.internal/launcher/connection_pool_test.go: AddedTestConnectionPoolConcurrencyWithDeleteswhich runs concurrentGet,Set, andDeleteoperations to exercise the previously racy path. Run withgo test -raceto confirm correctness.Trade-offs
The write lock means concurrent reads from different sessions are serialised during
Get. In practice this is unlikely to be a bottleneck — the lock is held only for a map lookup and three field writes (nanoseconds), and the actual MCP backend communication happens outside the lock.Test Status
Closes N/A (proactive bug fix — no linked issue)
Warning
The following domain was blocked by the firewall during workflow execution:
proxy.golang.orgTo allow these domains, add them to the
network.allowedlist in your workflow frontmatter:See Network Configuration for more information.
Note
🔒 Integrity filter blocked 14 items
The following items were blocked because they don't meet the GitHub integrity level.
list_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "merged".list_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "merged".list_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "merged".list_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "merged".list_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "merged".list_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "merged".list_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "merged".list_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "merged".list_pull_requests: has lower integrity than agent requires. The agent cannot read data with integrity below "merged".list_pull_requests: has lower integrity than agent requires. The agent cannot read data with integrity below "merged".list_pull_requests: has lower integrity than agent requires. The agent cannot read data with integrity below "merged".search_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "merged".issue_read: has lower integrity than agent requires. The agent cannot read data with integrity below "merged".search_pull_requests: has lower integrity than agent requires. The agent cannot read data with integrity below "merged".To allow these resources, lower
min-integrityin your GitHub frontmatter: