Fix late selective_channel retry after EndRPC #3359

Open
hjwsm1989 wants to merge 1 commit into apache:master from hjwsm1989:codex/selective-channel-late-subdone-fix
@hjwsm1989

Contributor

What problem does this PR solve?

Issue Number: #3358

Problem Summary:
This fixes a selective_channel race where a late SubDone::Run() may re-enter
retry/backup after the main RPC has already entered EndRPC().

The crash report in #3358 shows that a late sub-call callback can still flow into:

  • Controller::OnVersionedRPCReturned()
  • Controller::IssueRPC()
  • schan::Sender::IssueRPC()

after the main controller has already started tearing down its state.

That leaves selective_channel vulnerable to retrying on partially torn-down
state, including the previously observed null balancer path.

What is changed and the side effects?

Changed:

  • mark the controller as "ending RPC" at the beginning of EndRPC()
  • ignore late SubDone callbacks once the main RPC is already ending
  • keep a defensive null check in schan::Sender::IssueRPC()

This keeps the retry/backup state machine from re-entering after teardown has
started, and also preserves a hard guard at the selective_channel boundary.
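
The guard described above can be sketched in isolation. This is a minimal single-threaded illustration, not the actual patch: `ToyController`, `OnSubCallReturned`, and `kFlagEndingRpc` are hypothetical names, and using `std::atomic` for the flag word is an assumption made only for this sketch (the real controller may rely on its own locking instead).

```cpp
#include <atomic>
#include <cstdint>

// Toy stand-in for a controller; NOT brpc's real Controller class.
class ToyController {
public:
    // Mark the RPC as ending before any teardown happens, mirroring
    // "mark the controller as ending RPC at the beginning of EndRPC()".
    void EndRPC() {
        flags_.fetch_or(kFlagEndingRpc, std::memory_order_acq_rel);
        // Teardown of per-RPC state would follow here.
    }

    bool is_ending_rpc() const {
        return (flags_.load(std::memory_order_acquire) & kFlagEndingRpc) != 0;
    }

    // Hypothetical hook invoked when a sub-call completes.
    // Returns true if a retry was issued, false if nothing was done.
    bool OnSubCallReturned(bool sub_call_failed) {
        if (is_ending_rpc()) {
            // Late callback: the main RPC is already ending, so do not
            // re-enter the retry/backup state machine.
            ++ignored_late_callbacks_;
            return false;
        }
        if (sub_call_failed) {
            ++retries_issued_;
            return true;
        }
        return false;
    }

    int retries_issued() const { return retries_issued_; }
    int ignored_late_callbacks() const { return ignored_late_callbacks_; }

private:
    static constexpr std::uint32_t kFlagEndingRpc = 1u << 0;  // hypothetical bit
    std::atomic<std::uint32_t> flags_{0};
    int retries_issued_ = 0;
    int ignored_late_callbacks_ = 0;
};
```

In this sketch a failed sub-call that returns before `EndRPC()` triggers a retry, while the same callback arriving after `EndRPC()` is only counted and dropped.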

Test:
Added a regression test that:

  • uses SelectiveChannel
  • enables backup request and retry
  • lets the main RPC time out first
  • lets delayed sub-calls finish later

This reproduces the late callback window and verifies it no longer re-enters
retry/backup after timeout.

Also re-ran related selective/backup request tests.

Side effects:

  • Performance effects:

  • Breaking backward compatibility:


Check List:

hjwsm1989 force-pushed the codex/selective-channel-late-subdone-fix branch from c2c7d03 to e27a7e9 on June 25, 2026 06:01
@hjwsm1989

Contributor Author

@chenBright Hello, could you please help review this?

        _main_cntl->SetFailed(ECANCELED,
                              "SelectiveChannel balancer is unavailable");
        return -1;
    }

Contributor

The problem does not seem to be resolved. The process can still crash if main_cntl deletes the balancer after the null check in the if statement has passed.

Contributor Author

@chenBright Thanks for the review. The previous null check was not enough because it did not keep the balancer alive after _lb.get().

I updated the patch so Sender holds its own intrusive_ptr<SharedLoadBalancer> captured from SelectiveChannel at creation time, and IssueRPC() uses that retained reference. This prevents main_cntl->_lb.reset() in EndRPC() from freeing the balancer while IssueRPC() is still using it.

The FLAGS_ENDING_RPC guard is kept to prevent late SubDone from re-entering retry / backup after the main RPC starts ending.
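
The lifetime argument above can be illustrated with a small sketch. It uses `std::shared_ptr` as a stand-in for the intrusive pointer mentioned in the comment, and `ToyBalancer`, `ToyMainController`, and `ToySender` are hypothetical types invented for this example, so it shows the ownership idea rather than the patch itself.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical balancer; the real one is a SharedLoadBalancer.
struct ToyBalancer {
    std::string SelectServer() const { return "sub-channel-0"; }
};

// Hypothetical main controller that drops its balancer when the RPC ends.
struct ToyMainController {
    std::shared_ptr<ToyBalancer> lb;
    void EndRPC() { lb.reset(); }
};

// The sender keeps its own reference, captured at creation time, so a
// reset on the main controller cannot free the balancer underneath it.
class ToySender {
public:
    explicit ToySender(std::shared_ptr<ToyBalancer> lb) : lb_(std::move(lb)) {}

    // Returns 0 and fills *selected on success, -1 if no balancer is held.
    int IssueRPC(std::string* selected) const {
        if (!lb_) {
            return -1;  // defensive null check
        }
        *selected = lb_->SelectServer();
        return 0;
    }

private:
    std::shared_ptr<ToyBalancer> lb_;
};
```

With a raw pointer obtained from `lb.get()`, the call in `IssueRPC()` would race with `EndRPC()`; with the retained reference, the balancer is destroyed only after both the controller and the sender have released it.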

hjwsm1989 force-pushed the codex/selective-channel-late-subdone-fix branch from e27a7e9 to 531f2d7 on June 27, 2026 09:41
chenBright requested a review from Copilot on June 27, 2026 09:53

Copilot AI left a comment

Pull request overview

This PR addresses a SelectiveChannel race where late SubDone::Run() callbacks could re-enter retry/backup logic after the main RPC has entered Controller::EndRPC(), potentially operating on partially torn-down controller state (including a previously observed null balancer path).

Changes:

  • Introduces an “ending RPC” controller flag set at the beginning of Controller::EndRPC() and ignores late OnVersionedRPCReturned() callbacks once teardown has begun.
  • Makes schan::Sender hold its own reference to the load balancer and adds a defensive null check before selecting a sub-channel.
  • Adds a regression test that forces the main RPC to time out while delayed sub-calls finish later, validating late-callback behavior under retry/backup configuration.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| test/brpc_channel_unittest.cpp | Adds a regression test exercising late sub-call completion after main timeout with retry + backup enabled. |
| src/brpc/selective_channel.cpp | Passes the balancer into schan::Sender and adds a defensive null check before using it. |
| src/brpc/controller.h | Adds FLAGS_ENDING_RPC and a helper accessor to expose “ending RPC” state. |
| src/brpc/controller.cpp | Sets the “ending RPC” flag early in EndRPC() and short-circuits late callbacks in OnVersionedRPCReturned(). |

Comment on lines 580 to +584
  if (!initialized()) {
      cntl->SetFailed(EINVAL, "SelectiveChannel=%p is not initialized yet",
                      this);
  }
- schan::Sender* sndr = new schan::Sender(cntl, request, response, user_done);
+ schan::Sender* sndr =

Contributor Author

Addressed. SelectiveChannel::CallMethod() now returns immediately when the channel is not initialized, and runs done in-place if provided, matching the behavior of PartitionChannel.

This avoids allocating Sender or dispatching into the inner channel with an unavailable balancer. I also added a regression test covering both sync and async uninitialized calls.
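
The early-return behavior described above might look roughly like the following. This is a sketch under stated assumptions: `ToyCntl` and `ToySelectiveChannel` are hypothetical, `std::function` stands in for the protobuf closure type, and the counter exists only so that the example can show that no sender is allocated on the uninitialized path.

```cpp
#include <cerrno>
#include <functional>
#include <string>

// Hypothetical controller that only records a failure.
struct ToyCntl {
    int error_code = 0;
    std::string error_text;
    void SetFailed(int code, const std::string& text) {
        error_code = code;
        error_text = text;
    }
};

// Hypothetical channel; NOT brpc's SelectiveChannel.
class ToySelectiveChannel {
public:
    void Init() { initialized_ = true; }

    // `done` may be empty, which models a synchronous call.
    void CallMethod(ToyCntl* cntl, const std::function<void()>& done) {
        if (!initialized_) {
            cntl->SetFailed(EINVAL, "channel is not initialized yet");
            if (done) {
                done();  // run the callback in place for async callers
            }
            return;  // never allocate a sender or touch the balancer
        }
        ++senders_created_;  // stands in for creating a sender and issuing the RPC
        if (done) {
            done();
        }
    }

    int senders_created() const { return senders_created_; }

private:
    bool initialized_ = false;
    int senders_created_ = 0;
};
```

Both the sync case (empty `done`) and the async case end with `EINVAL` on the controller and no sender created when `Init()` has not been called.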

Ignore late selective_channel SubDone callbacks once the main RPC enters EndRPC, and let Sender hold the selective balancer while issuing sub-RPCs.

Return early when SelectiveChannel is called before Init, preserving the intended EINVAL result.

Add regression tests covering timeout plus delayed sub-call completion and uninitialized SelectiveChannel calls.
hjwsm1989 force-pushed the codex/selective-channel-late-subdone-fix branch from 531f2d7 to 4217a8a on June 27, 2026 10:15
3 participants