Workaround: route NuGet RestoreTask to transient TaskHost in server or mt modes#13660
Conversation
Workaround for static singleton state issues in NuGet RestoreTask (e.g., PluginManager, EnvironmentWrapper) that persist across builds when running in sidecar TaskHost processes. When /mt mode or MSBuild server (MSBUILDUSESERVER=1) is active, RestoreTask is now forced to run in a transient (non-sidecar) TaskHost that terminates after execution, ensuring all static state is cleaned up. Changes: - TaskRouter: Add IsKnownProblematicTask() to identify tasks by full name - AssemblyTaskFactory: Force transient TaskHost for problematic tasks - Tests: Add unit and integration tests for the workaround Fixes dotnet#13315 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…alue of MSBUILDUSESERVER. Add logging for task host spawning details.
9793967 to
a1ce582
Compare
|
/review |
|
✅ Expert Code Review (command) completed successfully! |
There was a problem hiding this comment.
Pull request overview
Implements a workaround to isolate NuGet’s RestoreTask (which relies on process-wide static singletons) by forcing it onto a transient TaskHost when running under MSBuild Server mode or /mt, preventing static state from leaking across builds/invocations.
Changes:
- Add an internal “original server mode” env var (
_MSBUILDORIGINALUSESERVER) to preserve server-mode detection through environment snapshotting. - Route allow-listed “known problematic” tasks (currently
NuGet.Build.Tasks.RestoreTask) to a non-sidecar (transient) TaskHost in/mtor server mode. - Add TaskHost diagnostic logging and integration tests validating routing + per-invocation process isolation.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| src/Framework/Traits.cs | Adds _MSBUILDORIGINALUSESERVER env var name and trait for detecting server-mode launch context. |
| src/Build/BackEnd/Components/Communications/NodeLauncher.cs | Stashes/restores the original MSBUILDUSESERVER value into the new internal env var around child process creation. |
| src/Build/BackEnd/Node/OutOfProcServerNode.cs | Preserves the internal env var across SetEnvironment(...) so server node can still detect original server-mode intent. |
| src/Build/BackEnd/Components/RequestBuilder/TaskRouter.cs | Introduces allow-list logic to identify “known problematic” tasks by full type name. |
| src/Build/Instance/TaskFactories/AssemblyTaskFactory.cs | Forces problematic tasks to run in TaskHost and disables sidecar reuse in /mt or server mode. |
| src/Build/BackEnd/Components/Communications/NodeProviderOutOfProcTaskHost.cs | Extends host acquisition to report host PID / creation status for diagnostics. |
| src/Build/Instance/TaskFactories/TaskHostTask.cs | Logs per-invocation TaskHost details (PID, reuse, etc.) into the build log/binlog. |
| src/Build.UnitTests/BackEnd/TaskRouter_IntegrationTests.cs | Adds integration tests for problematic-task routing and “fresh process per invocation” behavior. |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Expert Review — 24-Dimension Analysis
Summary
| # | Dimension | Verdict |
|---|---|---|
| 1 | Backward Compatibility | ✅ LGTM (test-only) |
| 2 | ChangeWave Discipline | ✅ LGTM (test-only) |
| 3 | Performance | ✅ LGTM (test-only) |
| 4 | Allocation Awareness | ✅ LGTM (test-only) |
| 5 | Test Coverage | ✅ LGTM — covers MT mode, server mode, fresh-process guarantee, and no-workaround fallback |
| 6 | Error Message Quality | ✅ LGTM (test-only) |
| 7 | Logging Fidelity | ✅ LGTM (test-only) |
| 8 | String Comparison | ✅ LGTM — regex matches ASCII digits only, ShouldContain is ordinal |
| 9 | API Surface | ✅ LGTM (test-only) |
| 10 | Target Authoring | ✅ LGTM (test-only) |
| 11 | Cross-Platform | ✅ LGTM — uses Path.Combine, Process.GetCurrentProcess().Id, standard APIs |
| 12 | Code Simplification | ✅ LGTM — boilerplate duplication is acceptable for test readability |
| 13 | Concurrency | ✅ LGTM — xunit.runner.json enforces parallelizeTestCollections: false / maxParallelThreads: 1 |
| 14 | Naming Precision | ✅ LGTM — names are descriptive and consistent with existing tests |
| 15 | SDK Integration | ✅ LGTM (test-only) |
| 16 | Evaluation Model | ✅ LGTM (test-only) |
| 17 | Correctness | ✅ LGTM — fake RestoreTask FullName matches TaskRouter.IsKnownProblematicTask check; Build() overload is self-contained |
| 18 | Documentation Accuracy | 📝 NIT — see inline comment |
| 19 | Dependency Management | ✅ LGTM (test-only) |
| 20 | Scope Discipline | ✅ LGTM — tests + comment cleanup in the same PR is reasonable |
| 21 | Security | ✅ LGTM (test-only) |
| 22 | Build Infrastructure | ✅ LGTM (test-only) |
| 23 | Binary Log Compatibility | ✅ LGTM (test-only) |
| 24 | Error Handling | ✅ LGTM (test-only) |
Findings: 1 NIT
One nit on the server-mode test's EnableNodeReuse comment — // Load-bearing: see the /mt counterpart. is a fragile cross-reference that could become a dead pointer. Consider duplicating the 1-line explanation inline. Details in the inline comment.
The comment cleanup is well-executed overall: removed XML docs restated what the method name already says, // Arrange/// Act/// Assert markers were noise for these straightforward tests, and the condensed // Load-bearing: comments on the MT test preserve the non-obvious reasoning. The tests themselves are correct and well-structured.
Note
🔒 Integrity filter blocked 2 items
The following items were blocked because they don't meet the GitHub integrity level.
- #13660
pull_request_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved". - #13660
pull_request_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
To allow these resources, lower min-integrity in your GitHub frontmatter:
tools:
github:
min-integrity: approved # merged | approved | unapproved | noneGenerated by Expert Code Review (command) for issue #13660 · ● 9.7M
…OvesN/msbuild into workaround-restore-issues-in-mt
How this was validated end-to-end
Why this scenario reproduces the bugNuGet keeps the Azure Artifacts credential-provider plugin process alive across The procedure below uses two PATs against One-time setup1. Generate two PATs in the ADO portal — https://dev.azure.com/devdiv/_usersSettings/tokens
Procedure# 1. Open a fresh PowerShell window
$patA = '<paste repro-A>'
$patB = '<paste repro-B>'
# 2. Run the script — it sets the env to PAT-A, runs Restore #1, pauses
cd D:\msbuild-repros\RestoreServerRepro
.\Repro.ps1 `
-MSBuildEnv D:\msbuild\artifacts\msbuild-build-env.ps1 `
-PatPrimary $patA `
-PatSecondary $patB
# 3. Script pauses with "Press ENTER" prompt:
# - Go to https://dev.azure.com/devdiv/_usersSettings/tokens
# - Find 'repro-A', click Revoke, confirm
# - Press ENTER in the scriptRun this once against Results
What's in the script
|
There was a problem hiding this comment.
I think this might be overcomplicating things a bit. I think we could instead introduce an internal property on BuildParameters: there are already some of those. BuildParameters only crosses the process boundary between the main process and worker nodes, so the serialization versioning concerns I initially had also don’t apply.
…OvesN/msbuild into workaround-restore-issues-in-mt
|
/review |
|
@AR-May @JanProvaznik |
|
✅ Expert Code Review (command) completed successfully! Concurrency & Thread Safety — LGTM |
There was a problem hiding this comment.
Expert Code Review — PR #13660
Summary
The approach is sound: routing tasks with problematic static state to transient (non-sidecar) TaskHost processes in long-lived host scenarios. The implementation is well-tested and follows established MSBuild patterns.
Findings Table
| # | Dimension | Severity | Finding |
|---|---|---|---|
| 1 | Correctness & IPC (D22) | MAJOR | IsLongLivedHost not included in ITranslatable.Translate() — workaround may not trigger on OOP worker nodes in pure server mode (without MT) |
| 2 | Code Simplification (D12) | Moderate | catch (Exception) too broad in PID extraction — should be catch (InvalidOperationException) |
| 3 | Error Message Quality (D5) | Minor | Localization comment lists "TaskHost" as non-localizable field name but it's prose text |
| 4 | Error Message Quality (D5) | Minor | "ParentProcessId" terminology slightly confusing when logged from the parent |
| 5 | Design (D10) | Suggestion | Consider FrozenSet<string> for extensibility; file issue for [RequiresProcessIsolation] attribute |
Dimensions Evaluated as LGTM
- ✅ Backwards Compatibility (D1) — Both triggers are opt-in features (MT mode, server mode); no break for traditional builds
- ✅ Concurrency (D13) —
s_isLongLivedHostis write-once-before-use; local variables safely snapshot inside lock - ✅ Performance (D3) —
RequiresTransientTaskHostis O(1) string compare; process spawn overhead negligible vs restore I/O - ✅ API Surface (D8) — All new members are
internal; TaskRouter is internal class - ✅ Logging (D6) —
MessageImportance.Lowappropriate; properly captured in binary log - ✅ Test Coverage (D4) — Comprehensive: unit + integration, MT + server, positive + negative, fresh process verification
Recommendation
The IPC serialization gap (finding #1) is the most impactful issue. In pure server mode without multi-threaded enabled, the workaround won't protect OOP worker nodes. Consider either serializing the field or documenting this as a known limitation. The other findings are minor/moderate.
Note
🔒 Integrity filter blocked 2 items
The following items were blocked because they don't meet the GitHub integrity level.
- #13660
pull_request_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved". - #13660
pull_request_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
To allow these resources, lower min-integrity in your GitHub frontmatter:
tools:
github:
min-integrity: approved # merged | approved | unapproved | noneGenerated by Expert Code Review (command) for issue #13660 · ● 20.6M
Fixes #13315
Context
NuGet's RestoreTask holds static singletons (PluginManager, EnvironmentVariableWrapper) that assume
one invocation per process. Two MSBuild modes break that assumption:
builds back-to-back.
In both cases NuGet's static state survives past its intended scope.
This PR routes RestoreTask to a transient TaskHost (nodeReuse=false) when either mode is active, so
the spawned MSBuild.exe exits after Execute() and statics die with it
Changes Made
Why the original attempt didn't work
Original commit.
The server-mode trigger read
Traits.Instance.UseMSBuildServer, which checksMSBUILDUSESERVER. That env var is0/nullin the worker process where tasks run,stripped by:
NodeLauncher.DisableMSBuildServerzeroes it before spawning the Server child(recursion guard).
OutOfProcServerNode.HandleServerNodeBuildCommandoverwrites the server's env fromthe client's snapshot, which doesn't include MSBuild internals.
Net effect: the workaround branch never fired.
What this PR does
BuildParameters.IsLongLivedHostflag, defaulted from a process-wide staticset via
BuildParameters.MarkProcessAsLongLivedHost().OutOfProcServerNode.Run()callsthis once at startup so every per-build
BuildParametersinherits the flag.TaskRouter.RequiresTransientTaskHost(currently the single entryNuGet.Build.Tasks.RestoreTask).AssemblyTaskFactory.CreateTaskInstance: when the task is on the allow-list AND(
MultiThreaded == trueORIsLongLivedHost == true), forceuseSidecarTaskHost = falseso the spawned TaskHost is launched with
nodeReuse=falseand dies atEndBuild.Diagnostic logging added
Per-invocation TaskHost diagnostics in
TaskHostTask.Execute(low importance, capturedin binlog) — complements the existing
ExecutingTaskInTaskHostmessage. RecordsProcessId,ParentProcessId,NewNodeContext,IsSidecar,NodeReuseEffective.Useful when investigating any TaskHost-routing question.
Testing
Unit tests in
src/Build.UnitTests/BackEnd/TaskRouter_IntegrationTests.cs:Manual end-to-end with
dotnet restore App.csproj /bl:r.binlog:With
DOTNET_CLI_USE_MSBUILD_SERVER=1(server mode) and again with/mt, opened thebinlog and confirmed two
TaskHost details for task "RestoreTask"lines withIsSidecar=False,NodeReuseEffective=False, and differentProcessIdbetweenmultiple dotnet restore invocations.
End-to-end repro — manual two-PAT scenario against the private
DevDiv
vssdkNuGet feed, exercising NuGet's staticEnvironmentWrapper/PluginManagerover a single long-lived MSBuild Server process. Onmainthe second restore reproduces the bug (
401 Unauthorized); with this PRit succeeds. Details and steps in the PR comment below. Workaround: route NuGet RestoreTask to transient TaskHost in server or mt modes #13660 (comment)
Notes