Workaround: route NuGet RestoreTask to transient TaskHost in server or mt modes#13660

Merged
OvesN merged 16 commits into dotnet:main from OvesN:workaround-restore-issues-in-mt on May 14, 2026

@OvesN OvesN commented Apr 30, 2026

Fixes #13315

Context

NuGet's RestoreTask holds static singletons (PluginManager, EnvironmentVariableWrapper) that assume
one invocation per process. Two MSBuild modes break that assumption:

  • MSBuild Server (DOTNET_CLI_USE_MSBUILD_SERVER=1 / MSBUILDUSESERVER=1): one process services many
    builds back-to-back.
  • Multi-threaded MSBuild (/mt): RestoreTask has not been migrated, so it is already routed to a
    TaskHost for thread safety. That TaskHost is a long-lived sidecar reused for every invocation
    in the build, so statics leak between projects.

In both cases NuGet's static state survives past its intended scope.

This PR routes RestoreTask to a transient TaskHost (nodeReuse=false) when either mode is active, so
the spawned MSBuild.exe exits after Execute() and the statics die with it.

Changes Made

Why the original attempt didn't work

Original commit.
The server-mode trigger read Traits.Instance.UseMSBuildServer, which checks
MSBUILDUSESERVER. In the worker process where tasks run, that env var is 0 or unset
because two code paths strip it:

  1. NodeLauncher.DisableMSBuildServer zeroes it before spawning the Server child
    (recursion guard).
  2. OutOfProcServerNode.HandleServerNodeBuildCommand overwrites the server's env from
    the client's snapshot, which doesn't include MSBuild internals.

Net effect: the workaround branch never fired.
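
To make the two stripping steps concrete, here is a minimal Python model of the environment flow. It is a sketch, not MSBuild code: the function names only mirror the steps listed above, and the rule that the client snapshot drops every MSBUILD-prefixed variable is an assumption made for illustration.

```python
def disable_msbuild_server(env):
    """Step 1 (recursion guard): zero the flag before the server child is spawned."""
    child_env = dict(env)
    child_env["MSBUILDUSESERVER"] = "0"
    return child_env


def apply_client_snapshot(server_env, client_snapshot):
    """Step 2: the server's environment is overwritten by the client's snapshot."""
    return dict(client_snapshot)


def use_msbuild_server(env):
    """Stand-in for the trait check that reads MSBUILDUSESERVER."""
    return env.get("MSBUILDUSESERVER") == "1"


client_env = {"MSBUILDUSESERVER": "1", "PATH": "/usr/bin"}
server_env = disable_msbuild_server(client_env)

# Assumption: the snapshot omits MSBuild-internal variables.
snapshot = {k: v for k, v in client_env.items() if not k.startswith("MSBUILD")}
worker_env = apply_client_snapshot(server_env, snapshot)

print(use_msbuild_server(client_env), use_msbuild_server(worker_env))
```

In this model the flag is visible only in the client's environment, which matches the observation that the workaround branch never fired in the worker.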

What this PR does

  • New internal BuildParameters.IsLongLivedHost flag, defaulted from a process-wide static
    set via BuildParameters.MarkProcessAsLongLivedHost(). OutOfProcServerNode.Run() calls
    this once at startup so every per-build BuildParameters inherits the flag.
  • Allow-list TaskRouter.RequiresTransientTaskHost (currently the single entry
    NuGet.Build.Tasks.RestoreTask).
  • In AssemblyTaskFactory.CreateTaskInstance: when the task is on the allow-list AND
    (MultiThreaded == true OR IsLongLivedHost == true), force useSidecarTaskHost = false
    so the spawned TaskHost is launched with nodeReuse=false and dies at EndBuild.
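
The routing rule above can be modelled in a few lines. This is a Python sketch of the decision only, not the C# in AssemblyTaskFactory; BuildParams, use_sidecar_task_host, and the default_sidecar argument are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

# Allow-list with the single entry named in the PR description.
TRANSIENT_TASKHOST_TASKS = frozenset({"NuGet.Build.Tasks.RestoreTask"})


@dataclass
class BuildParams:
    """Hypothetical stand-in for the two BuildParameters flags involved."""
    multi_threaded: bool = False
    is_long_lived_host: bool = False


def requires_transient_task_host(task_full_name):
    return task_full_name in TRANSIENT_TASKHOST_TASKS


def use_sidecar_task_host(task_full_name, params, default_sidecar):
    """Return False (transient TaskHost) for allow-listed tasks in /mt or a long-lived host."""
    if requires_transient_task_host(task_full_name) and (
        params.multi_threaded or params.is_long_lived_host
    ):
        return False
    return default_sidecar


restore = "NuGet.Build.Tasks.RestoreTask"
print(use_sidecar_task_host(restore, BuildParams(is_long_lived_host=True), True))
```

Tasks that are not on the allow-list, and RestoreTask outside both modes, keep whatever sidecar decision they already had.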

Diagnostic logging added

Per-invocation TaskHost diagnostics in TaskHostTask.Execute (low importance, captured
in binlog) — complements the existing ExecutingTaskInTaskHost message. Records
ProcessId, ParentProcessId, NewNodeContext, IsSidecar, NodeReuseEffective.
Useful when investigating any TaskHost-routing question.

Testing

Unit tests in src/Build.UnitTests/BackEnd/TaskRouter_IntegrationTests.cs:

Manual end-to-end with dotnet restore App.csproj /bl:r.binlog:

  • With DOTNET_CLI_USE_MSBUILD_SERVER=1 (server mode) and again with /mt, opened the
    binlog and confirmed two TaskHost details for task "RestoreTask" lines with
    IsSidecar=False, NodeReuseEffective=False, and different ProcessId between
    multiple dotnet restore invocations.

  • End-to-end repro — manual two-PAT scenario against the private
    DevDiv vssdk NuGet feed, exercising NuGet's static EnvironmentWrapper /
    PluginManager over a single long-lived MSBuild Server process. On main
    the second restore reproduces the bug (401 Unauthorized); with this PR
    it succeeds. Details and steps are in the PR comment below: #13660 (comment)
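
The manual binlog check in the first bullet amounts to two conditions over the logged diagnostics. A small Python sketch of that check follows, assuming the TaskHost details have already been extracted into one dict per RestoreTask invocation; the field names come from the diagnostics described above, while the extraction step and the sample values are hypothetical.

```python
def restore_is_isolated(records):
    """True when every invocation ran in a non-reused TaskHost with its own process."""
    pids = [r["ProcessId"] for r in records]
    no_reuse = all(
        not r["IsSidecar"] and not r["NodeReuseEffective"] for r in records
    )
    return no_reuse and len(set(pids)) == len(pids)


# Made-up sample values for two restore invocations.
isolated = [
    {"ProcessId": 101, "IsSidecar": False, "NodeReuseEffective": False},
    {"ProcessId": 202, "IsSidecar": False, "NodeReuseEffective": False},
]
reused = [
    {"ProcessId": 101, "IsSidecar": True, "NodeReuseEffective": True},
    {"ProcessId": 101, "IsSidecar": True, "NodeReuseEffective": True},
]
print(restore_is_isolated(isolated), restore_is_isolated(reused))
```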

Notes

AR-May and others added 2 commits April 22, 2026 10:52
Workaround for static singleton state issues in NuGet RestoreTask
(e.g., PluginManager, EnvironmentWrapper) that persist across builds
when running in sidecar TaskHost processes.

When /mt mode or MSBuild server (MSBUILDUSESERVER=1) is active,
RestoreTask is now forced to run in a transient (non-sidecar) TaskHost
that terminates after execution, ensuring all static state is cleaned up.

Changes:
- TaskRouter: Add IsKnownProblematicTask() to identify tasks by full name
- AssemblyTaskFactory: Force transient TaskHost for problematic tasks
- Tests: Add unit and integration tests for the workaround

Fixes dotnet#13315

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…alue of MSBUILDUSESERVER. Add logging for task host spawning details.
@OvesN OvesN force-pushed the workaround-restore-issues-in-mt branch from 9793967 to a1ce582 Compare April 30, 2026 13:08
@OvesN OvesN changed the title Workaround restore issues in mt Workaround: route NuGet RestoreTask to transient TaskHost in server or mt modes Apr 30, 2026
@OvesN OvesN marked this pull request as ready for review April 30, 2026 13:53
Copilot AI review requested due to automatic review settings April 30, 2026 13:53

OvesN commented Apr 30, 2026

/review

github-actions Bot commented Apr 30, 2026

Expert Code Review (command) completed successfully!

Copilot AI left a comment

Pull request overview

Implements a workaround to isolate NuGet’s RestoreTask (which relies on process-wide static singletons) by forcing it onto a transient TaskHost when running under MSBuild Server mode or /mt, preventing static state from leaking across builds/invocations.

Changes:

  • Add an internal “original server mode” env var (_MSBUILDORIGINALUSESERVER) to preserve server-mode detection through environment snapshotting.
  • Route allow-listed “known problematic” tasks (currently NuGet.Build.Tasks.RestoreTask) to a non-sidecar (transient) TaskHost in /mt or server mode.
  • Add TaskHost diagnostic logging and integration tests validating routing + per-invocation process isolation.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| src/Framework/Traits.cs | Adds _MSBUILDORIGINALUSESERVER env var name and trait for detecting server-mode launch context. |
| src/Build/BackEnd/Components/Communications/NodeLauncher.cs | Stashes/restores the original MSBUILDUSESERVER value into the new internal env var around child process creation. |
| src/Build/BackEnd/Node/OutOfProcServerNode.cs | Preserves the internal env var across SetEnvironment(...) so server node can still detect original server-mode intent. |
| src/Build/BackEnd/Components/RequestBuilder/TaskRouter.cs | Introduces allow-list logic to identify “known problematic” tasks by full type name. |
| src/Build/Instance/TaskFactories/AssemblyTaskFactory.cs | Forces problematic tasks to run in TaskHost and disables sidecar reuse in /mt or server mode. |
| src/Build/BackEnd/Components/Communications/NodeProviderOutOfProcTaskHost.cs | Extends host acquisition to report host PID / creation status for diagnostics. |
| src/Build/Instance/TaskFactories/TaskHostTask.cs | Logs per-invocation TaskHost details (PID, reuse, etc.) into the build log/binlog. |
| src/Build.UnitTests/BackEnd/TaskRouter_IntegrationTests.cs | Adds integration tests for problematic-task routing and “fresh process per invocation” behavior. |

Comment thread src/Build.UnitTests/BackEnd/TaskRouter_IntegrationTests.cs
Comment thread src/Framework/Traits.cs Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

@github-actions github-actions Bot left a comment

Expert Review — 24-Dimension Analysis

Summary

| # | Dimension | Verdict |
|---|---|---|
| 1 | Backward Compatibility | ✅ LGTM (test-only) |
| 2 | ChangeWave Discipline | ✅ LGTM (test-only) |
| 3 | Performance | ✅ LGTM (test-only) |
| 4 | Allocation Awareness | ✅ LGTM (test-only) |
| 5 | Test Coverage | ✅ LGTM: covers MT mode, server mode, fresh-process guarantee, and no-workaround fallback |
| 6 | Error Message Quality | ✅ LGTM (test-only) |
| 7 | Logging Fidelity | ✅ LGTM (test-only) |
| 8 | String Comparison | ✅ LGTM: regex matches ASCII digits only, ShouldContain is ordinal |
| 9 | API Surface | ✅ LGTM (test-only) |
| 10 | Target Authoring | ✅ LGTM (test-only) |
| 11 | Cross-Platform | ✅ LGTM: uses Path.Combine, Process.GetCurrentProcess().Id, standard APIs |
| 12 | Code Simplification | ✅ LGTM: boilerplate duplication is acceptable for test readability |
| 13 | Concurrency | ✅ LGTM: xunit.runner.json enforces parallelizeTestCollections: false / maxParallelThreads: 1 |
| 14 | Naming Precision | ✅ LGTM: names are descriptive and consistent with existing tests |
| 15 | SDK Integration | ✅ LGTM (test-only) |
| 16 | Evaluation Model | ✅ LGTM (test-only) |
| 17 | Correctness | ✅ LGTM: fake RestoreTask FullName matches TaskRouter.IsKnownProblematicTask check; Build() overload is self-contained |
| 18 | Documentation Accuracy | 📝 NIT: see inline comment |
| 19 | Dependency Management | ✅ LGTM (test-only) |
| 20 | Scope Discipline | ✅ LGTM: tests + comment cleanup in the same PR is reasonable |
| 21 | Security | ✅ LGTM (test-only) |
| 22 | Build Infrastructure | ✅ LGTM (test-only) |
| 23 | Binary Log Compatibility | ✅ LGTM (test-only) |
| 24 | Error Handling | ✅ LGTM (test-only) |

Findings: 1 NIT

One nit on the server-mode test's EnableNodeReuse comment: "// Load-bearing: see the /mt counterpart." is a fragile cross-reference that could become a dead pointer. Consider duplicating the 1-line explanation inline. Details in the inline comment.

The comment cleanup is well-executed overall: removed XML docs restated what the method name already says, the "// Arrange", "// Act", and "// Assert" markers were noise for these straightforward tests, and the condensed "// Load-bearing:" comments on the MT test preserve the non-obvious reasoning. The tests themselves are correct and well-structured.

Note

🔒 Integrity filter blocked 2 items

The following items were blocked because they don't meet the GitHub integrity level.

  • #13660 pull_request_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
  • #13660 pull_request_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".

To allow these resources, lower min-integrity in your GitHub frontmatter:

tools:
  github:
    min-integrity: approved  # merged | approved | unapproved | none

Generated by Expert Code Review (command) for issue #13660 · ● 9.7M

Comment thread src/Build.UnitTests/BackEnd/TaskRouter_IntegrationTests.cs
@OvesN OvesN requested a review from AR-May April 30, 2026 14:08
Comment thread src/Build/BackEnd/Components/RequestBuilder/TaskRouter.cs Outdated

OvesN commented May 12, 2026

How this was validated end-to-end

Repro project + driver script attached as a zip.
RestoreServerRepro.zip

Why this scenario reproduces the bug

NuGet keeps the Azure Artifacts credential-provider plugin process alive across
restores via a process-wide static singleton (PluginManager.Instance). The
plugin itself parses VSS_NUGET_EXTERNAL_FEED_ENDPOINTS exactly once into a
Lazy<Dictionary<endpoint, credentials>>
(VstsBuildTaskServiceEndpointCredentialProvider.LazyExternalCredentials) and
reuses that dictionary for its lifetime. In a long-lived MSBuild Server process
the plugin is never restarted, so a second restore against the same server
keeps using the first restore's PAT — even after that PAT is revoked or rotated
and the parent shell's env now contains a fresh PAT.
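
The parse-once behavior described above is easy to reproduce in miniature. The following Python sketch is not the credential provider's code: it uses functools.lru_cache as a stand-in for the Lazy<...> field, the endpoint URL is a placeholder, and cache_clear() is used only to model restarting the plugin process.

```python
import functools
import json
import os

ENV_VAR = "VSS_NUGET_EXTERNAL_FEED_ENDPOINTS"
ENDPOINT = "https://feed.example/index.json"  # placeholder, not the real feed URL


@functools.lru_cache(maxsize=None)
def external_credentials():
    """Parse the env var on the first call only; later calls return the cached dict."""
    payload = json.loads(os.environ[ENV_VAR])
    return {e["endpoint"]: e["password"] for e in payload["endpointCredentials"]}


def set_pat(pat):
    """Write the credentials JSON into the environment, as the parent shell would."""
    os.environ[ENV_VAR] = json.dumps(
        {"endpointCredentials": [
            {"endpoint": ENDPOINT, "username": "PAT", "password": pat}
        ]}
    )


set_pat("PAT-A")
first = external_credentials()[ENDPOINT]

set_pat("PAT-B")                          # PAT rotated in the environment
second = external_credentials()[ENDPOINT]  # cached value from the first parse

external_credentials.cache_clear()        # models a fresh plugin process
third = external_credentials()[ENDPOINT]

print(first, second, third)
```

The second lookup still sees the first PAT because nothing re-reads the environment, which is the same shape as the failure in a long-lived MSBuild Server process.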

The procedure below uses two PATs against Microsoft.VisualStudio.OpenTelemetry,
a NuGet package hosted only on the DevDiv vssdk feed, so every restore must
authenticate against the private feed via the Azure Artifacts Credential Provider.


One-time setup

1. Generate two PATs in the ADO portal: https://dev.azure.com/devdiv/_usersSettings/tokens

  • Organization: devdiv, Expiration: 1 day, Scopes → Custom defined → check Packaging → Read only
  • Copy both PAT strings

Procedure

# 1. Open a fresh PowerShell window
$patA = '<paste repro-A>'
$patB = '<paste repro-B>'

# 2. Run the script — it sets the env to PAT-A, runs Restore #1, pauses
cd D:\msbuild-repros\RestoreServerRepro
.\Repro.ps1 `
    -MSBuildEnv D:\msbuild\artifacts\msbuild-build-env.ps1 `
    -PatPrimary $patA `
    -PatSecondary $patB

# 3. Script pauses with "Press ENTER" prompt:
#    - Go to https://dev.azure.com/devdiv/_usersSettings/tokens
#    - Find 'repro-A', click Revoke, confirm
#    - Press ENTER in the script

Run this once against main (no fix) and once against this PR's branch.

Results

| Build | Restore 1 | Restore 2 |
|---|---|---|
| main (no fix) | OK | error NU1301: 401 (Unauthorized). Server's cached EnvironmentWrapper still holds the now-revoked PAT-A. Bug reproduced. |
| this PR | OK | OK. Workaround spawns a transient TaskHost; the fresh credential plugin process reads the current env and gets PAT-B. |

What's in the script

Repro.ps1 automates:

  • Sourcing msbuild-build-env.ps1 so the bootstrap MSBuild SDK is on PATH.
  • Killing any stale MSBuild Server before the run.
  • Building the env-var JSON {"endpointCredentials":[{"endpoint":"…/vssdk/…","username":"PAT","password":"…"}]}.
  • Running MSBuild.exe TestRestore.csproj -t:Restore -p:RestoreForce=true -p:RestoreNoHttpCache=true -bl:…
    (force + no-http-cache flags are required so Restore 2 actually contacts the
    feed; otherwise NuGet no-ops on the on-disk packages from Restore 1 and the
    credential path never runs).
  • Capturing both restores as binlogs (binlogs\restore-1.binlog, binlogs\restore-2.binlog).

Comment thread src/Build.UnitTests/BackEnd/TaskRouter_IntegrationTests.cs Outdated
Comment thread src/Build/BackEnd/Components/RequestBuilder/TaskRouter.cs Outdated
Comment thread src/Build/Instance/TaskFactories/AssemblyTaskFactory.cs Outdated
Comment thread src/Build/Instance/TaskFactories/TaskHostTask.cs Outdated
Comment thread src/Build/BackEnd/Components/Host/IHostInfo.cs Outdated

@AR-May AR-May left a comment

I think this might be overcomplicating things a bit. I think we could instead introduce an internal property on BuildParameters: there are already some of those. BuildParameters only crosses the process boundary between the main process and worker nodes, so the serialization versioning concerns I initially had also don’t apply.


OvesN commented May 13, 2026

/review


OvesN commented May 13, 2026

@AR-May @JanProvaznik
New version of workaround done.


github-actions Bot commented May 13, 2026

Expert Code Review (command) completed successfully!

Concurrency & Thread Safety — LGTM

@github-actions github-actions Bot left a comment

Expert Code Review — PR #13660

Summary

The approach is sound: routing tasks with problematic static state to transient (non-sidecar) TaskHost processes in long-lived host scenarios. The implementation is well-tested and follows established MSBuild patterns.

Findings Table

| # | Dimension | Severity | Finding |
|---|---|---|---|
| 1 | Correctness & IPC (D22) | MAJOR | IsLongLivedHost not included in ITranslatable.Translate(); workaround may not trigger on OOP worker nodes in pure server mode (without MT) |
| 2 | Code Simplification (D12) | Moderate | catch (Exception) too broad in PID extraction; should be catch (InvalidOperationException) |
| 3 | Error Message Quality (D5) | Minor | Localization comment lists "TaskHost" as non-localizable field name but it's prose text |
| 4 | Error Message Quality (D5) | Minor | "ParentProcessId" terminology slightly confusing when logged from the parent |
| 5 | Design (D10) | Suggestion | Consider FrozenSet<string> for extensibility; file issue for [RequiresProcessIsolation] attribute |

Dimensions Evaluated as LGTM

  • Backwards Compatibility (D1): Both triggers are opt-in features (MT mode, server mode); no break for traditional builds
  • Concurrency (D13): s_isLongLivedHost is write-once-before-use; local variables safely snapshot inside lock
  • Performance (D3): RequiresTransientTaskHost is O(1) string compare; process spawn overhead negligible vs restore I/O
  • API Surface (D8): All new members are internal; TaskRouter is internal class
  • Logging (D6): MessageImportance.Low appropriate; properly captured in binary log
  • Test Coverage (D4): Comprehensive: unit + integration, MT + server, positive + negative, fresh process verification

Recommendation

The IPC serialization gap (finding #1) is the most impactful issue. In pure server mode without multi-threaded enabled, the workaround won't protect OOP worker nodes. Consider either serializing the field or documenting this as a known limitation. The other findings are minor/moderate.
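
As an illustration of the kind of gap finding #1 describes, here is a minimal Python model of a field that is left out of a hand-written serializer. It is a sketch only: Params, translate, and from_wire are hypothetical names, and it does not establish whether the gap applies to the merged code.

```python
from dataclasses import dataclass


@dataclass
class Params:
    """Hypothetical stand-in for a parameters object sent to worker nodes."""
    multi_threaded: bool = False
    is_long_lived_host: bool = False

    def translate(self):
        # Only fields written here cross the process boundary.
        return {"multi_threaded": self.multi_threaded}

    @classmethod
    def from_wire(cls, data):
        # Fields missing from the payload fall back to their defaults.
        return cls(**data)


main_node = Params(multi_threaded=False, is_long_lived_host=True)
worker_node = Params.from_wire(main_node.translate())
print(main_node.is_long_lived_host, worker_node.is_long_lived_host)
```

In this model the worker silently sees the default value, so any check that depends on the flag would behave differently on the two sides of the boundary.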

Generated by Expert Code Review (command) for issue #13660 · ● 20.6M

Comment thread src/Build/BackEnd/Components/RequestBuilder/TaskRouter.cs
Comment thread src/Build/BackEnd/BuildManager/BuildParameters.cs
Comment thread src/Build/Resources/Strings.resx Outdated
Comment thread src/Build/BackEnd/BuildManager/BuildParameters.cs
Comment thread src/Build/Resources/xlf/Strings.cs.xlf Outdated

@AR-May AR-May left a comment

LGTM

@OvesN OvesN merged commit fd177d3 into dotnet:main May 14, 2026
10 checks passed

Development

Successfully merging this pull request may close these issues.

Execute Restore tasks in the TaskHost node in /mt mode or when msbuild server is on.