functional tests: add process timeout + diagnostic logging for hang diagnosis#1956

Closed
tyrielv wants to merge 10 commits into
microsoft:masterfrom
tyrielv:tyrielv/fix-flaky-gitcmd-hang

@tyrielv tyrielv commented May 1, 2026

Problem

GitCommandsTests(Full) in CI slice 3 (Release, x86_64) hangs for 30+ minutes after completing OneTimeSetUp, so the job hits its 60-minute timeout with no diagnostic output identifying which test hung. See run 25186133967.

Root cause: `ProcessHelper.StartProcess()` calls `ReadToEnd()` and `WaitForExit()` with no timeout. If a git process stalls (lock contention, credential prompt, etc.), the entire test fixture blocks indefinitely.

Changes

Process timeout (`ProcessHelper.cs`, `GitProcess.cs`)

  • Add configurable timeout to `ProcessHelper.Run()` / `StartProcess()`
  • Uses `ReadToEndAsync()` + `Task.Wait(timeout)` to bound the full process lifecycle
  • On timeout: kills entire process tree, throws `TimeoutException` with process details
  • Logs slow processes (>30s) with timestamp, command, and elapsed time
  • `ProcessHelper.DefaultTimeoutMs`: configurable via `GVFS_FT_PROCESS_TIMEOUT_SECONDS` env var (default: infinite)
  • `GitProcess.DefaultGitTimeoutMs`: 5 minutes per git operation, overridable via `GVFS_FT_GIT_TIMEOUT_SECONDS`
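
For illustration, the timeout pattern described in the bullets above could look roughly like the sketch below. This is not the code from this PR: the `BoundedProcess` class, its `Run` signature, the exception message, and the slow-process log format are all assumptions made for the example. Only stdout is redirected, to keep the sketch short.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Hypothetical helper (not the PR's ProcessHelper): runs a process and bounds
// its whole lifetime with a single timeout.
public static class BoundedProcess
{
    // Assumed threshold, mirroring the ">30s" mentioned in the description.
    private static readonly TimeSpan SlowThreshold = TimeSpan.FromSeconds(30);

    public static string Run(string fileName, string arguments, int timeoutMs)
    {
        var startInfo = new ProcessStartInfo(fileName, arguments)
        {
            UseShellExecute = false,
            RedirectStandardOutput = true,
            CreateNoWindow = true,
        };

        var stopwatch = Stopwatch.StartNew();
        using var process = Process.Start(startInfo)
            ?? throw new InvalidOperationException($"Failed to start '{fileName}'.");
        int pid = process.Id;

        // Drain stdout asynchronously so a full pipe buffer cannot block the child.
        Task<string> stdout = process.StandardOutput.ReadToEndAsync();

        // The read finishes only when every holder of the stdout pipe has closed it
        // (normally at process exit), so bounding this wait bounds the process.
        // A timeoutMs of -1 (Timeout.Infinite) waits forever.
        if (!stdout.Wait(timeoutMs))
        {
            try
            {
                process.Kill(entireProcessTree: true);
            }
            catch (InvalidOperationException)
            {
                // The process exited between the timeout and the kill.
            }

            throw new TimeoutException(
                $"'{fileName} {arguments}' (pid {pid}) did not finish within {timeoutMs} ms.");
        }

        process.WaitForExit();

        if (stopwatch.Elapsed > SlowThreshold)
        {
            // Assumed log format; the PR's actual format is not shown in the description.
            Console.WriteLine(
                $"[{DateTime.UtcNow:O}] slow process: {fileName} {arguments} took {stopwatch.Elapsed}");
        }

        return stdout.Result;
    }
}
```

A call such as `BoundedProcess.Run("git", "status", 300_000)` would then either return the captured stdout or throw `TimeoutException` after five minutes.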

Diagnostic logging (`GitRepoTests.cs`)

  • `[TEST-SETUP-START/END]` and `[TEST-TEARDOWN-START/END]` with timestamps and full test name
  • Identifies exactly which test is running when a hang occurs
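
As a rough sketch of what such markers might look like, the helper below formats one timestamped line per phase. The `TestLog` class, its method names, and the exact line layout are assumptions for illustration; the description does not show the actual format. In an NUnit fixture the test name would typically come from `TestContext.CurrentContext.Test.FullName`, which is left out here so the example has no package dependency.

```csharp
using System;
using System.Globalization;

// Hypothetical logging helper (not the PR's code).
public static class TestLog
{
    // Builds a line such as "[TEST-SETUP-START] 2026-05-01 12:00:00.000 Some.Test.Name".
    public static string Marker(string phase, string testFullName, DateTime utcNow)
    {
        string timestamp = utcNow.ToString("yyyy-MM-dd HH:mm:ss.fff", CultureInfo.InvariantCulture);
        return $"[{phase}] {timestamp} {testFullName}";
    }

    // Writes the marker and flushes so the line reaches the CI log even if the
    // next operation hangs.
    public static void Write(string phase, string testFullName)
    {
        Console.WriteLine(Marker(phase, testFullName, DateTime.UtcNow));
        Console.Out.Flush();
    }
}
```

With start and end markers around setup and teardown, the last unmatched START line in the CI log names the test that was running when the job stalled.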

CI restriction (`functional-tests.yaml`, temporary)

  • Matrix restricted to slice 3, Release, x86_64 to iterate quickly on the flaky hang
  • Original full matrix preserved in comments for restoration after fix

Testing

  • C# code compiles clean (only pre-existing warnings from other projects)
  • All changes are backward-compatible (new parameters have defaults)
  • Timeout defaults: 5 min per git process (CI), infinite (local dev unless env var set)
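
The environment-variable defaults could be resolved along the lines of the sketch below. The `TimeoutConfig` class and its method are hypothetical, and falling back to the default on unset, non-numeric, or non-positive values is an assumption; the description does not say how the PR treats invalid input.

```csharp
using System;
using System.Threading;

// Hypothetical helper (not the PR's code): reads a timeout in seconds from an
// environment variable and converts it to milliseconds.
public static class TimeoutConfig
{
    public static int FromEnvironmentSeconds(string variableName, int defaultMs)
    {
        var raw = Environment.GetEnvironmentVariable(variableName);

        // int.TryParse returns false for null, so an unset variable yields the default.
        // The upper bound guards the multiplication against overflow.
        if (int.TryParse(raw, out int seconds) && seconds > 0 && seconds <= int.MaxValue / 1000)
        {
            return seconds * 1000;
        }

        return defaultMs;
    }

    // Example wiring, using the variable names and defaults from the description:
    //   int processTimeoutMs = FromEnvironmentSeconds("GVFS_FT_PROCESS_TIMEOUT_SECONDS", Timeout.Infinite);
    //   int gitTimeoutMs     = FromEnvironmentSeconds("GVFS_FT_GIT_TIMEOUT_SECONDS", 5 * 60 * 1000);
}
```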

Part of: AB#62098959
Parent: AB#61580834

tyrielv added 10 commits April 29, 2026 16:13
Retarget from net471 to net10.0-windows10.0.17763.0 across all managed
projects. Enable NativeAOT self-contained deployment, eliminating the
.NET runtime dependency.

Build infrastructure:
- global.json: pin SDK 10.0.203
- Directory.Build.props: centralized TFM, SelfContained, PublishAot,
  OptimizationPreference=Speed
- Directory.Build.targets: AOT build targets; opt out test projects and
  GVFS.MSBuild (netstandard2.0) from AOT
- Build.bat: 3-step build (dotnet restore, VS MSBuild for C++, dotnet
  publish for managed AOT binaries)
- publish-aot.ps1: standalone script for local AOT publish testing
  (CI uses Build.bat; this script is for dev iteration)
- Update output paths in all scripts (net471 -> net10.0-.../publish)
- Update CI to .NET 10 SDK and windows-2025 runner
- Update installer MinVersion to 10.0.17763

Package updates:
- Microsoft.Windows.ProjFS 1.1 -> 2.1.0: pure C# P/Invoke replacing
  C++/CLI interop, required for NativeAOT compatibility
- Microsoft.Data.Sqlite 2.2.4 -> 9.0.4, Microsoft.Build.* 16 -> 17.12.6
- Add System.Diagnostics.EventLog, System.IO.Pipes.AccessControl:
  previously included in .NET Framework, now separate packages
- Remove GVFS.ProjFS (ProjFS is now a Windows OS feature)

Unit test fixture updates for new ProjFS managed API surface.

Output: ~20 MB native GVFS.exe, 36.7 MB installer (vs 107 MB with
full self-contained runtime)

Assisted-by: Claude Opus 4.6
Signed-off-by: Tyrie Vella <tyrielv@gmail.com>
Assembly.Location returns empty string under NativeAOT since there is no
managed assembly on disk. Assembly.GetName().Version returns null.

- ProcessHelper: use Environment.ProcessPath with null guard (can be null
  in certain hosting scenarios), fall back to AppContext.BaseDirectory
- HooksInstaller: same Environment.ProcessPath pattern with null guard
- GVFSEnlistment: AppDomain.CurrentDomain.FriendlyName replaces
  Assembly.GetEntryAssembly().GetName() for process name
- JsonTracer/PrettyConsoleEventListener: same pattern for version string

Assisted-by: Claude Opus 4.6
Signed-off-by: Tyrie Vella <tyrielv@gmail.com>
NamedPipeServerStream (WindowsPlatform.cs):
  ACL-accepting constructor removed from .NET Core; use
  NamedPipeServerStreamAcl.Create extension method.

Directory ACL APIs (WindowsFileSystem.cs, GVFSService.Windows.cs):
  Static Directory.GetAccessControl/SetAccessControl and
  Directory.CreateDirectory(path, security) removed from .NET Core;
  replaced with DirectoryInfo instance methods and
  DirectorySecurity.CreateDirectory extension.

Uri escaping (CloneVerb.cs, GVFSVerb.cs, OrgInfoApiClient.cs):
  Uri.EscapeUriString obsoleted in .NET 10 (does not escape '#', '?');
  use Uri.EscapeDataString. HttpUtility.UrlEncode (System.Web) replaced
  with WebUtility.UrlEncode (System.Net).

UseShellExecute (WindowsPlatform.cs, InProcessMount.cs):
  .NET Framework defaults UseShellExecute=true (ShellExecuteEx, no handle
  inheritance). .NET 10 defaults to false (CreateProcess, handles inherited).
  Without this, GVFS.Mount.exe inherits the caller's stdout pipe handle,
  causing callers that read to EOF to block indefinitely.

Truncated loose object detection (GitRepo.cs):
  .NET 10 DeflateStream silently returns partial data on truncated zlib
  instead of throwing InvalidDataException. CountingStream wrapper compares
  actual bytes read to header-declared size to detect corruption.

Assisted-by: Claude Opus 4.6
Signed-off-by: Tyrie Vella <tyrielv@gmail.com>
System.Management requires COM interop which is incompatible with
NativeAOT. Replace WMI queries (MSFT_Volume, MSFT_Partition, MSFT_Disk,
MSFT_PhysicalDisk) with direct kernel32 DeviceIoControl calls using
IOCTL_STORAGE_QUERY_PROPERTY and IOCTL_VOLUME_GET_VOLUME_DISK_EXTENTS
for disk telemetry collection.

Assisted-by: Claude Opus 4.6
Signed-off-by: Tyrie Vella <tyrielv@gmail.com>
ProjFS managed API v2.1.0 uses Marshal.PtrToStringUni which returns null
for IntPtr.Zero (kernel operations with PID 0). The old C++/CLI wrapper
returned String.Empty. Null-coalesce to match old behavior in all three
callback sites (OnPlaceholderFileCreated, OnPlaceholderFolderCreated,
OnPlaceholderFileHydrated); ConcurrentDictionary does not accept null keys.

Assisted-by: Claude Opus 4.6
Signed-off-by: Tyrie Vella <tyrielv@gmail.com>
Replace HttpClientHandler with SocketsHttpHandler for explicit connection
pool lifecycle management: configurable MaxConnectionsPerServer (2x CPU
count), PooledConnectionLifetime, and PooledConnectionIdleTimeout. Remove
UseDefaultCredentials (not supported on SocketsHttpHandler) and
ServicePointManager usage (.NET Framework only).

GitSsl: X509Certificate2(byte[]) constructor obsoleted; use
X509CertificateLoader.LoadCertificate.

GitAuthentication: adapt credential flow for new HTTP handler.

Assisted-by: Claude Opus 4.6
Signed-off-by: Tyrie Vella <tyrielv@gmail.com>
NativeAOT cannot use runtime reflection for JSON serialization.
GVFSJsonContext provides source-generated System.Text.Json serializers
for 25+ types used in named pipe messages and configuration.

GVFSJsonOptions chains source-gen (primary) with reflection fallback
for types not yet in the context, allowing incremental migration.

NamedPipeMessages: add parameterless constructors required by the
source generator's deserialization codegen.

Assisted-by: Claude Opus 4.6
Signed-off-by: Tyrie Vella <tyrielv@gmail.com>
.NET 10's FileInfo property setters no longer open write handles that
trigger ProjFS placeholder hydration. Adapt tests that relied on this.

BasicFileSystemTests: replace ExpandedFileAttributesAreUpdated with two
focused tests:
- PlaceholderMetadataSurvivesHydration: sets timestamps + Hidden on a
  placeholder, verifies they took effect, hydrates via read+write, and
  asserts CreationTime and Hidden survived the conversion.
- HydratedFileTimestampsAndAttributesAreUpdated: hydrates first, then
  sets all properties and verifies they stick.

GitCommandsTests: ChangeTimestampAndDiff now explicitly hydrates via
read+write before adjusting timestamps, since File.SetLastWriteTime
no longer triggers ProjFS hydration.

GVFSProcess: add 5-minute timeout per gvfs process invocation to
prevent CI hangs. Stream stdout/stderr for real-time CI output.

functional-tests.yaml: reduce mount sleep from 500ms to 100ms,
add timeout-minutes and --workers=1 for sequential execution.

Assisted-by: Claude Opus 4.6
Signed-off-by: Tyrie Vella <tyrielv@gmail.com>
…iagnosis

Add per-process timeout to ProcessHelper and GitProcess to prevent
functional tests from hanging indefinitely when a git process stalls.
Previously, ProcessHelper.StartProcess() called ReadToEnd() and
WaitForExit() with no timeout, causing the entire CI job to hit the
60-minute GitHub Actions timeout with no diagnostics.

Changes:
- ProcessHelper: add configurable timeout with async stdout read.
  Uses ReadToEndAsync() + Task.Wait(timeout) to bound the entire
  process lifecycle. On timeout, kills the process tree and throws
  TimeoutException with process details. Logs slow processes (>30s).
  Configurable via GVFS_FT_PROCESS_TIMEOUT_SECONDS env var.
- GitProcess: default 5-minute timeout per git operation, overridable
  via GVFS_FT_GIT_TIMEOUT_SECONDS env var.
- GitRepoTests: add timestamped [TEST-SETUP-START/END] and
  [TEST-TEARDOWN-START/END] logging with full test name to identify
  which specific test hangs.
- functional-tests.yaml: temporarily restrict matrix to slice 3
  (Release, x86_64) to iterate on the flaky GitCommandsTests(Full)
  hang. Set GVFS_FT_GIT_TIMEOUT_SECONDS=300 in CI.

AB#62098959

Assisted-by: Claude Opus 4.6
Signed-off-by: Tyrie Vella <tyrielv@gmail.com>
Single run of slice 3 passed without reproducing the hang. Add an
'attempt' matrix dimension (1-5) to run 5 parallel copies of the
same test slice, increasing the chance of hitting the race condition.

AB#62098959

Assisted-by: Claude Opus 4.6
Signed-off-by: Tyrie Vella <tyrielv@gmail.com>

tyrielv commented May 1, 2026

Root cause found: stdout truncation in WaitForExit(timeout). Fix applied to net10-pr (#1953) instead.
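
The comment above does not show the fix, and the sketch below is not taken from #1953. For context, one documented way output can be cut short: when stdout is redirected to asynchronous event handlers, `Process.WaitForExit(int)` may return before the final `OutputDataReceived` events have been raised, and calling the parameterless `WaitForExit()` afterwards waits for those handlers to finish. The `DrainedProcess` class and its signature are assumptions for illustration.

```csharp
using System;
using System.Diagnostics;
using System.Text;

// Hypothetical helper (not the fix from #1953): bounded wait followed by a
// full drain of asynchronously read stdout.
public static class DrainedProcess
{
    public static string Run(string fileName, string arguments, int timeoutMs)
    {
        var startInfo = new ProcessStartInfo(fileName, arguments)
        {
            UseShellExecute = false,
            RedirectStandardOutput = true,
            CreateNoWindow = true,
        };

        using var process = new Process { StartInfo = startInfo };
        var output = new StringBuilder();

        process.OutputDataReceived += (_, e) =>
        {
            // e.Data is null once the stream reaches end-of-file.
            if (e.Data != null)
            {
                lock (output) { output.AppendLine(e.Data); }
            }
        };

        process.Start();
        process.BeginOutputReadLine();

        if (!process.WaitForExit(timeoutMs))
        {
            process.Kill(entireProcessTree: true);
            throw new TimeoutException(
                $"'{fileName} {arguments}' did not exit within {timeoutMs} ms.");
        }

        // WaitForExit(int) returned true, but buffered output may still be pending.
        // The parameterless overload also waits for the async handlers to complete.
        process.WaitForExit();

        lock (output) { return output.ToString(); }
    }
}
```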

@tyrielv tyrielv closed this May 1, 2026
@tyrielv tyrielv deleted the tyrielv/fix-flaky-gitcmd-hang branch May 1, 2026 22:19