GC heap corruption with GCLargePages (Part 2: no aggressive GC required) #127892

### Description

When `DOTNET_GCLargePages=1` is enabled, the GC can corrupt heap memory during normal operation without any explicit `GC.Collect` calls. Stale object references appear to survive in regions that are "decommitted" (a no-op with large pages) and later reused, causing `NullReferenceException` and `AccessViolationException` in `ConcurrentDictionary` internals.

This appears to be a different issue from [#126903](https://github.com/dotnet/runtime/issues/126903) (which required `GCCollectionMode.Aggressive`). This bug triggers during normal GC operation under memory pressure (achieved by setting `DOTNET_GCHighMemPercent` to a lower-than-default value). No user-initiated/explicit collections are needed. It reproduces with both the stock .NET 10 GC and the patched GC from commit 5713889eb3f38b8. We have not yet attempted to reproduce it with a patched GC that implements https://github.com/dotnet/runtime/pull/127290.


### Reproduction Steps

### Minimal repro program

Minimal reproduction project: 

[gc-largepages-repro-2.zip](https://github.com/user-attachments/files/27459583/gc-largepages-repro-2.zip)

#### Option 1: Docker (requires `--privileged` for huge page setup)

```bash
# Triggers corruption, usually within several minutes:
./run.sh
```

#### Option 2: Run locally on Linux

Requires real kernel huge pages allocated ahead of time (at least 2048 pages = 4GB):

```bash
# Reserve huge pages (requires root):
echo 2048 | sudo tee /proc/sys/vm/nr_hugepages

# Build:
dotnet build -c Release

# Triggers corruption:
DOTNET_GCLargePages=1 DOTNET_GCRegionRange=0x100000000 DOTNET_GCHeapHardLimit=0x100000000 \
DOTNET_GCHighMemPercent=0x26 DOTNET_GCPath=./libclrgcexp.so \
dotnet bin/Release/net10.0/GCLargePagesRepro.dll

# Does not trigger corruption (no large pages):
DOTNET_GCRegionRange=0x100000000 DOTNET_GCHeapHardLimit=0x100000000 \
DOTNET_GCHighMemPercent=0x26 DOTNET_GCPath=./libclrgcexp.so \
dotnet bin/Release/net10.0/GCLargePagesRepro.dll
```
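The numeric settings in the commands above are written in hexadecimal, which is why the repro output later reports `GCHighMemPercent = 38` and `GCRegionRange = 4294967296`. The following is a minimal sketch of that arithmetic, not runtime code: `parse_hex_setting` and `huge_pages_needed` are hypothetical helper names, and the 2 MiB huge page size is an assumption implied by "2048 pages = 4GB".

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: interpret a setting such as "0x26" as hexadecimal.
// strtoull with base 16 accepts an optional "0x" prefix.
std::uint64_t parse_hex_setting(const char* text) {
    return std::strtoull(text, nullptr, 16);
}

// Hypothetical helper: number of huge pages needed to back heap_bytes,
// rounding up to a whole page.
std::uint64_t huge_pages_needed(std::uint64_t heap_bytes, std::uint64_t page_bytes) {
    assert(page_bytes != 0);
    return (heap_bytes + page_bytes - 1) / page_bytes;
}
```

With these helpers, `0x26` decodes to 38 (percent), `0x100000000` decodes to 4294967296 bytes (4 GiB), and a 4 GiB hard limit needs 2048 pages of 2 MiB each, which matches the `nr_hugepages` value reserved above.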

The repro:

1. Creates several `ConcurrentDictionary` objects and generates SOH/LOH memory pressure and churn.
2. Loops until a time limit expires or corruption is seen.

### Expected behavior

Normal use of `ConcurrentDictionary` should not cause heap corruption under memory pressure when using `GCLargePages=1`.

### Actual behavior

Heap corruption and access violations or null reference exceptions. Corruption typically
occurs within minutes of starting:

```
=== Running: GCRegionRange=4GB, GCHighMemPercent=38%, real huge pages ===

Allocating 2048 huge pages (4096MB)...
Huge pages allocated: 2048 (4096MB)
THP status: [always] madvise never
GCLargePages: 1
GCHighMemPercent: 0x26

=== GCLargePages Heap Corruption Repro ===
Duration: 240 minutes
Dicts: 4, Writers: 8, Readers: 8
Server GC: True, Latency: Interactive
Runtime: .NET 10.0.7

GC Configuration:
  GCLargePages = True
  GCHighMemPercent = 38
  GCRegionRange = 4294967296
  GCHeapHardLimit = 4294967296
  GCDynamicAdaptationMode = 0

[00:00:05] Entries=669,907 | Heap=274MB | Load=4.0% | GC2=3(+3) | NRE=0 AV=0 OOM=0
[NRE] Reader-3: System.NullReferenceException: Object reference not set to an instance of an object.
   at System.Collections.Concurrent.ConcurrentDictionary`2.TryGetValue(TKey key, TValue& value)
   at Program.ReaderLoop(Int32 threadId) in /app/Program.cs:line 157
[NRE] Reader-1: System.NullReferenceException: Object reference not set to an instance of an object.
   at System.Collections.Concurrent.ConcurrentDictionary`2.TryGetValue(TKey key, TValue& value)
   at Program.ReaderLoop(Int32 threadId) in /app/Program.cs:line 157
[NRE] Reader-7: System.NullReferenceException: Object reference not set to an instance of an object.
   at System.Collections.Concurrent.ConcurrentDictionary`2.TryGetValue(TKey key, TValue& value)
   at Program.ReaderLoop(Int32 threadId) in /app/Program.cs:line 157
[NRE] Reader-2: System.NullReferenceException: Object reference not set to an instance of an object.
   at System.Collections.Concurrent.ConcurrentDictionary`2.TryGetValue(TKey key, TValue& value)
   at Program.ReaderLoop(Int32 threadId) in /app/Program.cs:line 157
[NRE] Reader-5: System.NullReferenceException: Object reference not set to an instance of an object.
   at System.Collections.Concurrent.ConcurrentDictionary`2.TryGetValue(TKey key, TValue& value)
   at Program.ReaderLoop(Int32 threadId) in /app/Program.cs:line 157
[NRE] Reader-4: System.NullReferenceException: Object reference not set to an instance of an object.
   at System.Collections.Concurrent.ConcurrentDictionary`2.TryGetValue(TKey key, TValue& value)
   at Program.ReaderLoop(Int32 threadId) in /app/Program.cs:line 157
[NRE] Reader-0: System.NullReferenceException: Object reference not set to an instance of an object.
   at System.Collections.Concurrent.ConcurrentDictionary`2.TryGetValue(TKey key, TValue& value)
   at Program.ReaderLoop(Int32 threadId) in /app/Program.cs:line 157
[NRE] Reader-0: System.NullReferenceException: Object reference not set to an instance of an object.
   at System.Collections.Concurrent.ConcurrentDictionary`2.TryGetValue(TKey key, TValue& value)
   at Program.ReaderLoop(Int32 threadId) in /app/Program.cs:line 157
[NRE] Reader-5: System.NullReferenceException: Object reference not set to an instance of an object.
   at System.Collections.Concurrent.ConcurrentDictionary`2.TryGetValue(TKey key, TValue& value)
   at Program.ReaderLoop(Int32 threadId) in /app/Program.cs:line 157
[NRE] Reader-6: System.NullReferenceException: Object reference not set to an instance of an object.
   at System.Collections.Concurrent.ConcurrentDictionary`2.TryGetValue(TKey key, TValue& value)
   at Program.ReaderLoop(Int32 threadId) in /app/Program.cs:line 157
[NRE] Reader-6: System.NullReferenceException: Object reference not set to an instance of an object.
   at System.Collections.Concurrent.ConcurrentDictionary`2.TryGetValue(TKey key, TValue& value)
   at Program.ReaderLoop(Int32 threadId) in /app/Program.cs:line 157
[NRE] Reader-5: System.NullReferenceException: Object reference not set to an instance of an object.
   at System.Collections.Concurrent.ConcurrentDictionary`2.TryGetValue(TKey key, TValue& value)
   at Program.ReaderLoop(Int32 threadId) in /app/Program.cs:line 157
Fatal error.
System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
   at System.Collections.Concurrent.ConcurrentDictionary`2[[System.__Canon, System.Private.CoreLib, Version=10.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[System.__Canon, System.Private.CoreLib, Version=10.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].TryGetValue(System.__Canon, System.__Canon ByRef)
   at Program.ReaderLoop(Int32)
   at Program+<>c__DisplayClass9_2.<Main>b__1()
   at System.Threading.Thread.StartCallback()
```

The corruption cascades rapidly and multiple threads fault at the same millisecond,
consistent with a single GC event corrupting a heap region that multiple threads
then read from.

### Regression?

No

### Known Workarounds

Do not enable `GCLargePages`

### Configuration

- Versions: .NET 10.0.7
- OS: Linux (tested on official Microsoft .NET Docker images)
- Our production servers are on Ubuntu 22.04/kernel 5.15.0
- Arch: x86_64
- Server/Concurrent GC enabled
- Happens both with stock GC and 10.x patched with previous fix: https://github.com/dotnet/runtime/commit/5713889eb3f38b8621c1b8574b87d7265157ca49
- `DOTNET_GCLargePages=1` with real kernel huge pages
- `DOTNET_GCHighMemPercent` can be used to tune GC behavior to cause corruption more easily


### Other information

We discovered this issue while trying to track down a problem we're hitting recently in production when `GCLargePages` mode is active - we are occasionally getting `NullReferenceExceptions` related to `ConcurrentDictionary` where it does not seem possible to throw a `NullReferenceException`:

```
System.NullReferenceException: Object reference not set to an instance of an object.
   at System.Collections.Concurrent.ConcurrentDictionary`2.GrowTable(Tables tables, Boolean resizeDesired, Boolean forceRehashIfNonRandomized)
```

We originally reported this as https://github.com/dotnet/runtime/issues/126903 and one cause of
heap corruption was fixed and backported to .NET 10. 

Unfortunately after running with the patched GC from that fix [we ran into the issue in production](https://github.com/dotnet/runtime/issues/126903#issuecomment-4384898572) after several weeks, even though that same custom GC resolved the issue in our standalone reproduction program. This issue is an attempt to reproduce the heap corruption we're experiencing without forcing aggressive GCs.

If it is helpful at all, here is an AI-generated theory on why the previous fix might have been incomplete:

---

### Theory: The bug is in virtual_commit not zeroing on recommit
#### The most likely root cause is the recommit path, not just the decommit path:

gc.cpp line 7515-7517:

```c
bool commit_succeeded_p = ((h_number >= 0) ? (use_large_pages_p ? true :
                          virtual_alloc_commit_for_heap (address, size, h_number)) :
                          GCToOSInterface::VirtualCommit(address, size));
```

When `use_large_pages_p=true`, `virtual_commit` returns true immediately without touching memory. On normal pages, VirtualCommit → mmap(MAP_FIXED) provides zeroed pages. With large pages, the memory retains whatever was there before.
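To make the claim concrete, here is a toy model of the behavior described, under stated assumptions. It is not the GC's code: `ToyRegion` and `has_stale_bytes_after_recycle` are hypothetical names, a `std::vector` stands in for the region's memory, and zero-filling inside `commit()` stands in for the OS handing back fresh pages on the regular-page path.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a GC region (hypothetical, for illustration only).
struct ToyRegion {
    std::vector<unsigned char> memory;
    bool large_pages;

    ToyRegion(std::size_t size, bool use_large_pages)
        : memory(size, static_cast<unsigned char>(0)), large_pages(use_large_pages) {}

    // Decommit: with large pages this is modelled as a no-op. With regular
    // pages the contents are discarded, which is modelled in commit().
    void decommit() {}

    // Commit: on regular pages the model hands back zero-filled memory.
    // With large pages it reports success without touching memory.
    bool commit() {
        if (!large_pages) {
            std::fill(memory.begin(), memory.end(), static_cast<unsigned char>(0));
        }
        return true;
    }
};

// Writes "old object" bytes, recycles the region, and reports whether any
// of those bytes are still visible afterwards.
bool has_stale_bytes_after_recycle(bool use_large_pages) {
    ToyRegion region(64, use_large_pages);
    assert(region.memory.size() == 64);
    std::fill(region.memory.begin(), region.memory.end(),
              static_cast<unsigned char>(0xAB));
    region.decommit();
    if (!region.commit()) {
        return false;
    }
    return std::any_of(region.memory.begin(), region.memory.end(),
                       [](unsigned char b) { return b != 0; });
}
```

In this model, `has_stale_bytes_after_recycle(true)` reports stale data while `has_stale_bytes_after_recycle(false)` does not. That only illustrates the hypothesis; whether the real GC depends on zeroed memory at this point is exactly what the theory leaves open.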

#### Why the previous fix (5713889) is incomplete
The previous fix addressed `decommit_region()` (line 45081-45093) which IS properly handled — it explicitly clears to `heap_segment_used()` for large pages. That fix solved the aggressive GC path.

#### But there are still dangerous paths:

- `decommit_heap_segment` (line 12657) — decommits a full segment under high memory load with no `end_of_data` parameter, meaning no clearing happens on large pages.
- The gap between used and committed — even in the safe `decommit_region`, the clearing only goes to `heap_segment_used()`, NOT `heap_segment_committed()`. If a recommitted region's new allocations extend past the old used marker, they hit uncleared stale data.
- GC internal metadata — when a region is recycled, the GC resets its bookkeeping pointers but doesn't necessarily zero the plug tree data, brick table entries, or free list threading within the region. During the next GC sweep, stale internal structures could be misinterpreted.
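The second bullet (the gap between used and committed) can be sketched with another toy model. Everything here is hypothetical: `ToySegment`, `clear_prefix`, and `stale_after_clear` are illustrative names, the `used` and `committed` fields only mimic `heap_segment_used()` and `heap_segment_committed()`, and the sketch assumes rather than demonstrates that bytes beyond the `used` marker can be dirty.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a heap segment (hypothetical, for illustration only).
struct ToySegment {
    std::vector<unsigned char> bytes;
    std::size_t used;       // mimics heap_segment_used()
    std::size_t committed;  // mimics heap_segment_committed()
};

// Zeroes the first `limit` bytes of the segment.
void clear_prefix(ToySegment& seg, std::size_t limit) {
    assert(limit <= seg.bytes.size());
    std::fill_n(seg.bytes.begin(), limit, static_cast<unsigned char>(0));
}

// Counts nonzero bytes that a new allocation of alloc_size bytes, placed at
// the start of the segment, would observe.
std::size_t stale_bytes_in_allocation(const ToySegment& seg, std::size_t alloc_size) {
    assert(alloc_size <= seg.bytes.size());
    std::size_t stale = 0;
    for (std::size_t i = 0; i < alloc_size; ++i) {
        if (seg.bytes[i] != 0) {
            ++stale;
        }
    }
    return stale;
}

// Assumption: every committed byte is dirty, including those past `used`.
std::size_t stale_after_clear(bool clear_to_committed) {
    ToySegment seg{std::vector<unsigned char>(std::size_t{128},
                                              static_cast<unsigned char>(0xCD)),
                   64, 128};
    clear_prefix(seg, clear_to_committed ? seg.committed : seg.used);
    // The new allocation spans 96 bytes, 32 past the old `used` marker.
    return stale_bytes_in_allocation(seg, 96);
}
```

Clearing only up to `used` leaves 32 stale bytes visible to the 96-byte allocation in this model, while clearing up to `committed` leaves none. If the real `used` marker always covers every byte that was ever written, this path would be harmless, so the sketch shows what the theory would require rather than confirming it.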
