GC heap corruption with GCLargePages (Part 2: no aggressive GC required) #127892

@BenV

Description

When DOTNET_GCLargePages=1 is enabled, the GC can corrupt heap memory during normal operation without any explicit GC.Collect calls. Stale object references appear to survive in regions that are "decommitted" (a no-op with large pages) and later reused, causing NullReferenceException and AccessViolationException in ConcurrentDictionary internals.

This appears to be a different issue from #126903, which required GCCollectionMode.Aggressive. This bug triggers during normal GC operation under memory pressure (achieved by setting DOTNET_GCHighMemPercent to a lower-than-default value). No user-initiated or explicit collections are needed. It reproduces with both the stock .NET 10 GC and the patched GC from commit 5713889. We have not yet tried to reproduce it with a patched GC that implements #127290.

Reproduction Steps

Minimal reproduction project:

gc-largepages-repro-2.zip

Option 1: Docker (requires --privileged for huge page setup)

# Triggers corruption, usually within several minutes:
./run.sh

Option 2: Run locally on Linux

Requires real kernel huge pages allocated ahead of time (at least 2048 pages of 2MB each = 4GB):

# Reserve huge pages (requires root):
echo 2048 | sudo tee /proc/sys/vm/nr_hugepages

# Build:
dotnet build -c Release

# Triggers corruption:
DOTNET_GCLargePages=1 DOTNET_GCRegionRange=0x100000000 DOTNET_GCHeapHardLimit=0x100000000 \
DOTNET_GCHighMemPercent=0x26 DOTNET_GCPath=./libclrgcexp.so \
dotnet bin/Release/net10.0/GCLargePagesRepro.dll

# Does not trigger corruption (no large pages):
DOTNET_GCRegionRange=0x100000000 DOTNET_GCHeapHardLimit=0x100000000 \
DOTNET_GCHighMemPercent=0x26 DOTNET_GCPath=./libclrgcexp.so \
dotnet bin/Release/net10.0/GCLargePagesRepro.dll
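
As a sanity check on the settings above, the hexadecimal values can be decoded with a few compile-time assertions. This is an illustrative sketch only: the constant and function names are hypothetical, and the 2 MiB huge page size is an assumption (it matches the repro output of 2048 pages = 4096MB, but other systems may use a different default size).

```cpp
#include <cstdint>

// Assumption: the default x86_64 huge page size of 2 MiB.
constexpr std::uint64_t kHugePageBytes = 2ull * 1024 * 1024;

// Values copied from the environment variables in the repro command.
constexpr std::uint64_t kRegionRange   = 0x100000000ull;  // DOTNET_GCRegionRange
constexpr std::uint64_t kHeapHardLimit = 0x100000000ull;  // DOTNET_GCHeapHardLimit
constexpr unsigned      kHighMemPct    = 0x26;            // DOTNET_GCHighMemPercent

// Number of huge pages needed to back `bytes`, rounded up.
constexpr std::uint64_t huge_pages_for(std::uint64_t bytes) {
    return (bytes + kHugePageBytes - 1) / kHugePageBytes;
}

static_assert(kRegionRange == kHeapHardLimit, "region range equals the hard limit");
static_assert(kHeapHardLimit == 4294967296ull, "0x100000000 bytes is 4 GiB");
static_assert(huge_pages_for(kHeapHardLimit) == 2048, "4 GiB needs 2048 pages of 2 MiB");
static_assert(kHighMemPct == 38, "0x26 is 38 percent");
```

These numbers agree with the GC configuration the repro prints at startup (GCHeapHardLimit = 4294967296, GCHighMemPercent = 38).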

The repro:

  1. Creates several ConcurrentDictionary objects and creates SOH/LOH memory pressure and churn.
  2. Loops until a time limit expires or corruption is seen.

Expected behavior

Normal use of ConcurrentDictionary should not cause heap corruption under memory pressure when DOTNET_GCLargePages=1 is set.

Actual behavior

Heap corruption that surfaces as null reference exceptions and access violations. Corruption typically occurs within minutes of starting:

=== Running: GCRegionRange=4GB, GCHighMemPercent=38%, real huge pages ===

Allocating 2048 huge pages (4096MB)...
Huge pages allocated: 2048 (4096MB)
THP status: [always] madvise never
GCLargePages: 1
GCHighMemPercent: 0x26

=== GCLargePages Heap Corruption Repro ===
Duration: 240 minutes
Dicts: 4, Writers: 8, Readers: 8
Server GC: True, Latency: Interactive
Runtime: .NET 10.0.7

GC Configuration:
  GCLargePages = True
  GCHighMemPercent = 38
  GCRegionRange = 4294967296
  GCHeapHardLimit = 4294967296
  GCDynamicAdaptationMode = 0

[00:00:05] Entries=669,907 | Heap=274MB | Load=4.0% | GC2=3(+3) | NRE=0 AV=0 OOM=0
[NRE] Reader-3: System.NullReferenceException: Object reference not set to an instance of an object.
   at System.Collections.Concurrent.ConcurrentDictionary`2.TryGetValue(TKey key, TValue& value)
   at Program.ReaderLoop(Int32 threadId) in /app/Program.cs:line 157
[NRE] Reader-1: System.NullReferenceException: Object reference not set to an instance of an object.
   at System.Collections.Concurrent.ConcurrentDictionary`2.TryGetValue(TKey key, TValue& value)
   at Program.ReaderLoop(Int32 threadId) in /app/Program.cs:line 157
[NRE] Reader-7: System.NullReferenceException: Object reference not set to an instance of an object.
   at System.Collections.Concurrent.ConcurrentDictionary`2.TryGetValue(TKey key, TValue& value)
   at Program.ReaderLoop(Int32 threadId) in /app/Program.cs:line 157
[NRE] Reader-2: System.NullReferenceException: Object reference not set to an instance of an object.
   at System.Collections.Concurrent.ConcurrentDictionary`2.TryGetValue(TKey key, TValue& value)
   at Program.ReaderLoop(Int32 threadId) in /app/Program.cs:line 157
[NRE] Reader-5: System.NullReferenceException: Object reference not set to an instance of an object.
   at System.Collections.Concurrent.ConcurrentDictionary`2.TryGetValue(TKey key, TValue& value)
   at Program.ReaderLoop(Int32 threadId) in /app/Program.cs:line 157
[NRE] Reader-4: System.NullReferenceException: Object reference not set to an instance of an object.
   at System.Collections.Concurrent.ConcurrentDictionary`2.TryGetValue(TKey key, TValue& value)
   at Program.ReaderLoop(Int32 threadId) in /app/Program.cs:line 157
[NRE] Reader-0: System.NullReferenceException: Object reference not set to an instance of an object.
   at System.Collections.Concurrent.ConcurrentDictionary`2.TryGetValue(TKey key, TValue& value)
   at Program.ReaderLoop(Int32 threadId) in /app/Program.cs:line 157
[NRE] Reader-0: System.NullReferenceException: Object reference not set to an instance of an object.
   at System.Collections.Concurrent.ConcurrentDictionary`2.TryGetValue(TKey key, TValue& value)
   at Program.ReaderLoop(Int32 threadId) in /app/Program.cs:line 157
[NRE] Reader-5: System.NullReferenceException: Object reference not set to an instance of an object.
   at System.Collections.Concurrent.ConcurrentDictionary`2.TryGetValue(TKey key, TValue& value)
   at Program.ReaderLoop(Int32 threadId) in /app/Program.cs:line 157
[NRE] Reader-6: System.NullReferenceException: Object reference not set to an instance of an object.
   at System.Collections.Concurrent.ConcurrentDictionary`2.TryGetValue(TKey key, TValue& value)
   at Program.ReaderLoop(Int32 threadId) in /app/Program.cs:line 157
[NRE] Reader-6: System.NullReferenceException: Object reference not set to an instance of an object.
   at System.Collections.Concurrent.ConcurrentDictionary`2.TryGetValue(TKey key, TValue& value)
   at Program.ReaderLoop(Int32 threadId) in /app/Program.cs:line 157
[NRE] Reader-5: System.NullReferenceException: Object reference not set to an instance of an object.
   at System.Collections.Concurrent.ConcurrentDictionary`2.TryGetValue(TKey key, TValue& value)
   at Program.ReaderLoop(Int32 threadId) in /app/Program.cs:line 157
Fatal error.
System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
   at System.Collections.Concurrent.ConcurrentDictionary`2[[System.__Canon, System.Private.CoreLib, Version=10.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[System.__Canon, System.Private.CoreLib, Version=10.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].TryGetValue(System.__Canon, System.__Canon ByRef)
   at Program.ReaderLoop(Int32)
   at Program+<>c__DisplayClass9_2.<Main>b__1()
   at System.Threading.Thread.StartCallback()

The corruption cascades rapidly: multiple threads fault within the same millisecond, which is consistent with a single GC event corrupting a heap region that multiple threads then read from.

Regression?

No

Known Workarounds

Do not enable GCLargePages.

Configuration

  • Versions: .NET 10.0.7
  • OS: Linux (tested on official Microsoft .NET Docker images)
  • Our production servers are on Ubuntu 22.04/kernel 5.15.0
  • Arch: x86_64
  • Server/Concurrent GC enabled
  • Happens both with stock GC and 10.x patched with previous fix: 5713889
  • DOTNET_GCLargePages=1 with real kernel huge pages
  • DOTNET_GCHighMemPercent can be used to tune GC behavior to cause corruption more easily

Other information

We discovered this issue while tracking down a problem we have recently been hitting in production when GCLargePages mode is active. We occasionally get a NullReferenceException from ConcurrentDictionary code paths where one should not be possible:

System.NullReferenceException: Object reference not set to an instance of an object.
   at System.Collections.Concurrent.ConcurrentDictionary`2.GrowTable(Tables tables, Boolean resizeDesired, Boolean forceRehashIfNonRandomized)

We originally reported this as #126903, and one cause of heap corruption was fixed and backported to .NET 10.

Unfortunately, after several weeks of running in production with the patched GC from that fix, we hit the issue again, even though that same custom GC resolved the issue in our standalone reproduction program. This issue is an attempt to reproduce the heap corruption we're experiencing without forcing aggressive GCs.

If it is helpful at all, here is an AI-generated theory on why the previous fix might have been incomplete:


Theory: The bug is in virtual_commit not zeroing on recommit

The most likely root cause is the recommit path, not just the decommit path:

gc.cpp line 7515-7517:

bool commit_succeeded_p = ((h_number >= 0) ? (use_large_pages_p ? true :
                          virtual_alloc_commit_for_heap (address, size, h_number)) :
                          GCToOSInterface::VirtualCommit(address, size));

When use_large_pages_p=true, virtual_commit returns true immediately without touching memory. On normal pages, VirtualCommit → mmap(MAP_FIXED) provides zeroed pages. With large pages, the memory retains whatever was there before.
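
The claimed difference between the two commit paths can be illustrated with a toy model. This is a sketch of the hypothesis only, not code from gc.cpp: `ToyRegion`, `commit`, `decommit`, and `stale_data_survives` are hypothetical names, and the zero-fill on the normal-page path stands in for the OS supplying zeroed pages, as the theory assumes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model of a heap region. NOT the runtime's data structure;
// every name here is hypothetical.
struct ToyRegion {
    std::vector<std::uint8_t> mem;  // stands in for the reserved address range
    std::size_t committed = 0;      // bytes the model treats as committed
    bool large_pages;

    ToyRegion(std::size_t size, bool use_large_pages)
        : mem(size, 0), large_pages(use_large_pages) {}

    // Bookkeeping-only decommit: the contents of mem are left as they were.
    void decommit() { committed = 0; }

    // Commit the range [0, bytes).
    void commit(std::size_t bytes) {
        assert(bytes <= mem.size());
        if (bytes <= committed) return;
        if (!large_pages) {
            // Normal pages: model the OS supplying zero-filled memory.
            std::fill(mem.begin() + committed, mem.begin() + bytes,
                      std::uint8_t{0});
        }
        // Large pages: report success without touching memory.
        committed = bytes;
    }
};

// Returns true if a byte written before decommit is still visible
// after the region is committed again.
bool stale_data_survives(bool use_large_pages) {
    ToyRegion r(64, use_large_pages);
    r.commit(64);
    r.mem[10] = 0xAB;  // a value left behind by the region's previous use
    r.decommit();
    r.commit(64);
    return r.mem[10] != 0;
}
```

In this model `stale_data_survives(true)` returns true and `stale_data_survives(false)` returns false, which is the asymmetry the theory describes. Whether the real GC behaves this way on every recommit path is exactly what remains to be confirmed.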

Why the previous fix (5713889) is incomplete

The previous fix addressed decommit_region() (line 45081-45093) which IS properly handled — it explicitly clears to heap_segment_used() for large pages. That fix solved the aggressive GC path.

But there are still dangerous paths:

  • decommit_heap_segment (line 12657): decommits a full segment under high memory load with no end_of_data parameter, meaning no clearing happens on large pages.
  • The gap between used and committed: even in the safe decommit_region, the clearing only goes to heap_segment_used(), NOT heap_segment_committed(). If a recommitted region's new allocations extend past the old used marker, they hit uncleared stale data.
  • GC internal metadata: when a region is recycled, the GC resets its bookkeeping pointers but doesn't necessarily zero the plug tree data, brick table entries, or free list threading within the region. During the next GC sweep, stale internal structures could be misinterpreted.
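
One way to test the hypotheses above would be to scan memory that is expected to be clean before it is handed back for allocation. The helper below is a minimal sketch of such a check. `first_nonzero_offset` is a hypothetical name, not a function in gc.cpp, and where such a check would be called inside the GC is left open.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns the offset of the first non-zero byte in [p, p + len),
// or len if every byte in the range is zero.
std::size_t first_nonzero_offset(const std::uint8_t* p, std::size_t len) {
    assert(p != nullptr || len == 0);
    for (std::size_t i = 0; i < len; ++i) {
        if (p[i] != 0) {
            return i;  // first stale byte found
        }
    }
    return len;  // range is clean
}
```

A return value smaller than `len` for a range between the old used marker and the committed end would support the second bullet; a clean scan there would point elsewhere.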
