Description
When DOTNET_GCLargePages=1 is enabled, the GC can corrupt heap memory during normal operation without any explicit GC.Collect calls. Stale object references appear to survive in regions that are "decommitted" (a no-op with large pages) and later reused, causing NullReferenceException and AccessViolationException in ConcurrentDictionary internals.
This appears to be a different issue from #126903 (which required GCCollectionMode.Aggressive). This bug triggers during normal GC operation under memory pressure (achieved by setting DOTNET_GCHighMemPercent to a lower-than-default value). No user-initiated/explicit collections are needed. It reproduces with both the stock .NET 10 GC and the patched GC from commit 5713889. We have not yet attempted to reproduce it with a patched GC that implements #127290.
Reproduction Steps
Minimal repro program
Minimal reproduction project:
gc-largepages-repro-2.zip
Option 1: Docker (requires --privileged for huge page setup)
# Triggers corruption, usually within several minutes:
./run.sh
Option 2: Run locally on Linux
Requires real kernel huge pages allocated ahead of time (at least 2048 pages = 4GB):
# Reserve huge pages (requires root):
echo 2048 | sudo tee /proc/sys/vm/nr_hugepages
# Build:
dotnet build -c Release
# Triggers corruption:
DOTNET_GCLargePages=1 DOTNET_GCRegionRange=0x100000000 DOTNET_GCHeapHardLimit=0x100000000 \
DOTNET_GCHighMemPercent=0x26 DOTNET_GCPath=./libclrgcexp.so \
dotnet bin/Release/net10.0/GCLargePagesRepro.dll
# Does not trigger corruption (no large pages):
DOTNET_GCRegionRange=0x100000000 DOTNET_GCHeapHardLimit=0x100000000 \
DOTNET_GCHighMemPercent=0x26 DOTNET_GCPath=./libclrgcexp.so \
dotnet bin/Release/net10.0/GCLargePagesRepro.dll
The repro:
- Creates several ConcurrentDictionary objects and creates SOH/LOH memory pressure and churn.
- Loops until a time limit expires or corruption is seen.
Expected behavior
Normal use of ConcurrentDictionary should not cause heap corruption under memory pressure when using GCLargePages=1.
Actual behavior
Heap corruption and access violations or null reference exceptions. Corruption typically
occurs within minutes of starting:
=== Running: GCRegionRange=4GB, GCHighMemPercent=38%, real huge pages ===
Allocating 2048 huge pages (4096MB)...
Huge pages allocated: 2048 (4096MB)
THP status: [always] madvise never
GCLargePages: 1
GCHighMemPercent: 0x26
=== GCLargePages Heap Corruption Repro ===
Duration: 240 minutes
Dicts: 4, Writers: 8, Readers: 8
Server GC: True, Latency: Interactive
Runtime: .NET 10.0.7
GC Configuration:
GCLargePages = True
GCHighMemPercent = 38
GCRegionRange = 4294967296
GCHeapHardLimit = 4294967296
GCDynamicAdaptationMode = 0
[00:00:05] Entries=669,907 | Heap=274MB | Load=4.0% | GC2=3(+3) | NRE=0 AV=0 OOM=0
[NRE] Reader-3: System.NullReferenceException: Object reference not set to an instance of an object.
at System.Collections.Concurrent.ConcurrentDictionary`2.TryGetValue(TKey key, TValue& value)
at Program.ReaderLoop(Int32 threadId) in /app/Program.cs:line 157
[NRE] Reader-1: System.NullReferenceException: Object reference not set to an instance of an object.
at System.Collections.Concurrent.ConcurrentDictionary`2.TryGetValue(TKey key, TValue& value)
at Program.ReaderLoop(Int32 threadId) in /app/Program.cs:line 157
[NRE] Reader-7: System.NullReferenceException: Object reference not set to an instance of an object.
at System.Collections.Concurrent.ConcurrentDictionary`2.TryGetValue(TKey key, TValue& value)
at Program.ReaderLoop(Int32 threadId) in /app/Program.cs:line 157
[NRE] Reader-2: System.NullReferenceException: Object reference not set to an instance of an object.
at System.Collections.Concurrent.ConcurrentDictionary`2.TryGetValue(TKey key, TValue& value)
at Program.ReaderLoop(Int32 threadId) in /app/Program.cs:line 157
[NRE] Reader-5: System.NullReferenceException: Object reference not set to an instance of an object.
at System.Collections.Concurrent.ConcurrentDictionary`2.TryGetValue(TKey key, TValue& value)
at Program.ReaderLoop(Int32 threadId) in /app/Program.cs:line 157
[NRE] Reader-4: System.NullReferenceException: Object reference not set to an instance of an object.
at System.Collections.Concurrent.ConcurrentDictionary`2.TryGetValue(TKey key, TValue& value)
at Program.ReaderLoop(Int32 threadId) in /app/Program.cs:line 157
[NRE] Reader-0: System.NullReferenceException: Object reference not set to an instance of an object.
at System.Collections.Concurrent.ConcurrentDictionary`2.TryGetValue(TKey key, TValue& value)
at Program.ReaderLoop(Int32 threadId) in /app/Program.cs:line 157
[NRE] Reader-0: System.NullReferenceException: Object reference not set to an instance of an object.
at System.Collections.Concurrent.ConcurrentDictionary`2.TryGetValue(TKey key, TValue& value)
at Program.ReaderLoop(Int32 threadId) in /app/Program.cs:line 157
[NRE] Reader-5: System.NullReferenceException: Object reference not set to an instance of an object.
at System.Collections.Concurrent.ConcurrentDictionary`2.TryGetValue(TKey key, TValue& value)
at Program.ReaderLoop(Int32 threadId) in /app/Program.cs:line 157
[NRE] Reader-6: System.NullReferenceException: Object reference not set to an instance of an object.
at System.Collections.Concurrent.ConcurrentDictionary`2.TryGetValue(TKey key, TValue& value)
at Program.ReaderLoop(Int32 threadId) in /app/Program.cs:line 157
[NRE] Reader-6: System.NullReferenceException: Object reference not set to an instance of an object.
at System.Collections.Concurrent.ConcurrentDictionary`2.TryGetValue(TKey key, TValue& value)
at Program.ReaderLoop(Int32 threadId) in /app/Program.cs:line 157
[NRE] Reader-5: System.NullReferenceException: Object reference not set to an instance of an object.
at System.Collections.Concurrent.ConcurrentDictionary`2.TryGetValue(TKey key, TValue& value)
at Program.ReaderLoop(Int32 threadId) in /app/Program.cs:line 157
Fatal error.
System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
at System.Collections.Concurrent.ConcurrentDictionary`2[[System.__Canon, System.Private.CoreLib, Version=10.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[System.__Canon, System.Private.CoreLib, Version=10.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].TryGetValue(System.__Canon, System.__Canon ByRef)
at Program.ReaderLoop(Int32)
at Program+<>c__DisplayClass9_2.<Main>b__1()
at System.Threading.Thread.StartCallback()
The corruption cascades rapidly and multiple threads fault at the same millisecond,
consistent with a single GC event corrupting a heap region that multiple threads
then read from.
Regression?
No
Known Workarounds
Do not enable GCLargePages
Configuration
- Versions: .NET 10.0.7
- OS: Linux (tested on official Microsoft .NET Docker images)
- Our production servers are on Ubuntu 22.04/kernel 5.15.0
- Arch: x86_64
- Server/Concurrent GC enabled
- Happens both with stock GC and 10.x patched with previous fix: 5713889
- DOTNET_GCLargePages=1 with real kernel huge pages
- DOTNET_GCHighMemPercent can be used to tune GC behavior to cause corruption more easily
Other information
We discovered this issue while trying to track down a problem we're hitting recently in production when GCLargePages mode is active - we are occasionally getting NullReferenceExceptions related to ConcurrentDictionary where it does not seem possible to throw a NullReferenceException:
System.NullReferenceException: Object reference not set to an instance of an object.
at System.Collections.Concurrent.ConcurrentDictionary`2.GrowTable(Tables tables, Boolean resizeDesired, Boolean forceRehashIfNonRandomized)
We originally reported this as #126903 and one cause of
heap corruption was fixed and backported to .NET 10.
Unfortunately after running with the patched GC from that fix we ran into the issue in production after several weeks, even though that same custom GC resolved the issue in our standalone reproduction program. This issue is an attempt to reproduce the heap corruption we're experiencing without forcing aggressive GCs.
If it is helpful at all, here is an AI-generated theory on why the previous fix might have been incomplete:
Theory: The bug is in virtual_commit not zeroing on recommit
The most likely root cause is the recommit path, not just the decommit path:
gc.cpp line 7515-7517:
bool commit_succeeded_p = ((h_number >= 0) ? (use_large_pages_p ? true :
virtual_alloc_commit_for_heap (address, size, h_number)) :
GCToOSInterface::VirtualCommit(address, size));
When use_large_pages_p=true, virtual_commit returns true immediately without touching memory. On normal pages, VirtualCommit → mmap(MAP_FIXED) provides zeroed pages. With large pages, the memory retains whatever was there before.
Why the previous fix (5713889) is incomplete
The previous fix addressed decommit_region() (line 45081-45093) which IS properly handled — it explicitly clears to heap_segment_used() for large pages. That fix solved the aggressive GC path.
But there are still dangerous paths:
- decommit_heap_segment (line 12657): decommits a full segment under high memory load with no end_of_data parameter, meaning no clearing happens on large pages.
- The gap between used and committed: even in the safe decommit_region, the clearing only goes to heap_segment_used(), NOT heap_segment_committed(). If a recommitted region's new allocations extend past the old used marker, they hit uncleared stale data.
- GC internal metadata: when a region is recycled, the GC resets its bookkeeping pointers but doesn't necessarily zero the plug tree data, brick table entries, or free list threading within the region. During the next GC sweep, stale internal structures could be misinterpreted.