Skip decommit for large pages and add fake large pages test mode #127290
Merged
janvorli merged 3 commits on Apr 28, 2026
Tagging subscribers to this area: @JulieLeeMSFT, @dotnet/gc
mangod9 reviewed Apr 22, 2026
mangod9 reviewed Apr 22, 2026
VSadov reviewed Apr 22, 2026
The test fails on x86. Perhaps just make the test incompatible with 32-bit?
This was referenced Apr 22, 2026
Force-pushed from 6b4e8d4 to 16345c8
With large pages, VirtualDecommit is a no-op since large pages cannot be partially decommitted. PR dotnet#126929 fixed the resulting stale data corruption by adding memclr in virtual_decommit, but this approach has downsides: the memory is never returned to the OS, yet we pay for the clearing and produce misleading committed/used bookkeeping.

Instead, skip the decommit entirely for large pages:

1. distribute_free_regions: skip the aggressive tail-region decommit (the committed-but-unallocated tail of in-use regions). This was the path that caused the heap corruption in dotnet#126903.
2. decommit_heap_segment: skip the whole-segment decommit used for segment hoarding and BGC segment deletion. Same class of issue: committed/used are lowered but physical memory retains stale data.
3. decommit_region: bypass virtual_decommit and call reduce_committed_bytes directly, since decommit_region already handles large pages correctly by clearing memory itself.
4. virtual_decommit: add an assert that it is never called for heap memory when large pages are on. This catches any future caller that forgets to handle the large pages case. The end_of_data parameter and no-op ternary added by dotnet#126929 are removed.

Add GCLargePages=2 mode that simulates large pages using small pages: sets use_large_pages_p=true but reserves with normal pages and commits everything upfront. This exercises all large page GC code paths without requiring OS large page setup or privileges, enabling CI testing.

Fix dotnet#126903
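To make the bookkeeping problem concrete, here is a minimal toy model of the tail decommit. It is only a sketch: `ToyRegion`, `toy_os_decommit`, `decommit_tail_old` and `decommit_tail_new` are hypothetical names invented for illustration and do not correspond to the GC's actual data structures or functions. The small-page decommit is modeled by zeroing the range, on the assumption that recommitted small pages come back zeroed from the OS.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical toy region; not the GC's real region layout.
struct ToyRegion
{
    std::vector<uint8_t> memory; // stands in for the region's address range
    size_t committed;            // bytes the bookkeeping considers committed
    size_t allocated;            // bytes currently occupied by objects
};

// Stand-in for the OS decommit call (an assumption of this sketch).
// Small pages: model "decommit, later recommit zeroed" by zeroing the range.
// Large pages: the call does nothing, so stale bytes survive.
void toy_os_decommit(ToyRegion& region, size_t from, bool large_pages)
{
    if (!large_pages)
        std::memset(region.memory.data() + from, 0, region.memory.size() - from);
}

// Problematic pattern: bookkeeping is lowered even when nothing was decommitted.
void decommit_tail_old(ToyRegion& region, bool large_pages)
{
    toy_os_decommit(region, region.allocated, large_pages);
    region.committed = region.allocated;
}

// Pattern described above: skip the decommit entirely for large pages,
// so the bookkeeping keeps matching what is physically committed.
void decommit_tail_new(ToyRegion& region, bool large_pages)
{
    if (large_pages)
        return;
    toy_os_decommit(region, region.allocated, large_pages);
    region.committed = region.allocated;
}
```

With `large_pages` set, `decommit_tail_old` leaves `committed` lowered while the tail still holds old bytes, which is the state that a later "commit" would wrongly treat as fresh zeroed memory. `decommit_tail_new` leaves both the memory and the bookkeeping untouched in that case.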
Address review feedback from mangod9 and janvorli.
Rename large_pages_fake_mode_p to large_pages_emulation_mode_p and update comments to use emulation terminology throughout. Disable test on 32-bit: GCHeapHardLimit=0xC0000000 exceeds the virtual address space and GCLargePages is gated by HOST_64BIT.
Force-pushed from a13a490 to 80b8740
pavelsavara reviewed Apr 28, 2026
JulieLeeMSFT pushed a commit that referenced this pull request Apr 29, 2026
Fixes #126903

## Customer Impact

- [x] Customer reported
- [ ] Found internally

GC heap corruption when `DOTNET_GCLargePages=1` is enabled on Linux (#126903). Reproducible by calling `GC.Collect(2, GCCollectionMode.Aggressive, true, true)` with large pages enabled, but also occurs in normal production workloads without aggressive GC.

## Regression

- [ ] Yes
- [x] No

This is a pre-existing bug in the GC's large-page decommit logic. When `GCLargePages` is enabled, the GC skips OS-level decommits but still updates bookkeeping as if the decommit succeeded. This causes regions to be reused without being zeroed, leading to heap corruption. The bug has existed since Regions was enabled.

## Testing

The fix was validated by the customer against their production workload.

## Risk

Low. The fix clears decommitted memory in the large-pages scenario to ensure regions are properly zeroed before reuse. This is a targeted change to the GC's decommit path that only affects `GCLargePages=1` configurations. The larger fix #127290 is made in .NET 11.
MichalStrehovsky added a commit that referenced this pull request May 2, 2026
NativeAOT's `RhConfig::Environment::TryGetIntegerValue` had a
hand-rolled hex parser that rejected the `0x`/`0X` prefix — returning a
parse error when it encountered `x`. This meant env vars like
`DOTNET_GCHeapHardLimit=0xC0000000` silently failed to parse, leaving
the hard limit unset. With `GCLargePages=2` also set, the GC would then
return `CLR_E_GC_LARGE_PAGE_MISSING_HARD_LIMIT` and fail initialization.
CoreCLR's equivalent uses `strtoul(..., 16)` which handles the prefix
natively.
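For reference, the `strtoul` behavior mentioned above can be checked with a small stand-alone program. This is a sketch only: `parse_with_strtoul` is a hypothetical helper, and the end-pointer validation is an assumption about how a caller might reject trailing garbage, not a copy of CoreCLR's config code. Overflow (`errno == ERANGE`) is deliberately not handled here.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper: parse a whole string as hex using strtoul.
// With base 16, strtoul accepts an optional "0x"/"0X" prefix.
bool parse_with_strtoul(const char* text, unsigned long* result)
{
    char* end = nullptr;
    unsigned long value = std::strtoul(text, &end, 16);

    // Reject input where nothing was consumed or where characters remain
    // after the number (for example the 'x' in a bare "0x").
    if (end == text || *end != '\0')
        return false;

    *result = value;
    return true;
}
```

Both `"0xC0000000"` and `"C0000000"` parse to the same value with this helper, while a bare `"0x"` is rejected because the `x` is left unconsumed.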
## Description
- **`src/coreclr/nativeaot/Runtime/RhConfig.cpp`** — In
`TryGetIntegerValue`, skip a leading `0x`/`0X` prefix when parsing in
hex mode, before entering the digit loop. Additionally, return `false`
(parse error) when the value is exactly `"0x"` or `"0X"` with no hex
digits following the prefix, matching CoreCLR's `strtoul` behavior:
```cpp
uint32_t startIndex = 0;
if (!decimal && cchResult >= 2 && buffer[0] == '0' && (buffer[1] == 'x' || buffer[1] == 'X'))
{
    startIndex = 2;
    if (startIndex == cchResult)
        return false; // parse error - hex prefix without any digits
}

for (uint32_t i = startIndex; i < cchResult; i++)
```
This aligns NativeAOT's config parsing with CoreCLR's `strtoul`-based
behavior and fixes the `Collect_Aggressive_LargePages` test failure
under NativeAOT.
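The prefix handling can be exercised in isolation with a stand-alone version of the same digit loop. This is a minimal sketch, not the runtime's `TryGetIntegerValue`: `try_parse_hex` is a hypothetical function that works on a NUL-terminated string instead of the runtime's buffer and length, covers only the hex path, and has no overflow check.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-alone helper mirroring the hex path of the fix.
bool try_parse_hex(const char* text, uint64_t* result)
{
    size_t length = std::strlen(text);
    size_t startIndex = 0;

    // Skip an optional "0x"/"0X" prefix; a prefix with no digits is an error.
    if (length >= 2 && text[0] == '0' && (text[1] == 'x' || text[1] == 'X'))
    {
        startIndex = 2;
        if (startIndex == length)
            return false;
    }

    uint64_t value = 0;
    for (size_t i = startIndex; i < length; i++)
    {
        char ch = text[i];
        value *= 16; // no overflow check; kept minimal for illustration
        if (ch >= '0' && ch <= '9')
            value += ch - '0';
        else if (ch >= 'a' && ch <= 'f')
            value += (ch - 'a') + 10;
        else if (ch >= 'A' && ch <= 'F')
            value += (ch - 'A') + 10;
        else
            return false; // parse error: not a hex digit
    }

    *result = value;
    return true;
}
```

With the prefix skip in place, `"0xC0000000"` and `"C0000000"` yield the same value, and a plain `"2"` (as used for `GCLargePages=2`) keeps parsing as before.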
<!-- START COPILOT ORIGINAL PROMPT -->
<details>
<summary>Original prompt</summary>
## Problem
NativeAOT's `RhConfig::Environment::TryGetIntegerValue` in
`src/coreclr/nativeaot/Runtime/RhConfig.cpp` uses a hand-rolled hex
parser that does not handle the `0x` or `0X` prefix. This causes config
values like `DOTNET_GCHeapHardLimit=0xC0000000` to fail to parse,
because when the parser encounters the `x` character it returns `false`
(parse error).
CoreCLR's equivalent code (`CLRConfigNoCache::TryAsInteger` in
`src/coreclr/inc/clrconfignocache.h`) uses `strtoul(_value, &endPtr,
radix)` which natively handles the `0x` prefix when radix is 16.
This causes the test `Collect_Aggressive_LargePages` added in PR #127290
to fail under NativeAOT: the `GCHeapHardLimit` fails to parse, so no
hard limit is set, but `GCLargePages=2` succeeds → the GC returns
`CLR_E_GC_LARGE_PAGE_MISSING_HARD_LIMIT` and the process exits with -1.
## Fix
In `src/coreclr/nativeaot/Runtime/RhConfig.cpp`, in the
`TryGetIntegerValue` method, when parsing in hex mode (i.e., `decimal`
is false), skip a leading `0x` or `0X` prefix before entering the
digit-parsing loop. This matches the behavior of `strtoul` with radix 16
that CoreCLR uses.
Specifically, after reading the environment variable into `buffer` and
before the parsing loop, add:
```cpp
uint32_t startIndex = 0;
if (!decimal && cchResult >= 2 && buffer[0] == '0' && (buffer[1] == 'x' || buffer[1] == 'X'))
{
    startIndex = 2;
}
```
Then change the loop from `for (uint32_t i = 0; ...)` to `for (uint32_t
i = startIndex; ...)`.
The following is the prior conversation context from the user's chat
exploration (may be truncated):
User: ```
16:22:36.657 Running test:
GC\API\GC\Collect_Aggressive_LargePages\Collect_Aggressive_LargePages.cmd
Return code: 1
Raw output file:
C:\h\w\B1EC0A05\w\B5C209C7\uploads\API\GC\Collect_Aggressive_LargePages\output.txt
Raw output:
BEGIN EXECUTION
call C:\h\w\B1EC0A05\p\nativeaottest.cmd
C:\h\w\B1EC0A05\w\B5C209C7\e\GC\API\GC\Collect_Aggressive_LargePages\
Collect_Aggressive_LargePages.dll
Expected: 100
Actual: -1
END EXECUTION - FAILED
FAILED
Test Harness Exitcode is : 1
To run the test:
Set up CORE_ROOT and run.
>
C:\h\w\B1EC0A05\w\B5C209C7\e\GC\GC\native\..\..\API\GC\Collect_Aggressive_LargePages\Collect_Aggressive_LargePages.cmd
Xunit.Sdk.TrueException:
Return code: 1
Raw output file:
C:\h\w\B1EC0A05\w\B5C209C7\uploads\API\GC\Collect_Aggressive_LargePages\output.txt
Raw output:
BEGIN EXECUTION
call C:\h\w\B1EC0A05\p\nativeaottest.cmd
C:\h\w\B1EC0A05\w\B5C209C7\e\GC\API\GC\Collect_Aggressive_LargePages\
Collect_Aggressive_LargePages.dll
Expected: 100
Actual: -1
END EXECUTION - FAILED
FAILED
Test Harness Exitcode is : 1
To run the test:
Set up CORE_ROOT and run.
>
C:\h\w\B1EC0A05\w\B5C209C7\e\GC\GC\native\..\..\API\GC\Collect_Aggressive_LargePages\Collect_Aggressive_LargePages.cmd
at Xunit.Assert.True(Nullable`1, String)
at TestLibrary.OutOfProcessTest.RunOutOfProcessTest(String, String)
at Program.<<Main>$>g__TestExecutor26|0_27(StreamWriter, StreamWriter,
Program.<>c__DisplayClass0_0&)
16:22:36.755 Failed test:
GC\API\GC\Collect_Aggressive_LargePages\Collect_Aggressive_LargePages.cmd
```
the test added here is failing with native AOT. why?
Assistant: The test fails on NativeAOT because **NativeAOT's config reader cannot parse the `0x` prefix** in the `DOTNET_GCHeapHardLimit=0xC0000000` environment variable.
Looking at NativeAOT's `RhConfig::Environment::TryGetIntegerValue` in:
```c++ name=src/coreclr/nativeaot/Runtime/RhConfig.cpp url=https://github.com/dotnet/runtime/blob/bfaabae213cf02d343846d253b7149a03819680d/src/coreclr/nativeaot/Runtime/RhConfig.cpp#L58-L84
uint64_t uiResult = 0;

for (uint32_t i = 0; i < cchResult; i++)
{
    TCHAR ch = buffer[i];
    // ...
    else
    {
        uiResult *= 16;

        if ((ch >= '0') && (ch <= '9'))
            uiResult += ch - '0';
        else if ((ch >= 'a') && (ch <= 'f'))
            uiResult += (ch - 'a') + 10;
        else if ((ch >= 'A') && (ch <= 'F'))
            uiResult += (ch - 'A') + 10;
        else
            return false; // parse error  <-- 'x' hits this
    }
}
```
When the value is `0xC0000000`, the parser reads hex digits character-by-character. It handles `0`, but then encounters `x`, which isn't a valid hex digit — so it **returns `false`** (parse error). This means `GCHeapHardLimit` is never set.
**The cascade:**
1. `DOTNET_GCHeapHardLimit=0xC0000000` fails to parse → no hard limit is set
2. `DOTNET_GCLargePages=2` is read successfully (it's a simple `2`, valid hex) → `use_large_pages_p = true`
3. The GC hits this check in [`interface.cpp`](https://github.com/dotnet/runtime/blob/bfaabae213cf02d343846d253b7149a03819680d/src/coreclr/gc/interface.cpp#L276-L278): `if ((!gc_heap::heap_hard_limit) && gc_heap::use_large_pages_p)` → returns `CLR_E_GC_LARGE_PAGE_MISSING_HARD_LIMIT`
4. GC initialization fails →...
</details>
<!-- START COPILOT CODING AGENT SUFFIX -->
*This pull request was created from Copilot chat.*
>
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: MichalStrehovsky <13110571+MichalStrehovsky@users.noreply.github.com>
Co-authored-by: Michal Strehovský <MichalStrehovsky@users.noreply.github.com>