[release/10.0] Fix heap_segment_used watermark after compaction #128342
janvorli wants to merge 1 commit into
After `compact_phase`, `heap_segment_used` can be stale (lower than the actual end of live data) because `plan_phase` sets `plan_allocated` beyond `used` for regions that receive relocated objects. When `decommit_region` later clears memory only up to `used` instead of `committed` (the large-pages / `never_decommit_p` path), the gap between `used` and `plan_allocated` retains dirty data from a previous region lifetime, causing heap corruption on the next GC cycle.
Fix: At the end of `compact_phase`, bump `heap_segment_used` to `max(used, plan_allocated)` for every non-read-only region in the condemned generations and one generation above (the maximum compaction target range).
The fix costs nothing when no compaction occurs. When compaction does occur, keeping the `used` watermark accurate avoids unnecessary `memclr` in `decommit_region`, so only truly unused memory is cleared.
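The watermark bump described above can be sketched on a toy model. This is not the runtime's code: `Region`, `bump_used_after_compaction`, and `bytes_cleared` are hypothetical names, offsets are plain byte counts rather than addresses, and the "one generation above" rule is assumed to be capped at the maximum generation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-in for a GC region (not the runtime's heap_segment).
struct Region {
    int generation;              // generation the region belongs to
    bool read_only;              // read-only regions are skipped
    std::size_t used;            // high-water mark of memory known to be touched
    std::size_t plan_allocated;  // end of live data planned by the plan phase
};

// Raise the used watermark so it covers plan_allocated.
// Assumption: eligible regions are the non-read-only ones whose generation
// is at most min(condemned_gen + 1, max_gen).
inline void bump_used_after_compaction(std::vector<Region>& regions,
                                       int condemned_gen, int max_gen) {
    const int last_gen = std::min(condemned_gen + 1, max_gen);
    for (Region& r : regions) {
        if (r.read_only || r.generation > last_gen) {
            continue;
        }
        r.used = std::max(r.used, r.plan_allocated);
    }
}

// Bytes a decommit-style clear would zero if it clears only [0, used).
inline std::size_t bytes_cleared(const Region& r) {
    return r.used;
}
```

In this model, a region with `used = 100` and `plan_allocated = 300` would have only 100 bytes cleared before the bump, leaving 200 bytes of possibly dirty memory; after the bump the clear covers all 300.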
Tagging subscribers to this area: @JulieLeeMSFT, @dotnet/gc
Pull request overview
This backport addresses a GC regions + large pages correctness issue where stale heap_segment_used values after compaction could cause decommit_region to clear an insufficient range, allowing dirty memory to be reused and leading to heap corruption/crashes.
Changes:
- In `compact_phase` (regions mode), updates each affected region’s `heap_segment_used` to cover `max(used, plan_allocated)` for condemned generations and one generation above.
- In `decommit_heap_segment` (regions mode), skips decommitting when large pages are enabled to avoid incorrect “logical decommit” behavior (large-page decommit is a no-op).
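The second change amounts to an early-return guard. A minimal sketch of that idea, assuming a hypothetical `SegmentModel` type and `decommit_segment_model` function; the bookkeeping shown here is illustrative and not taken from the runtime:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical bookkeeping for one segment; offsets are byte counts.
struct SegmentModel {
    std::size_t committed;  // end of committed memory
    std::size_t used;       // high-water mark of touched memory
};

// Returns true if the committed range was actually shrunk.
inline bool decommit_segment_model(SegmentModel& seg,
                                   std::size_t new_committed,
                                   bool use_large_pages) {
    // With large pages the OS-level decommit is a no-op, so the bookkeeping
    // must not change either; otherwise it would describe memory as
    // decommitted (and later as fresh) when it still holds old data.
    if (use_large_pages) {
        return false;
    }
    if (new_committed >= seg.committed) {
        return false;  // nothing to give back
    }
    seg.committed = new_committed;
    seg.used = std::min(seg.used, new_committed);
    return true;
}
```

The point of the guard is that the bookkeeping and the real state of memory stay in agreement: when nothing is physically decommitted, nothing is recorded as decommitted.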
cc: @BenV
Just wanted to confirm that this appears to resolve the issue on .NET 10 as expected. I'll keep running the tests for the full 4 hours just in case and report back. Thanks again for all your support @janvorli @cshung @mangod9, we really appreciate it! Edit: The stress tests passed full 4-hour runs on both of my test machines!
Fixes #127687.
@janvorli, once it is code reviewed, we can merge.
{
    return;
}
This block is not in the net11 fix. Is it a part of something else?
Backport of #128217 to release/10.0
Customer Impact
GC with large pages enabled in regions mode can lead to intermittent crashes due to non-zeroed memory being returned for an allocation request that expects the memory to be zeroed.
Regression
Testing
CI tests, local testing using targeted repro app from the customer, GC tests
Risk
Low. The change maintains the `heap_segment_used` watermark after compaction so that it covers all the touched memory in a region. Before this change, the watermark was stale (lower) for regions that receive relocated objects.