Commit 19c5fb8

hnaz authored and akpm00 committed
mm: page_alloc: avoid kswapd thrashing due to NUMA restrictions
On NUMA systems without bindings, allocations check all nodes for free space, then wake up the kswapds on all nodes and retry. This ensures all available space is evenly used before reclaim begins. However, when one process or certain allocations have node restrictions, they can cause kswapds on only a subset of nodes to be woken up. Since kswapd hysteresis targets watermarks that are *higher* than needed for allocation, even *unrestricted* allocations can now get suckered onto such nodes that are already pressured. This ends up concentrating all allocations on them, even when there are idle nodes available for the unrestricted requests. This was observed with two numa nodes, where node0 is normal and node1 is ZONE_MOVABLE to facilitate hotplugging: a kernel allocation wakes kswapd on node0 only (since node1 is not eligible); once kswapd0 is active, the watermarks hover between low and high, and then even the movable allocations end up on node0, only to be kicked out again; meanwhile node1 is empty and idle. Similar behavior is possible when a process with NUMA bindings is causing selective kswapd wakeups. To fix this, on NUMA systems augment the (misleading) watermark test with a check for whether kswapd is already active during the first iteration through the zonelist. If this fails to place the request, kswapd must be running everywhere already, and the watermark test is good enough to decide placement. With this patch, unrestricted requests successfully make use of node1, even while kswapd is reclaiming node0 for restricted allocations. 
[gourry@gourry.net: don't retry if no kswapds were active]
Link: https://lkml.kernel.org/r/20250919162134.1098208-1-hannes@cmpxchg.org
Signed-off-by: Gregory Price <gourry@gourry.net>
Tested-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
1 parent fde591d commit 19c5fb8

File tree

1 file changed: 24 additions, 0 deletions


mm/page_alloc.c

Lines changed: 24 additions & 0 deletions
@@ -3735,6 +3735,8 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 	struct pglist_data *last_pgdat = NULL;
 	bool last_pgdat_dirty_ok = false;
 	bool no_fallback;
+	bool skip_kswapd_nodes = nr_online_nodes > 1;
+	bool skipped_kswapd_nodes = false;
 
 retry:
 	/*
@@ -3797,6 +3799,19 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 			}
 		}
 
+		/*
+		 * If kswapd is already active on a node, keep looking
+		 * for other nodes that might be idle. This can happen
+		 * if another process has NUMA bindings and is causing
+		 * kswapd wakeups on only some nodes. Avoid accidental
+		 * "node_reclaim_mode"-like behavior in this case.
+		 */
+		if (skip_kswapd_nodes &&
+		    !waitqueue_active(&zone->zone_pgdat->kswapd_wait)) {
+			skipped_kswapd_nodes = true;
+			continue;
+		}
+
 		cond_accept_memory(zone, order, alloc_flags);
 
 		/*
@@ -3888,6 +3903,15 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		}
 	}
 
+	/*
+	 * If we skipped over nodes with active kswapds and found no
+	 * idle nodes, retry and place anywhere the watermarks permit.
+	 */
+	if (skip_kswapd_nodes && skipped_kswapd_nodes) {
+		skip_kswapd_nodes = false;
+		goto retry;
+	}
+
 	/*
 	 * It's possible on a UMA machine to get through all zones that are
 	 * fragmented. If avoiding fragmentation, reset and try again.
