Re: [PATCH -V2] mm: fix draining PCP of remote zone
From: Andrew Morton
Date: Mon Oct 09 2023 - 20:42:19 EST
On Sat, 7 Oct 2023 14:23:56 +0800 Huang Ying <ying.huang@xxxxxxxxx> wrote:
> If there is no memory allocation/freeing in the PCP (Per-CPU Pageset)
> of a remote zone (zone in remote NUMA node) after some time (3 seconds
> for now), the pages of the PCP of the remote zone will be drained to
> avoid memory wastage.
>
> This behavior was introduced in the commit 4ae7c03943fc ("[PATCH]
> Periodically drain non local pagesets") and the commit
> 4037d452202e ("Move remote node draining out of slab allocators").
>
> But, after the commit 7cc36bbddde5 ("vmstat: on-demand vmstat workers
> V8"), the vmstat updater worker which is used to drain the PCP of
> remote zones may not be re-queued when we are waiting for the
> timeout (pcp->expire != 0) if there are no vmstat changes on this CPU,
> for example, when the CPU goes idle or runs user space only workloads.
> This may cause the pages of a remote zone to be kept in the PCP of this
> CPU for a long time, so page reclaim of the remote zone may be
> triggered prematurely. This isn't a severe problem in practice,
> because the PCP of the remote zone will be drained if some memory is
> allocated/freed again on this CPU. Also, the PCP will eventually be
> drained during direct reclaim if necessary.
>
> Anyway, the problem still deserves a fix via guaranteeing that the
> vmstat updater worker will always be re-queued when we are waiting for
> the timeout. In effect, this restores the original behavior before
> the commit 7cc36bbddde5.
>
> We can reproduce the bug by allocating/freeing pages from a remote
> zone and then going idle, as follows. The patch fixes it.
>
> - Run some workloads, use `numactl` to bind CPU to node 0 and memory to
> node 1. So the PCP of the CPU on node 0 for zone on node 1 will be
> filled.
>
> - After workloads finish, idle for 60s
>
> - Check /proc/zoneinfo
>
> With the original kernel, the number of pages in the PCP of the CPU on
> node 0 for zone on node 1 is non-zero after idle. With the patched
> kernel, it becomes 0 after idle. That is, we avoid keeping pages in
> the remote PCP during idle.
>
Thanks, I updated the changelog in place and queued this for mm-stable.
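
For the "Check /proc/zoneinfo" step in the reproduction above, reading the per-CPU counts by eye gets tedious on large machines. Below is a rough sketch of a helper that extracts them; it is not part of the patch. The function name pcp_counts is made up for illustration, and the parsing assumes the usual layout ("Node N, zone NAME" headers followed by "cpu:" and "count:" lines under "pagesets"), which may differ between kernel versions.

```python
import re

# Assumed /proc/zoneinfo layout; adjust the patterns if your kernel differs.
NODE_RE = re.compile(r"^Node\s+(\d+),\s+zone\s+(\S+)")
CPU_RE = re.compile(r"^\s*cpu:\s*(\d+)")
COUNT_RE = re.compile(r"^\s*count:\s*(\d+)")


def pcp_counts(text):
    """Return {(node, zone, cpu): count} parsed from zoneinfo text."""
    counts = {}
    node = zone = cpu = None
    for line in text.splitlines():
        m = NODE_RE.match(line)
        if m:
            # New zone section: remember it and reset the current CPU.
            node, zone, cpu = int(m.group(1)), m.group(2), None
            continue
        m = CPU_RE.match(line)
        if m:
            cpu = int(m.group(1))
            continue
        m = COUNT_RE.match(line)
        if m and node is not None and cpu is not None:
            counts[(node, zone, cpu)] = int(m.group(1))
    return counts
```

To use it, pass in the contents of /proc/zoneinfo after the idle period and look at the entries whose node is 1 and whose cpu belongs to node 0. Per the description above, those counts are expected to be non-zero on the original kernel and 0 on the patched one.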