[RFC PATCH 0/6] Use local_lock for pcp protection and reduce stat overhead

From: Mel Gorman
Date: Mon Mar 29 2021 - 08:07:42 EST


This series depends on patches in Andrew's tree, so it is also
available at

git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-percpu-local_lock-v1r15

tldr: Jesper and Chuck, it would be nice to verify whether this series
helps the allocation rate of the bulk page allocator. RT people: this
*partially* addresses some of the problems PREEMPT_RT has with the page
allocator, but it needs review.

The PCP (per-cpu page allocator in page_alloc.c) shares locking
requirements with vmstat, which is inconvenient and causes some issues.
Possibly because of that, the PCP lists and vmstat share the same
per-cpu space, meaning that vmstat updates can dirty the cache lines
holding the per-cpu lists of other CPUs unless padding is used. The
series splits that structure and separates the locking.
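As a rough sketch of the direction (the structure layouts below are
illustrative, not necessarily the series' final form), the split looks
like this:

/* Before: one per-cpu structure holds both the PCP free lists and the
 * vmstat deltas, so a stat update can dirty the cache lines that hold
 * the free lists.
 */
struct per_cpu_pageset {
	struct per_cpu_pages pcp;			/* free lists */
	s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];		/* stat deltas */
};

/* After: two separate per-cpu allocations hung off struct zone, so the
 * lists and the stats no longer share per-cpu space.
 */
struct per_cpu_zonestat {
	s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
	unsigned long vm_numa_event[NR_VM_NUMA_EVENT_ITEMS];
};

struct zone {
	/* ... */
	struct per_cpu_pages __percpu *per_cpu_pageset;
	struct per_cpu_zonestat __percpu *per_cpu_zonestats;
	/* ... */
};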

Second, PREEMPT_RT considers the following sequence to be unsafe, as
documented in Documentation/locking/locktypes.rst:

local_irq_disable();
spin_lock(&lock);

The PCP allocator has this sequence in the call chain rmqueue_pcplist
(local_irq_save) -> __rmqueue_pcplist -> rmqueue_bulk (spin_lock). This
series explicitly separates the locking requirements for the PCP lists
(local_lock) and the stat updates (IRQs disabled). Once that is done,
the length of time IRQs are disabled can be reduced and, in some cases,
IRQ disabling can be replaced with preempt_disable.
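For illustration, the converted fast path has roughly this shape
(arguments trimmed and names simplified; treat the details as a sketch
rather than the exact patch):

struct pagesets {
	local_lock_t lock;
};
static DEFINE_PER_CPU(struct pagesets, pagesets) = {
	.lock = INIT_LOCAL_LOCK(lock),
};

/* On !PREEMPT_RT, local_lock_irqsave() behaves like local_irq_save().
 * On PREEMPT_RT, it takes a per-CPU lock without disabling interrupts,
 * so the spin_lock() in rmqueue_bulk() nests safely.
 */
static struct page *rmqueue_pcplist(struct zone *zone, int migratetype)
{
	struct page *page;
	unsigned long flags;

	local_lock_irqsave(&pagesets.lock, flags);
	page = __rmqueue_pcplist(zone, migratetype); /* may take zone->lock */
	local_unlock_irqrestore(&pagesets.lock, flags);

	return page;
}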

After that, it was very obvious that zone_statistics in particular has
far too much overhead and leaves IRQs disabled for longer than
necessary. It maintains perfectly accurate counters, which require IRQs
to be disabled to protect the parallel RMW sequences, when inaccurate
counters like vm_events would do. The series converts the NUMA
statistics (NUMA_HIT and friends) to inaccurate counters that only
require preemption to be disabled.
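A sketch of the conversion for a single event, using the structures
sketched earlier (the exact helper names in the patches may differ):

/* Lossy per-CPU NUMA event counter. __this_cpu_inc() requires only
 * that preemption is disabled; if an interrupt updates the same
 * counter mid read-modify-write, one event can be lost, which is
 * tolerable for NUMA_HIT and friends, unlike the precise zone
 * counters that need IRQs off.
 */
static inline void count_numa_event(struct zone *zone, enum numa_stat_item item)
{
	struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;

	preempt_disable();
	__this_cpu_inc(pzstats->vm_numa_event[item]);
	preempt_enable();
}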

Finally, the bulk page allocator can then do all the stat updates in
bulk with IRQs enabled, which should improve its efficiency.
Technically, this could have been done without the local_lock and
vmstat conversion work; the order simply reflects the timing of when
the different series were implemented.
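The resulting shape in the bulk allocator is roughly the fragment below
(a sketch based on the description above; zone_statistics() is assumed
to take the batch count as an argument):

/* Take the pages from the PCP list under the local_lock ... */
local_lock_irqsave(&pagesets.lock, flags);

while (nr_populated < nr_pages) {
	page = __rmqueue_pcplist(zone, migratetype);
	if (!page)
		break;

	nr_account++;
	list_add(&page->lru, page_list);
	nr_populated++;
}

local_unlock_irqrestore(&pagesets.lock, flags);

/* ... then do one batched update with IRQs enabled, instead of a
 * per-page update inside the IRQ-disabled section.
 */
__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);
zone_statistics(zone, zone, nr_account);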

No performance data is included because, despite the overhead of the
stats, it is within the noise for most workloads. However, Jesper and
Chuck may observe a significant difference with the same tests used for
the bulk page allocator. The series is more likely to be interesting to
the RT folk in terms of slowly getting the PREEMPT_RT tree into
mainline.

drivers/base/node.c | 18 +--
include/linux/mmzone.h | 29 +++--
include/linux/vmstat.h | 65 ++++++-----
mm/mempolicy.c | 2 +-
mm/page_alloc.c | 173 ++++++++++++++++------------
mm/vmstat.c | 254 +++++++++++++++--------------------------
6 files changed, 254 insertions(+), 287 deletions(-)

--
2.26.2