Re: [PATCH v2] mm/percpu, memcontrol: Per-memcg-lruvec percpu accounting
From: Joshua Hahn
Date: Tue Apr 07 2026 - 23:40:34 EST
On Wed, 8 Apr 2026 11:40:27 +0900 "Harry Yoo (Oracle)" <harry@xxxxxxxxxx> wrote:
> On Fri, Apr 03, 2026 at 08:38:43PM -0700, Joshua Hahn wrote:
> > enum memcg_stat_item includes memory that is tracked on a per-memcg
> > level, but not at a per-node (and per-lruvec) level. Diagnosing
> > memory pressure for memcgs in multi-NUMA systems can be difficult,
> > since not all of the memory accounted in memcg can be traced back
> > to a node. In scenarios where NUMA nodes in a memcg are asymmetrically
> > stressed, this difference can be invisible to the user.
> >
> > Convert MEMCG_PERCPU_B from a memcg_stat_item to a memcg_node_stat_item
> > to give visibility into per-node breakdowns for percpu allocations.
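> >
> > A rough sketch of the move (not the exact hunks; NR_PERCPU_B matches
> > the name used in the diff below):
> >
> >         /* include/linux/mmzone.h */
> >         enum node_stat_item {
> >                 ...
> >                 NR_PERCPU_B,    /* percpu allocations, in bytes */
> >                 NR_VM_NODE_STAT_ITEMS
> >         };
> >
> > with MEMCG_PERCPU_B dropped from enum memcg_stat_item in
> > include/linux/memcontrol.h and a matching name string added to
> > mm/vmstat.c.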
> >
> > This will get us closer to being able to know the memcg and physical
> > association of all memory on the system. Specifically for percpu, this
> > granularity will help demonstrate footprint differences on systems with
> > asymmetric NUMA nodes.
> >
> > Because percpu memory is accounted at a sub-PAGE_SIZE level, we must
> > account node level statistics (accounted in PAGE_SIZE units) and
> > memcg-lruvec statistics separately. Account node statistics when the pcpu
> > pages are allocated, and account memcg-lruvec statistics when pcpu
> > objects are handed out.
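> >
> > The allocation side mirrors the free path quoted below, roughly (a
> > sketch, assuming pcpu_alloc_pages() as the counterpart):
> >
> >         /* mm/percpu-vm.c: pcpu_alloc_pages(), after all pages have
> >          * been allocated for all CPUs */
> >         for_each_node(nid)
> >                 mod_node_page_state(NODE_DATA(nid), NR_PERCPU_B,
> >                                     nr_pages * nr_cpus_node(nid) * PAGE_SIZE);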
> >
> > To account these separately, expose mod_memcg_lruvec_state so that it
> > can be used outside of memcontrol.c.
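> >
> > Concretely, that means something like this in include/linux/memcontrol.h
> > (a sketch; the signature mirrors the existing static helper in
> > mm/memcontrol.c):
> >
> >         void mod_memcg_lruvec_state(struct lruvec *lruvec,
> >                                     enum node_stat_item idx, int val);
> >
> > so that mm/percpu.c can update the memcg-lruvec stat directly when an
> > object is handed out.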
> >
> > The memory overhead of this patch is small; it adds 16 bytes
> > per-cgroup-node-cpu. For an example machine with 200 CPUs split across
> > 2 nodes and 50 cgroups in the system, we see a 312.5 kB increase. Note
> > that this is the same cost as any other item in memcg_node_stat_item.
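> > (For the example machine: 16 B * 50 cgroups * 2 nodes * 200 CPUs =
> > 320,000 B ~= 312.5 kB.)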
> >
> > Performance impact is also negligible. These are results from a kernel
> > module which performs 100k percpu allocations via __alloc_percpu_gfp
> > with GFP_KERNEL | __GFP_ACCOUNT in a cgroup, across 20 trials.
> > Batched performs 100k allocations followed by 100k frees, while
> > interleaved performs allocation --> free --> allocation ...
> >
> > +-------------+----------------+--------------+--------------+
> > | Test        | linus-upstream | patch        | diff         |
> > +-------------+----------------+--------------+--------------+
> > | Batched     | 6586 +/- 51    | 6595 +/- 35  | +9 (+0.13%)  |
> > | Interleaved | 1053 +/- 126   | 1085 +/- 113 | +32 (+0.85%) |
> > +-------------+----------------+--------------+--------------+
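> >
> > For reference, the batched loop is essentially the following (a
> > reconstruction for illustration only; the module itself was not
> > posted, and the 64-byte size / 8-byte alignment are placeholders):
> >
> >         static void __percpu *ptrs[100000];
> >         int i;
> >
> >         /* batched: 100k accounted allocations, then 100k frees */
> >         for (i = 0; i < 100000; i++)
> >                 ptrs[i] = __alloc_percpu_gfp(64, 8,
> >                                              GFP_KERNEL | __GFP_ACCOUNT);
> >         for (i = 0; i < 100000; i++)
> >                 free_percpu(ptrs[i]);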
> >
> > One functional change is that there can be a tiny inconsistency between
> > the allocation size used for memcg limit checking and the amount
> > charged to each lruvec, because fractional charges are dropped when
> > rounding. In practice the difference is a few bytes at most and always
> > falls on the side of the limit check seeing the larger size, so there
> > is no behavioral change visible from userspace.
> >
> > Signed-off-by: Joshua Hahn <joshua.hahnjy@xxxxxxxxx>
> > ---
> >  include/linux/memcontrol.h |  4 +++-
> >  include/linux/mmzone.h     |  4 +++-
> >  mm/memcontrol.c            | 12 +++++-----
> >  mm/percpu-vm.c             | 14 ++++++++++--
> >  mm/percpu.c                | 45 ++++++++++++++++++++++++++++++++++----
> >  mm/vmstat.c                |  1 +
> >  6 files changed, 66 insertions(+), 14 deletions(-)
> >
> > diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
> > index 4f5937090590d..e36b639f521dd 100644
> > --- a/mm/percpu-vm.c
> > +++ b/mm/percpu-vm.c
> > @@ -65,6 +66,10 @@ static void pcpu_free_pages(struct pcpu_chunk *chunk,
> >                                 __free_page(page);
> >                 }
> >         }
> > +
> > +       for_each_node(nid)
> > +               mod_node_page_state(NODE_DATA(nid), NR_PERCPU_B,
> > +                                   -1L * nr_pages * nr_cpus_node(nid) * PAGE_SIZE);
>
> Can this end up with mis-accounting due to CPU hotplug?

Hey Harry, thanks for giving this patch a look!

Yes, definitely. I think the solution is just to charge based on possible
CPUs, even if that might lead to some inaccuracy (by however many CPUs
aren't online at that moment). Seems like that's what already happens
in memcg anyway, so I think this discrepancy is OK to tolerate.
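
Something like this, maybe (untested sketch; assumes cpu_to_node()
returns a valid node for every possible CPU once the node map is set
up):

        int cpu;

        for_each_possible_cpu(cpu)
                mod_node_page_state(NODE_DATA(cpu_to_node(cpu)), NR_PERCPU_B,
                                    -1L * nr_pages * PAGE_SIZE);

That way the alloc and free paths always see the same per-node CPU
counts, no matter what hotplug does in between.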
Will spin up a v3! Thanks a lot, Harry! Have a great day :-)

Joshua