Re: [PATCH] mm/percpu, memcontrol: Per-memcg-lruvec percpu accounting

From: Joshua Hahn

Date: Mon Mar 30 2026 - 10:58:22 EST


On Mon, 30 Mar 2026 16:21:12 +0200 Michal Hocko <mhocko@xxxxxxxx> wrote:

> On Mon 30-03-26 07:10:10, Joshua Hahn wrote:
> > On Mon, 30 Mar 2026 14:03:29 +0200 Michal Hocko <mhocko@xxxxxxxx> wrote:
> >
> > > On Fri 27-03-26 12:19:35, Joshua Hahn wrote:
> > > > Convert MEMCG_PERCPU_B from a memcg_stat_item to a memcg_node_stat_item
> > > > to give visibility into per-node breakdowns for percpu allocations and
> > > > turn it into NR_PERCPU_B.
> > >
> > > Why do we need/want this?
> >
> > Hello Michal,
> >
> > Thank you for reviewing my patch! I hope you are doing well.
> >
> > You're right, I could have done a better job of motivating the patch.
> > My intent with this patch is to give some more visibility into where
> > memory is physically, once you know which memcg it is in.
>
> Please keep in mind that WHY is very often much more important than HOW
> in the patch so you should always start with the intention and
> justification.

Ack, I'll keep that in mind for the future!

> > Percpu memory could probably be seen as "trivial" when it comes to figuring
> > out what node it is on, but I'm hoping to make similar transitions to the
> > rest of enum memcg_stat_item as well (you can see my work for the zswap
> > stats in [1]).
> >
> > When all of this memory is moved from being tracked per-memcg to
> > per-lruvec, the end vision is to be able to attribute node placement
> > within each memcg, which can help with diagnosing things like
> > asymmetric node pressure (something we can currently only do with
> > partial accuracy).
> >
> > Getting per-node breakdowns of percpu memory, orthogonal to memcgs, also
> > seems like a win to me. Even if an imbalance is unlikely, I think we can
> > benefit from some visibility into whether percpu allocations are spread
> > evenly across all CPUs.
> >
> > What do you think? Thank you again, I hope you have a great day!

Thank you for the feedback, Michal. Let me break down your questions so I
can address them one by one:

> I think that you should have started with this intended outcome first
> rather than slicing it in pieces. Why do we want to shift to per-node
> stats for other/all counters?

Yup, ack here as well. Here is a bit more context on how I stumbled on this
in the first place. As you are aware, I'm also working on another series
whose goal is to make memory limits tier-aware [2]. While working on that,
I realized that stats in enum memcg_stat_item have no physical association,
which means there is no way to tell (1) which node / tier the memory is on,
or (2) which node / tier it should be migrated to.

That was the original motivation. Looking deeper, I found that this is not
even a tier-specific problem, but rather a general lack of visibility into
node-level statistics for the user.

As another example, I recently came across a case of socket memory landing
in CXL, which is quite strange. (Was it demoted? Was it a fallback
allocation?) It only became visible after an OOM, when I could use the
vmcore to inspect the data manually and figure out the page placement.

It would be very nice to have this kind of node-level perspective along
with the memcg association. IMO data like this has more value when it can
be analyzed at runtime rather than in a post-mortem with a vmcore, and
there is more we can do by understanding what was happening on the system
when the strange placement occurred.

> What is the cost associated comparing to the
> existing accounting (if any)? Please go into details on how do you plan
> to use the data before we commit into a lot of code churn.

For percpu specifically, I think the cost is minimal, and thankfully these
changes should also have minimal effect on single-NUMA machines. But let
me get some concrete numbers and get back to you so that I can back these
hypotheses up.

> TBH I do not see any fundamental reasons why this would be impossible
> but I am not really sure this is worth the work and I also do not see
> potential subtle issues that we might stumble over when getting there.
> So I would appreciate if you could have a look into that deeper and
> provide us with evaluation on how do you want to achieve your end goal
> and what can we expect on the way. It is, of course, impossible to see
> all potential problems without starting implementing the thing but a
> high level evaluation would be really helpful.

Great to hear that you think this is not impossible ;-)

Yes, I definitely see that there can be some subtle issues. One thing
I'm trying to be very mindful of is locking semantics, i.e. whether we are
introducing any new bottlenecks on the update paths. I'll do some testing
and come back with numbers; hopefully that can instill some more confidence
about the side effects of these patches.

As a note of concern, I do believe that socket memory will be tough to
track accurately, since it uses a different memory accounting model. I hope
there are steps we can take to make it more accurate without introducing
overhead in the socket hot paths, since those are highly
performance-sensitive.

Another concern is what to do with MEMCG_SWAP, which cannot really be
associated with a node. But swap is unique in that it genuinely does not
occupy memory. So maybe at the end of all of this, when MEMCG_SWAP is the
only item left in memcg_stat_item, we can treat it as a single special
case.

Thank you for your thoughts Michal, I greatly appreciate them.
I hope you have a great day!
Joshua

> > [1] https://lore.kernel.org/all/20260311195153.4013476-1-joshua.hahnjy@xxxxxxxxx/
[2] https://lore.kernel.org/all/20260223223830.586018-1-joshua.hahnjy@xxxxxxxxx/