Re: [PATCH 2/8] mm: percpu: charge obj_exts allocation with __GFP_ACCOUNT

From: Alexandre Ghiti

Date: Tue May 26 2026 - 04:35:41 EST



On 5/22/26 10:11, Alexandre Ghiti wrote:
Hi Shakeel,

On 5/21/26 19:25, Shakeel Butt wrote:
On Mon, May 11, 2026 at 10:20:37PM +0200, Alexandre Ghiti wrote:
This is a preparatory patch for upcoming per-memcg-per-node kmem
accounting.

pcpu allocations are always fully charged at once using
pcpu_obj_full_size(), which returns the size of the pcpu "metadata" +
pcpu "payload". But metadata and payload may not be allocated on the
same numa node, so charge the metadata independently from the payload.

Do this by explicitly passing __GFP_ACCOUNT to the obj_exts allocation
and remove its accounting in pcpu_memcg_pre_alloc_hook().
Will all the entries in the obj_exts array be for the same memcg? If not, then why are
we charging the whole array to the one which happens to allocate the array?


Hmm, I overlooked the amount allocated, so that's my mistake: the chunk-allocating memcg will be charged for all the metadata, whereas before the charge was distributed. And according to Claude, the metadata would amount to 64kB, so not negligible.


I realize that I did not mention my setup: I have been testing this series on a 176-core machine, and the 64kB that Claude gave me was based on a 32K unit_size. But actually that's not right. Here is my understanding:

- unit_size is 512K on this machine, which means that each cpu gets a region of this size every time a new chunk is allocated => 176 * 512K = 88MB per chunk

- obj_exts = unit_size / PCPU_MIN_ALLOC_SIZE * sizeof(struct pcpuobj_ext) = (512K / 4) * 8 = 512K * 2 = 1MB (obj_exts holds one memcg pointer for every 4B)

Let me know what you think, but I don't think that's acceptable, so I'm looking into another solution.

Thanks

Alex




Sorry, I don't know the details of the percpu allocator, so asking some dumb
questions:

1. Does alloc_percpu() (& similar functions) allocate the underlying memory on a
    single node or does it allocate memory for each cpu on their local node?
    For slub, it is on the same node, so the situation is easier to handle.


To me, chunk metadata and actual pages are allocated differently:

- pcpu_alloc_pages() tries to allocate the pages on the cpu-local node https://elixir.bootlin.com/linux/v7.0.9/source/mm/percpu-vm.c#L95. But as far as I can tell, there is no guarantee that it won't fall back to another node. And I don't think that __GFP_THISNODE would be a good idea here.

- pcpu_alloc_chunk() uses kmalloc or vmalloc depending on the size, so the metadata is not attached to a specific node; that's why I wanted __GFP_ACCOUNT to do the job for us in the first place.



2. On a typical system how much memory is consumed by obj_exts for the percpu
    allocator chunks? I am wondering if we don't charge it, how much will we
    lose?


So according to my previous answer, 64kB. I have just noticed that a bunch of dynamically allocated chunk fields are not accounted either, which again according to Claude represent 2.3kB. I don't have much experience in accounting, but that's far from negligible, right? How much are we willing to lose to make the code simpler (or for other reasons)?



3. What would be side effect on assuming that obj_exts is on the same node as
    the given chunk?


Given the size of obj_exts, overcharging one node while undercharging others?

To conclude, you're right: I did not dive deep enough into the metadata sizes, so I'll fix that.

Thanks,

Alex