Re: pcpu allocator on large NUMA machines
From: Michael Ellerman
Date: Mon Jul 24 2017 - 21:26:15 EST
Michal Hocko <mhocko@xxxxxxxxxx> writes:
> On Mon 24-07-17 09:57:14, Tejun Heo wrote:
>> On Mon, Jul 24, 2017 at 03:42:40PM +0200, Michal Hocko wrote:
> [...]
>> > My understanding of the pcpu allocator is basically close to zero but it
>> > seems weird to me that we would need many TB of vmalloc address space
>> > just to allocate vmalloc areas that are in range of hundreds of MB. So I
>> > am wondering whether this is an expected behavior of the allocator or
>> > there is a problem somewhere else.
>>
>> It's not actually using the entire region but the area allocations try
>> to follow the same topology as kernel linear address layouts. ie. if
>> kernel address for different NUMA nodes are apart by certain amount,
>> the percpu allocator tries to replicate that for dynamic allocations
>> which allows leaving the static and first dynamic area in the kernel
>> linear address which helps reducing TLB pressure.
>>
>> This optimization can be turned off when vmalloc area isn't spacious
>> enough by using pcpu_page_first_chunk() instead of
>> pcpu_embed_first_chunk() while initializing percpu allocator.
>
> Thanks for the clarification, this is really helpful!
>
>> Can you
>> see whether replacing that in arch/powerpc/kernel/setup_64.c fixes the
>> issue? If so, all it needs to do is figuring out what conditions we
>> need to check to opt out of embedding the first chunk. Note that x86
>> 32bit does about the same thing.
>
> Hmm, I will need some help from PPC guys here. I cannot find something
> ready to implement pcpup_populate_pte and I am not familiar with ppc
> memory model to implement one myself.

I don't think we want to stop using embed first chunk unless we have to.
We have code that accesses percpu variables in real mode (with the MMU
off), and that wouldn't work easily if the first chunk wasn't in the
linear mapping. So it's not just an optimisation for us.

We can fairly easily make the vmalloc space 56T, and I'm working on a
patch to make it ~500T on newer machines.

cheers