Re: regarding the x86_64 zero-based percpu patches

From: Eric W. Biederman
Date: Mon Jan 12 2009 - 12:48:22 EST


Christoph Lameter <cl@xxxxxxxxxxxxxxxxxxxx> writes:

> On Sat, 10 Jan 2009, Rusty Russell wrote:
>
>> > As I was trying to do more stuff per-cpu
>> > (not putting a lot of stuff into the per-cpu area, but even with small
>> > things the limited per-cpu area poses scalability problems), cpu_alloc
>> > seems to fit the bill better.
>>
>> Unfortunately cpu_alloc didn't solve this problem either.
>>
>> We need to grow the areas, but for NUMA layouts it's non-trivial. I don't
>> like the idea of remapping: one TLB entry per page per cpu is going to suck.
>> Finding pages which are "congruent" with the original percpu pages is more
>> promising, but it will almost certainly need to elbow pages out of the way to
>> have a chance of succeeding on a real system.
>
> An allocation automatically falls back to the nearest node on NUMA.
> cpu_to_node() gives you the current node.
>
> There are 2MB TLB entries on x86_64. If we really get into a high-usage
> scenario then the 2MB entry makes sense. Average server memory sizes are
> likely already way beyond 10G per box. The higher that goes, the more
> reasonable the 2MB TLB entry will be.

2MB of per cpu data doesn't make sense, and likely indicates a design
flaw somewhere. There is simply no good reason to allocate large
amounts of data per cpu.

The most common user of per cpu data I am aware of is allocating one
word per cpu for counters.
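
For concreteness, a minimal sketch of that pattern using the existing
DEFINE_PER_CPU / get_cpu_var interfaces (illustrative only; the counter
name nr_events is made up):

#include <linux/percpu.h>
#include <linux/cpumask.h>

/* One counter word per cpu; updates only ever touch the local copy. */
static DEFINE_PER_CPU(unsigned long, nr_events);

static void count_event(void)
{
	/* get_cpu_var() disables preemption around the local increment. */
	get_cpu_var(nr_events)++;
	put_cpu_var(nr_events);
}

/* Reading a total still has to walk every cpu's instance. */
static unsigned long total_events(void)
{
	unsigned long sum = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		sum += per_cpu(nr_events, cpu);
	return sum;
}

The point is that the fast path touches one word of the local cpu's
area and nothing else.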

What would be better is simply to:
- Require a lock to access another cpu's per cpu data.
- Do large page allocations for the per cpu data.

At which point we could grow the per cpu data by simply reallocating it on
each cpu and updating the register that holds the base pointer.
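
A rough sketch of one way that could look (purely illustrative, not
from any patch: grow_percpu_area(), switch_percpu_base() and
percpu_area_lock are made-up names, the cpu's per_cpu_offset() is
treated as its area base, and freeing the old area plus the cpu racing
against its own data are waved away):

#include <linux/percpu.h>
#include <linux/gfp.h>
#include <linux/smp.h>
#include <linux/spinlock.h>
#include <linux/string.h>
#include <linux/topology.h>
#include <asm/msr.h>

/* Hypothetical lock serializing all access to another cpu's per cpu data. */
static DEFINE_SPINLOCK(percpu_area_lock);

/* Runs on the target cpu: repoint the per cpu base register (GS base). */
static void switch_percpu_base(void *new_base)
{
	wrmsrl(MSR_GS_BASE, (unsigned long)new_base);
}

/* Grow one cpu's area: large node-local allocation, copy, switch the base. */
static int grow_percpu_area(int cpu, size_t old_size, size_t new_size)
{
	void *old = (void *)per_cpu_offset(cpu);
	struct page *pages;
	void *new;

	pages = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL,
				 get_order(new_size));
	if (!pages)
		return -ENOMEM;
	new = page_address(pages);

	spin_lock(&percpu_area_lock);
	memcpy(new, old, old_size);
	smp_call_function_single(cpu, switch_percpu_base, new, 1);
	spin_unlock(&percpu_area_lock);

	/* The old area can be reclaimed once nothing caches old addresses. */
	return 0;
}

Because remote accesses all go through the lock, nobody can be
dereferencing the old area while the copy and the base switch happen.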

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/