Re: [PATCHSET x86/core/percpu] implement dynamic percpu allocator

From: Tejun Heo
Date: Sun Feb 22 2009 - 19:44:34 EST


Hello, Ingo.

Ingo Molnar wrote:
> Heck no. It is absolutely crazy to complicate __pa()/__va() in
> _any_ way just to 'save' one more 2MB dTLB.

Are __pa()/__va() really such hot paths? Or am I overestimating the
cost of a 2MB dTLB entry?

> We'll use that TLB because that is what TLBs are for: to handle
> mapped pages. Yes, in the percpu scheme we are working on we'll
> have a 'dual' mapping for the static percpu area (on 64-bit) but
> mapping aliases have been one of the most basic CPU features for
> the past 15 years ...
>
> Even a single NOP in the __pa()/__va() path is _more_ expensive
> than that TLB, believe me.

Alright, I'll believe you. That actually works very nicely for me. :-)
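
As a rough illustration of why any extra work in __pa()/__va() is so
visible: for addresses in the kernel's linear mapping, each conversion
is essentially one add or subtract against a constant base. The sketch
below is a simplified userspace model under that assumption, not the
kernel's implementation. MODEL_PAGE_OFFSET, model_va() and model_pa()
are hypothetical names, and the base value is illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical base of the linear (direct) mapping; illustrative value. */
#define MODEL_PAGE_OFFSET UINT64_C(0xffff880000000000)

/* Simplified __va(): physical -> virtual is a single addition. */
static inline uint64_t model_va(uint64_t phys)
{
	return phys + MODEL_PAGE_OFFSET;
}

/* Simplified __pa(): virtual -> physical is a single subtraction. */
static inline uint64_t model_pa(uint64_t virt)
{
	/* Only addresses inside the modelled linear mapping are valid. */
	assert(virt >= MODEL_PAGE_OFFSET);
	return virt - MODEL_PAGE_OFFSET;
}
```

With the conversion this small, a special case for a remapped percpu
area would add a compare and branch to every call, which is the kind
of cost being argued against above.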

> Look at last year's cheap quad CPU:
>
> Data TLB: 4MB pages, 4-way associative, 32 entries
>
> That's 32x2MB = 64MB of data reach. Our access patterns in the
> kernel tend to be pretty focused as well, so 32 is more than
> enough in practice.
>
> Especially if the pte is cached a TLB fill is very cheap on
> Intel CPUs. So even if we were trashing those 32 entries (which
> we are generally not), having a dTLB for the percpu area is a
> TLB entry well spent.
>
> So let's just do the most simple and most straightforward mapping
> approach which i suggested: it takes advantage of everything, is
> very close to the best possible performance in the cached case -
> and don't worry about hardware resources.
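
The reach figure quoted above can be checked with a few lines of
arithmetic. This is only a sketch: tlb_reach_bytes() and
tlb_share_percent() are hypothetical helper names, and the 2MB page
size is taken from the "32x2MB" calculation in the quote (the large
page size on 64-bit) rather than from the "4MB pages" in the
descriptor line.

```c
#include <assert.h>
#include <stdint.h>

/* TLB reach = number of entries * memory covered by each entry. */
static inline uint64_t tlb_reach_bytes(unsigned int entries, uint64_t page_size)
{
	assert(entries > 0 && page_size > 0);
	return (uint64_t)entries * page_size;
}

/* Share of the dTLB, in percent, taken up by `used` entries. */
static inline double tlb_share_percent(unsigned int used, unsigned int entries)
{
	assert(entries > 0);
	return 100.0 * used / entries;
}
```

Under these assumptions, tlb_reach_bytes(32, 2 << 20) gives 64MB, and
one extra large-page entry for the percpu area is
tlb_share_percent(1, 32), i.e. 3.125% of the large-page dTLB.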

Alright, for NUMA, I'll just remap a large page. For UMA, I already
wrote code to embed it nicely in the existing large page, so I'll keep
it that way. The added code is only about 40 lines, all of it __init
and localized in setup_percpu.c. The NUMA remap also shouldn't take
much code if the __pa()/__va() trick isn't necessary. I'll post the
patches soon.

> The moment you start worrying about hardware resources on that
> level and start 'optimizing' it in software, you've already lost
> it. It leads down the path of soft-TLB handlers and other
> silliness. There's no way you can win such a race against
> hardware fundamentals - at least at today's speed of advance in
> the hw space.

Well, I was hoping to avoid introducing any performance regression
while converting to the new allocator. A performance penalty due to
TLB pressure is especially difficult to measure, so avoiding any
addition there would make accepting the new allocator much easier, but
I gotta admit that I'm not an expert at x86 micro performance tuning.
If you think the overhead is acceptable, I'm a happy camper.

Thanks.

--
tejun