Re: [PATCHSET x86/core/percpu] implement dynamic percpu allocator

From: Ingo Molnar
Date: Sun Feb 22 2009 - 14:38:44 EST

Next message: Frederic Weisbecker: "Re: [TIP] BUG kmalloc-4096: Poison overwritten (ath5k_rx_skb_alloc)"
Previous message: Thomas Gleixner: "Re: [Bug #12667] Badness at kernel/time/timekeeping.c:98 in pmud(timekeeping_suspended)"
In reply to: Tejun Heo: "Re: [PATCHSET x86/core/percpu] implement dynamic percpu allocator"
Next in thread: Tejun Heo: "Re: [PATCHSET x86/core/percpu] implement dynamic percpu allocator"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

* Tejun Heo <tj@xxxxxxxxxx> wrote:

> Tejun Heo wrote:
> > I can remove the TLB problem from non-NUMA case but for NUMA I still
> > don't have a good idea. Maybe we need to accept the overhead for
> > NUMA? I don't know.
>
> Hmmmm... one thing we can do on NUMA is to remap and free the
> remapped address and make __pa() and __va() handle that area
> specially. It's a bit convoluted but the added overhead
> should be minimal. It'll only be a simple range check in
> __pa()/__va() and it's not like they are super hot paths
> anyway. I'll give it a shot.

Heck no. It is absolutely crazy to complicate __pa()/__va() in
_any_ way just to 'save' one more 2MB dTLB entry.

We'll use that TLB because that is what TLBs are for: to handle
mapped pages. Yes, in the percpu scheme we are working on we'll
have a 'dual' mapping for the static percpu area (on 64-bit) but
mapping aliases have been one of the most basic CPU features for
the past 15 years ...

Even a single NOP in the __pa()/__va() path is _more_ expensive
than that TLB, believe me.
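
For readers who want to see the shape of the proposal being rejected, here
is a minimal userspace sketch of a virtual-to-physical helper with and
without an extra range check. It is not the kernel's actual __pa(): every
name and constant below (the SKETCH_* macros, sketch_pa_plain(),
sketch_pa_checked()) is hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout, for illustration only. */
#define SKETCH_DIRECT_BASE 0xffff880000000000ULL /* start of linear map   */
#define SKETCH_REMAP_BASE  0xffffc90000000000ULL /* start of remapped area */
#define SKETCH_REMAP_SIZE  (2ULL << 20)          /* one 2MB large page     */
#define SKETCH_REMAP_PHYS  0x0000000040000000ULL /* physical backing       */

/* Plain linear translation: a single subtraction. */
static inline uint64_t sketch_pa_plain(uint64_t vaddr)
{
    return vaddr - SKETCH_DIRECT_BASE;
}

/*
 * Translation with the special case added. Unsigned subtraction wraps
 * around, so one compare covers both the lower and the upper bound of
 * the remapped area. Every caller pays for the compare and the branch,
 * including the vast majority that never touch the remapped area.
 */
static inline uint64_t sketch_pa_checked(uint64_t vaddr)
{
    if (vaddr - SKETCH_REMAP_BASE < SKETCH_REMAP_SIZE)
        return SKETCH_REMAP_PHYS + (vaddr - SKETCH_REMAP_BASE);
    return vaddr - SKETCH_DIRECT_BASE;
}
```

The extra compare and branch sit in every translation, which is the cost
being weighed against a single dTLB entry in the argument above.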

Look at last year's cheap quad CPU:

Data TLB: 4MB pages, 4-way associative, 32 entries

On 64-bit, where large pages are 2MB rather than 4MB, that's
32 x 2MB = 64MB of data reach. Our access patterns in the kernel
tend to be pretty focused as well, so 32 entries is more than
enough in practice.
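
The reach arithmetic can be checked with a small C sketch. The helper names
tlb_reach_bytes() and tlb_share_percent() are made up for this example; the
32 entries and the 2MB page size are the figures from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of address space covered when every TLB entry maps one page. */
static inline uint64_t tlb_reach_bytes(uint64_t entries, uint64_t page_size)
{
    assert(entries > 0 && page_size > 0);
    return entries * page_size;
}

/* Share of the TLB, in percent, taken up by 'used' of 'entries' entries. */
static inline double tlb_share_percent(uint64_t used, uint64_t entries)
{
    assert(entries > 0);
    return 100.0 * (double)used / (double)entries;
}
```

With 32 entries and 2MB pages, tlb_reach_bytes(32, 2ULL << 20) gives 64MB,
and one entry dedicated to the percpu area is 1/32 of the large-page dTLB.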

Especially if the pte is cached, a TLB fill is very cheap on
Intel CPUs. So even if we were thrashing those 32 entries (which
we are generally not), having a dTLB entry for the percpu area is
a TLB entry well spent.

So let's just do the simplest and most straightforward mapping
approach, which I suggested: it takes advantage of everything and
is very close to the best possible performance in the cached
case. Don't worry about hardware resources.

The moment you start worrying about hardware resources on that
level and start 'optimizing' it in software, you've already lost
it. It leads down the path of soft-TLB handlers and other
silliness. There's no way you can win such a race against
hardware fundamentals - at least at today's speed of advance in
the hw space.

Ingo
