Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

From: Linus Torvalds
Date: Fri Nov 04 2005 - 11:41:36 EST




On Fri, 4 Nov 2005, Martin J. Bligh wrote:
>
> > So I suspect Martin's 25% is a lot more accurate on modern hardware (which
> > means x86, possibly Power. Nothing else much matters).
>
> It was PPC64, if that helps.

Ok. I bet x86 is even better, but Power (and possibly itanium) is the only
other architecture that comes close.

I don't like the horrible POWER hash-tables, but for static workloads they
should perform almost as well as a sane page table (I say "almost",
because I bet that the high-performance x86 vendors have spent a lot more
time on tlb latency than even IBM has). My dislike for them comes from the
fact that they are really only optimized for static behaviour.

(And HPC is almost always static wrt TLB stuff - big, long-running
processes).

> Well, I think it depends on the workload a lot. However fast your TLB is,
> if we move from "every cacheline read requires is a TLB miss" to "every
> cacheline read is a TLB hit" that can be a huge performance knee however
> fast your TLB is. Depends heavily on the locality of reference and size
> of data set of the application, I suspect.

I'm sure there are really pathological examples, but the thing is, they
won't be on reasonable code.

Some modern CPU's have TLB's that can span the whole cache. In other
words, if your data is in _any_ level of caches, the TLB will be big
enough to find it.

Yes, that's not universally true, and when it's true, the TLB is two-level
and you can have loads where it will usually miss in the first level, but
we're now talking about loads where the _data_ will then always miss in
the first level cache too. So the TLB miss cost will always be _lower_
than the data miss cost.

Right now, you should buy Opteron if you want that kind of large TLB. I
_think_ Intel still has "small" TLB's (the cpuid information only goes up
to 128 entries, I think), but at least Intel has a really good fill. And I
would bet (but have no first-hand information) that next generation
processors will only get bigger TLB's. These things don't tend to shrink.

(Itanium also has a two-level TLB, but it's absolutely pitiful in size).

NOTE! It is absolutely true that for a few years we had regular caches
growing much faster than TLB's. So there are unquestionably unbalanced
machines out there. But it seems that CPU designers started noticing, and
every indication is that TLB's are catching up.

In other words, adding lots of kernel complexity is the wrong thing in the
long run. This is not a long-term problem, and even in the short term you
can fix it by just selecting the right hardware.

In todays world, AMD leads with bug TLB's (1024-entry L2 TLB), but Intel
has slightly faster fill and the AMD TLB filtering is sadly turned off on
SMP right now, so you might not always get the full effect of the large
TLB (but in HPC you probably won't have task switching blowing your TLB
away very often).

PPC64 has the huge hashed page tables that work well enough for HPC.

Itanium has a pitifully small TLB, and an in-order CPU, so it will take a
noticeably bigger hit on TLB's than x86 will. But even Itanium will be a
_lot_ better than MIPS was.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/