Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)

From: Linus Torvalds
Date: Mon Dec 29 2003 - 16:38:38 EST




On Mon, 29 Dec 2003, Eric W. Biederman wrote:
>
> Linus Torvalds <torvalds@xxxxxxxx> writes:
> >
> > Why? Because the fact is, that as memory gets further and further away
> > from CPU's, caches have gotten further and further away from being direct
> > mapped.
>
> Except for L1 caches. The hit of an associative lookup there is
> inherently costly.

Having worked for a hardware company, and talked to hardware engineers, I
can say that it generally isn't all that true.

The reason is that you can start the cache lookup before the TLB lookup
has even finished, and in fact that is exactly why you _want_ a highly
associative L1.

For example, if you have a 16kB L1 cache, and a 4kB page size, and you
want your memory accesses to go fast, you definitely want to index the L1
by the virtual address, which means that you can only use the low 12 bits
(the page offset) for indexing.

So what you do is you make your L1 be 4-way set-associative (each way can
be at most one 4kB page, so 16kB needs four ways), so that by the time the
TLB lookup is done, you've already looked up the index, and you only have
to compare the physical TAG against each of the four possible ways.
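
To make the arithmetic concrete, here is a minimal sketch of such a
lookup. The 64-byte line size and the helper names are made up purely for
illustration; the point is only that every index bit sits below bit 12, so
the virtual and physical index are identical and only the tag compare has
to wait for the TLB.

#include <stdint.h>
#include <stdio.h>

#define LINE_SHIFT      6                        /* assume 64-byte lines */
#define PAGE_SHIFT      12                       /* 4kB pages            */
#define WAYS            4
#define CACHE_SIZE      (16 * 1024)
#define WAY_SIZE        (CACHE_SIZE / WAYS)      /* 4kB == page size     */
#define NUM_SETS        (WAY_SIZE >> LINE_SHIFT) /* 64 sets              */

static unsigned set_index(uint64_t vaddr)
{
        /* Bits [6..11] only: all inside the page offset, so this can be
         * computed from the virtual address before the TLB answers. */
        return (vaddr >> LINE_SHIFT) & (NUM_SETS - 1);
}

static uint64_t phys_tag(uint64_t paddr)
{
        /* Offset bits plus index bits add up to exactly PAGE_SHIFT here,
         * so the tag is simply the physical frame number, which is what
         * the TLB hands back. */
        return paddr >> PAGE_SHIFT;
}

int main(void)
{
        uint64_t vaddr = 0x12345678;    /* made-up virtual address        */
        uint64_t paddr = 0xabcd5678;    /* made-up translation, same page
                                         * offset as the virtual address  */

        printf("set %u, tag %#llx\n", set_index(vaddr),
               (unsigned long long)phys_tag(paddr));
        return 0;
}

Note that set_index() never needs the translation at all; only the tag
compare does, which is why the two lookups can run in parallel.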

In short: you actually _want_ your L1 to be associative, because it's the
best way to avoid having nasty alias issues.

The only people who have a direct-mapped L1 are:
- crazy and/or stupid
- really cheap (mainly embedded space)
- not high-performance anyway (ie their L1 is really small)
- really sorry, and are fixing it.
- really _really_ sorry, and have a virtually indexed cache. In which
case page coloring doesn't matter anyway.

Notice how high performance is _not_ on the list. Because you simply can't
_get_ high performance with a direct-mapped L1. Those days are long gone.
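
And since this thread is about page colouring, here is a sketch of what
colouring even computes, assuming a physically indexed cache (the helper
and the numbers are mine, just for illustration): the colour of a physical
page is the part of the page frame number that ends up in the cache index,
and there are way_size / page_size colours.

#include <stdio.h>

#define PAGE_SIZE       4096UL

static unsigned long page_colour(unsigned long pfn,
                                 unsigned long cache_size,
                                 unsigned long ways)
{
        unsigned long colours = (cache_size / ways) / PAGE_SIZE;

        return colours > 1 ? pfn % colours : 0;
}

int main(void)
{
        /* The 16kB 4-way L1 above: one way == one page, one colour,
         * so there is nothing for the allocator to balance. */
        printf("%lu\n", page_colour(12345, 16 * 1024, 4));

        /* A 16kB direct-mapped cache: four colours, and suddenly the
         * physical placement of pages starts to matter. */
        printf("%lu\n", page_colour(12345, 16 * 1024, 1));
        return 0;
}

In other words, colouring only buys you anything when the way size exceeds
the page size, which for an L1 basically means a direct-mapped (or barely
associative) cache - exactly the kind nobody serious builds any more.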

There is another reason why L1's have long since moved away from
direct-mapped: the miss ratio goes up quite a bit for the same size cache.
And things like OoO are pretty good at hiding one cycle of latency (OoO is
_not_ good at hiding memory latency, but one or two cycles are usually
ok), so even if having a larger L1 (and thus one that is inherently more
complex, not only in its associativity) means that you end up with an
extra cycle of access latency, it's likely a win.

This is, for example, what Alpha did between the 21164 and the 21264: when
they went out-of-order, they did all the simulation to prove that it was
much more efficient to have a larger L1 with a higher hit ratio, even if
the latency was one cycle higher than on the 21164, which was strictly
in-order.

In short, I'll bet you a dollar that you won't see a single direct-mapped
L1 _anywhere_ where it matters. They are already pretty much gone. Can you
name one that doesn't fit the criteria above?

Linus