RE: touch_cache() only touches two thirds

From: dean gaudet
Date: Fri Nov 17 2006 - 23:18:47 EST


On Fri, 10 Nov 2006, Bela Lubkin wrote:

> The corrected code in <http://bugzilla.kernel.org/show_bug.cgi?id=7476#c4>
> covers the full cache range. Granted that modern CPUs may be able to track
> multiple simultaneous cache access streams: how many such streams are they
> likely to be able to follow at once? It seems like going from 1 to 2 would
> be a big win, 2 to 3 a small win, beyond that it wouldn't likely make much
> incremental difference. So what do the actual implementations in the field
> support?

p-m family, core, and core2 track one stream on each of 12 to 16 pages. in
the earlier ones they split the trackers into some forward-only and some
backward-only, but on core2 i think they're all bidirectional. if i had
to guess, they round-robin the trackers, so once you hit 17 pages with
streams they're defeated.

a p4 (cpuid 0f0403, probably "prescott") i have here tracks 16 -- it seems
to use LRU or pLRU, but i haven't really tested; you need to get out past 32
streams before it really starts falling off... and even then the next-line
prefetch in the L2 helps too much (64-byte lines, but 128-byte tags and a
pair of dirty/state bits -- it prefetches the other half of a pair
automatically). oh, and it can track forward or backward, and is happy with
strides up to 128 bytes.

k8 rev F tracks one stream on each of 20 pages (forward or backward). it
also seems to use round-robin, and is defeated as soon as you have 21
streams.

i swear there was an x86 which did 28 streams, but it was a few years ago
that i last looked really closely at the prefetchers and i don't have
access to the data at the moment.

i suggest that streams are the wrong approach. i was actually thinking about
this same problem this week, so i was happy to see your thread.

the approach i was considering was to set up two pointer chases:

one pointer chase covering enough cache lines (and in a prefetchable
ordering) for "scrubbing" the cache(s).
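
to be concrete, something like the sketch below is what i have in mind for
the scrub chase -- the line size and buffer size are placeholder
assumptions, pick whatever comfortably exceeds the cache you care about:

#include <stdlib.h>

#define LINE_SIZE   64                      /* assumed cache line size */
#define SCRUB_SIZE  (8u << 20)              /* assumed: several times the L2 size */
#define SCRUB_LINES (SCRUB_SIZE / LINE_SIZE)

/* one pointer per cache line, linked in sequential (prefetchable)
 * order through a buffer much larger than the cache being scrubbed */
static void *build_scrub_chase(void)
{
        char *buf;
        size_t i;

        if (posix_memalign((void **)&buf, LINE_SIZE, SCRUB_SIZE) != 0)
                return NULL;

        for (i = 0; i < SCRUB_LINES - 1; i++)
                *(void **)(buf + i * LINE_SIZE) = buf + (i + 1) * LINE_SIZE;
        /* last line wraps back to the first so the chase can be rerun */
        *(void **)(buf + (SCRUB_LINES - 1) * LINE_SIZE) = buf;

        return buf;
}

walking that list end to end once evicts whatever the caches held, and the
sequential ordering keeps the prefetchers working for you, so the scrub
itself stays cheap.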

another pointer chase arranged to fill the L1 (or L2) using many, many
pages. e.g. suppose i wanted to traverse a 32KiB L1 with 64B cache lines:
i'd allocate 512 pages and put one line on each page (pages ordered
randomly), but colour them so that together they fill the L1. this
conveniently fits in a 2MiB huge page on x86, so you could even ameliorate
the TLB pressure from the microbenchmark.

you can actually get away with a pointer every 256 bytes today -- none of
the prefetchers on today's x86 cores consider a 256-byte stride to be
prefetchable. for safety you might want to use 512-byte alignment... this
lets you get away with fewer pages when colouring larger caches.
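
here's a rough sketch of how i'd build the coloured chase, assuming a
32KiB 8-way L1 with 64B lines and 4KiB pages -- all the constants and the
function name are just assumptions for that geometry, and posix_memalign
stands in for however you'd really get a 2MiB-aligned (or huge-page) region:

#include <stdlib.h>

#define PAGE_SIZE   4096
#define LINE_SIZE   64
#define L1_SIZE     (32 * 1024)
#define NLINES      (L1_SIZE / LINE_SIZE)   /* 512 lines, one per page */
#define NSETS       (PAGE_SIZE / LINE_SIZE) /* L1 set index lives in the page offset */

/* one line per page, pages visited in random order, offsets coloured
 * so that together the lines cover every set (and way) of the L1.
 * 512 pages is exactly 2MiB, so the region could come from a single
 * huge page to take the TLB out of the picture. */
static void *build_coloured_chase(void)
{
        char *region;
        unsigned order[NLINES];
        unsigned i, j, tmp;

        if (posix_memalign((void **)&region, 2 * 1024 * 1024,
                           NLINES * PAGE_SIZE) != 0)
                return NULL;

        /* random permutation of the page order (Fisher-Yates) */
        for (i = 0; i < NLINES; i++)
                order[i] = i;
        for (i = NLINES - 1; i > 0; i--) {
                j = rand() % (i + 1);
                tmp = order[i]; order[i] = order[j]; order[j] = tmp;
        }

        /* page p holds its line at offset (p % NSETS) * LINE_SIZE, so
         * each of the 64 sets is hit 8 times -- once per way */
        for (i = 0; i < NLINES; i++) {
                unsigned p = order[i], q = order[(i + 1) % NLINES];
                *(void **)(region + p * PAGE_SIZE + (p % NSETS) * LINE_SIZE) =
                        region + q * PAGE_SIZE + (q % NSETS) * LINE_SIZE;
        }
        return region + order[0] * PAGE_SIZE + (order[0] % NSETS) * LINE_SIZE;
}

for a larger cache you'd scale NLINES up and, with the 512-byte spacing
above, put more than one line on each page.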

the benchmark i was considering would be like so (rough C sketch after the
steps):

switch to cpu m
scrub the cache
switch to cpu n
scrub the cache
traverse the coloured list and modify each cache line as we go
switch to cpu m
start timing
traverse the coloured list without modification
stop timing
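
in rough C that might look like the sketch below; sched_setaffinity() does
the cpu hopping, clock_gettime() the timing, bind_to_cpu()/chase()/time_one()
are made-up helper names, the chase heads come from builders like the ones
sketched above, and m/n are whatever pair of cpus you want to measure:

#define _GNU_SOURCE
#include <sched.h>
#include <stddef.h>
#include <time.h>

static void bind_to_cpu(int cpu)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        sched_setaffinity(0, sizeof(set), &set);
}

/* walk a pointer chase; if 'modify' is set, dirty each line as we go */
static void *chase(void *head, size_t nlines, int modify)
{
        void *p = head;
        size_t i;

        for (i = 0; i < nlines; i++) {
                void *next = *(void **)p;
                if (modify)
                        ((char *)p)[sizeof(void *)] = 1;  /* touch a byte past the link */
                p = next;
        }
        return p;
}

static void * volatile sink;    /* keep the compiler from dropping the loads */

static unsigned long long time_one(int m, int n,
                                   void *scrub, size_t scrub_lines,
                                   void *col, size_t col_lines)
{
        struct timespec t0, t1;

        bind_to_cpu(m);
        sink = chase(scrub, scrub_lines, 0);    /* scrub the cache on m */

        bind_to_cpu(n);
        sink = chase(scrub, scrub_lines, 0);    /* scrub the cache on n */
        sink = chase(col, col_lines, 1);        /* dirty the coloured lines on n */

        bind_to_cpu(m);
        clock_gettime(CLOCK_MONOTONIC, &t0);
        sink = chase(col, col_lines, 0);        /* timed: read lines now owned by n */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        return (t1.tv_sec - t0.tv_sec) * 1000000000ull
                + t1.tv_nsec - t0.tv_nsec;
}

you'd run time_one() a bunch of times, take the best (or median) run, and
divide by the number of coloured lines to get a per-line cost.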

-dean