Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

From: Ingo Molnar
Date: Fri Nov 04 2005 - 11:40:46 EST



* Linus Torvalds <torvalds@xxxxxxxx> wrote:

> Time it on a real machine some day. On a modern x86, you will fill a
> TLB entry in anything from 1-8 cycles if it's in L1, and add a couple
> of dozen cycles for L2.

below is my (x86-only) testcode that accurately measures TLB miss costs
in cycles. (Has to be run as root, because it uses 'cli' as the
serializing instruction.)
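
(for reference, a cpuid-based serializing sequence would avoid the
root requirement - a minimal, untested sketch, not what the code
below uses:

static inline unsigned long long rdtsc_serialized(void)
{
	unsigned int lo, hi;

	__asm__ __volatile__ (
		"xorl %%eax, %%eax\n\t"
		"cpuid\n\t"		/* serializing instruction, no iopl() needed */
		"rdtsc"			/* timestamp into %edx:%eax */
		: "=a" (lo), "=d" (hi)
		:
		: "ebx", "ecx", "memory");

	return ((unsigned long long)hi << 32) | lo;
}

cpuid itself costs a lot more cycles than cli, so the overhead
calibration matters correspondingly more.)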

here's the output from the default 128MB (32768 4K pages) random access
pattern workload, on a 2 GHz P4 (which has 64 dTLB entries):

0 24 24 24 12 12 0 0 16 0 24 24 24 12 0 12 0 12

32768 randomly accessed pages, 13 cycles avg, 73.751831% TLB misses.

i.e. really cheap TLB misses even in this very bad, TLB-thrashing
scenario: there are only 64 dTLB entries and we have 32768 pages - the
TLB entries are outnumbered 512 to 1! Still the CPU gets it right.
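
(the arithmetic behind that ratio, spelled out - values taken from the
text above:

	128 MB / 4 KB per page          = 32768 pages in the working set
	32768 pages / 64 dTLB entries   = 512 pages competing for each entry
)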

setting LINEAR to 1 gives an embarrassing result:

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

32768 linearly accessed pages, 0 cycles avg, 0.259399% TLB misses.

showing that the pagetables got fully cached (probably in L1), which
has _zero_ overhead. Truly remarkable.

lowering the size to 16 MB (still a 1:64 TLB-to-working-set-size
ratio!) gives:

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

4096 randomly accessed pages, 0 cycles avg, 5.859375% TLB misses.

so near-zero TLB overhead.

increasing BYTES to half a gigabyte gives:

2 0 12 12 24 12 24 264 24 12 24 24 0 0 24 12 24 24 24 24 24 24 24 24 12
12 24 24 24 36 24 24 0 24 24 0 24 24 288 24 24 0 228 24 24 0 0

131072 randomly accessed pages, 75 cycles avg, 94.162750% TLB misses.

so an occasional ~220-cycle cache miss (~100 nsec, i.e. roughly DRAM
latency), but the average is still 75 cycles, or ~37 nsecs - which is
still only ~37% of the DRAM latency.
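
(the cycles-to-nanoseconds conversion used above, as a trivial helper -
just a sketch, with the 2 GHz clock of this particular box passed in:

static double cycles_to_nsecs(unsigned long cycles, double ghz)
{
	return (double)cycles / ghz;	/* e.g. 75 cycles / 2.0 GHz = 37.5 nsec */
}
)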

(NOTE: the test eliminates most data cache misses by using zero-mapped
anonymous memory, so only a single data page exists - the costs seen
here are thus mostly TLB misses.)
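
(the zero-mapping trick in isolation - a minimal sketch of what the
mmap() call in the test relies on: a private, read-only anonymous
mapping is never written to, so the kernel backs every virtual page
with the shared zero page. That means many distinct TLB entries, but
essentially a single physical data page sitting in the caches:

	char *buf = mmap(NULL, BYTES, PROT_READ,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
)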

Ingo

---------------
/*
* TLB miss measurement on PII CPUs.
*
* Copyright (C) 1999, Ingo Molnar <mingo@xxxxxxxxxx>
*/
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/wait.h>
#include <sys/mman.h>

#define BYTES (128*1024*1024)
#define PAGES (BYTES/4096)

/* This define turns on the linear mode.. */
#define LINEAR 0

#if 1
# define BARRIER "cli"				/* needs iopl(3), i.e. root */
#else
# define BARRIER "lock ; addl $0,0(%%esp)"	/* unprivileged alternative */
#endif

int do_test (char * addr)
{
	unsigned long start, end;

	/*
	 * 'cli' is used as a serializing instruction to
	 * isolate the benchmarked instruction from rdtsc.
	 */
	__asm__ (
		"jmp 1f; 1: .align 128\n\t"
		BARRIER "\n\t"
		"rdtsc\n\t"			/* first timestamp into %eax	*/
		"movl %0, %1\n\t"		/* save it in 'start' (%ecx)	*/
		BARRIER "\n\t"
		"movl (%%esi), %%eax\n\t"	/* the measured memory access	*/
		BARRIER "\n\t"
		"rdtsc\n\t"			/* second timestamp into %eax	*/
		BARRIER
		: "=a" (end), "=c" (start)
		: "S" (addr)
		: "dx", "memory");

	return end - start;
}

extern int iopl(int);

int main (void)
{
	unsigned long overhead, sum;
	int j, k, c, hit;
	int matrix [PAGES];
	int delta [PAGES];
	char *buffer = mmap(NULL, BYTES, PROT_READ,
			    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	iopl(3);
	/*
	 * first generate a random access pattern.
	 */
	for (j = 0; j < PAGES; j++) {
		unsigned long val;
#if LINEAR
		val = ((j*8) % PAGES) * 4096;
		val = j*2048;	/* this assignment wins: walk the first half
				   of the buffer linearly, 2 accesses/page */
#else
		val = (random() % PAGES) * 4096;
#endif
		matrix[j] = val;
	}

	/*
	 * Calculate the overhead
	 */
	overhead = ~0UL;
	for (j = 0; j < 100; j++) {
		unsigned int diff = do_test(buffer);
		if (diff < overhead)
			overhead = diff;
	}
	printf("Overhead = %lu cycles\n", overhead);

	/*
	 * 10 warmup loops, the last one is printed.
	 */
	for (k = 0; k < 10; k++) {
		c = 0;
		for (j = 0; j < PAGES; j++) {
			char * addr;

			addr = buffer + matrix[j];
			delta[c++] = do_test(addr);
		}
	}

	hit = 0;
	sum = 0;
	for (j = 0; j < PAGES; j++) {
		unsigned long d = delta[j] - overhead;

		printf("%lu ", d);
		if (d <= 1)
			hit++;
		sum += d;
	}
	printf("\n");
	printf("%d %s accessed pages, %lu cycles avg, %f%% TLB misses.\n",
		PAGES,
#if LINEAR
		"linearly",
#else
		"randomly",
#endif
		sum/PAGES,
		100.0*((double)PAGES-(double)hit)/(double)PAGES);

	return 0;
}
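
(to build and run - a hypothetical invocation, assuming gcc and a
32-bit x86 target; the file name is made up:

	gcc -O2 -m32 -o tlbtest tlbtest.c
	sudo ./tlbtest

as long as BARRIER is 'cli', the binary needs iopl(3) privileges,
i.e. root - otherwise the cli instruction faults and the process is
killed with SIGSEGV.)
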
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/