Re: PROBLEM: mkdep.c lacks a little endian definition for little endian mips machines

Michael Elizabeth Chastain (mec@shout.net)
Sat, 6 Feb 1999 13:59:40 -0600


This mkdep.c little-endian problem really itched me, so I scratched it.
I wrote a patch to mkdep.c that blows away the architecture-sensitive
LE_MACHINE stuff and has a simple implementation of GET_NEXT.

I'm aware that a lot of people here run 'make dep' a lot and they care
about speed. So I ran a speed test.

My simple version is ten percent *faster* on my processor, and eight
percent *faster* on my ISP's processor.

Now personally I think it's wrong to put architecture-sensitive tests
into mkdep.c in the first place. It's a freaking kbuild program, not
the system call entry sequence. The only justification for a complex
implementation is speed. But it turns out the complex implementation is
*slower* by actual measurement. It's slower because it has more branches.

So let's kill it already. Patch appended. Try it out; let me know
if it works for you.

Michael Elizabeth Chastain
<mailto:mec@shout.net>
"love without fear"

===

diff -u -r -N linux-2.2.2-pre2/scripts/mkdep.c linux/scripts/mkdep.c
--- linux-2.2.2-pre2/scripts/mkdep.c Fri Jan 8 13:10:11 1999
+++ linux/scripts/mkdep.c Sat Feb 6 06:05:33 1999
@@ -188,31 +188,14 @@


/*
- * Macros for stunningly fast map-based character access.
- * __buf is a register which holds the current word of the input.
- * Thus, there is one memory access per sizeof(unsigned long) characters.
+ * Macros for map-based character access.
+ * There is one memory access per character.
*/

-#if defined(__alpha__) || defined(__i386__) || defined(__arm__)
-#define LE_MACHINE
-#endif
-
-#ifdef LE_MACHINE
-#define next_byte(x) (x >>= 8)
-#define current ((unsigned char) __buf)
-#else
-#define next_byte(x) (x <<= 8)
-#define current (__buf >> 8*(sizeof(unsigned long)-1))
-#endif
-
#define GETNEXT { \
- next_byte(__buf); \
- if ((unsigned long) next % sizeof(unsigned long) == 0) { \
- if (next >= end) \
- break; \
- __buf = * (unsigned long *) next; \
- } \
- next++; \
+ if (next >= end) \
+ break; \
+ current = *next++; \
}

/*
@@ -253,7 +236,7 @@
{
const char * next = map;
const char * map_dot;
- unsigned long __buf = 0;
+ char current;

for (;;) {
start:

===

How I tested: I started with linux kernel 2.2.2-pre2. I went into
drivers/char and did 'wc *.[hcS]' to load all the files into the disk
cache. Then I ran these commands five times:

time scripts/old-mkdep *.[hcS] > .depend # old mkdep
time scripts/mkdep *.[hcS] > .depend # simple GET_NEXT mkdep

My machine is an IBM 486/SLC-2-66 MHz, 16 megabytes memory, 14 bogomips.
I use gcc version 2.7.2.

Here are the timing figures:

/* old mkdep */
2.04user 0.79system 0:02.85elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
2.07user 0.79system 0:02.88elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
2.04user 0.79system 0:02.83elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
1.99user 0.85system 0:02.84elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
2.05user 0.82system 0:02.89elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k

/* simple GET_NEXT mkdep */
1.75user 0.76system 0:02.52elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
1.76user 0.77system 0:02.54elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
1.78user 0.74system 0:02.53elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
1.75user 0.75system 0:02.52elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
1.77user 0.74system 0:02.52elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k

I looked carefully at the generated code for GET_NEXT (the important
invocation in the inner loop). Here's the code:

/* old mkdep */
.L104:
shrl $8,%ebx
testl $3,%esi
jne .L105
cmpl %edi,%esi
jae .L102
movl (%esi),%ebx
.L105:
incl %esi

/* simple GET_NEXT mkdep */
.L104:
cmpl %edi,%ebx
jae .L102
movb (%ebx),%cl
incl %ebx

The old mkdep takes an extra conditional jump. My processor doesn't have
branch prediction, so the 'jne .L105' causes a pipeline stall. This jump
is not a compiler artifact; it comes about because the fancy GETNEXT needs
to have a test on each character to see if it's at the end of the word.

I also tried hand-tuning the inner loop of the old mkdep to make the
common case be an untaken conditional branch. No joy; the performance
did not get any better. The code still has to take an out-of-line branch
every sizeof(unsigned long) characters.

I repeated these tests on my ISP, which has a GenuineIntel 686 Model 5
with 300 bogomips. The gap was about the same; average 0.66 seconds
(over a bigger set of files) versus 0.61 seconds.

I don't have access to any non-i386 machines to test performance there,
so if someone would like to do head-to-head performance measurements,
go for it. Post the numbers and the details.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/