On Tue, Jun 11, 2002 at 04:16:56PM +0100, David Woodhouse wrote:
> Or were you _really_ advocating the kind of development methodology which
Not at all. You can sometimes sanely re-organise and optimise your C
code to produce better assembly without affecting its readability too
much.
> Tweaking your code and sacrificing chickens until you happen to get the
> output you want is no substitute for fixing the compiler.
"Fixing the compiler" for ARM would entail getting a degree in compiler
design, and rewriting GCC from scratch. GCC *really* isn't suited to
ARM CPUs at all. It sucks. The fact that earlier GCC versions sucked
less is the really interesting thing here...
GCC 3.1 is a hard compiler to find stuff wrong with (unlike previous
versions - well done gcc people!) However, here's a good example of a
bit of madvise_fixup_start, built for Xscale with a 5-stage instruction
pipeline with branch prediction.
It's from mm/filemap.c, setup_read_behaviour (inline function inside
madvise_fixup_start).
1. The C code:
VM_ClearReadHint(vma);
switch(behavior) {
case MADV_SEQUENTIAL:
vma->vm_flags |= VM_SEQ_READ;
break;
case MADV_RANDOM:
vma->vm_flags |= VM_RAND_READ;
break;
default:
break;
}
2. GCC 3.1 output with -O2 on ARM:
ldr r3, [r4, #20]
cmp r6, #1
bic r3, r3, #98304
str r7, [r4, #8]
str r3, [r4, #20]
beq .L803 <=== branch if equal
cmp r6, #2
orreq r3, r3, #32768
beq .L811 <=== branch if equal
.L806:
ldr r0, [r4, #56]
...
b .L809
.L811:
str r3, [r4, #20]
b .L806 <=== unconditional branch
.L803:
orr r3, r3, #65536
b .L811 <=== unconditional branch
This gives the following instruction path lengths:
neither: 10 (no branches)
first: 11 (3 branches)
second: 12 (2 branches)
-Os doesn't make much difference.
3. My human-based optimised output for ARM is:
ldr r3, [r4, #20]
cmp r6, #1
bic r3, r3, #98304
str r7, [r4, #8]
orreq r3, r3, #65536
beq .L806 <=== branch if equal
cmp r6, #2
orreq r3, r3, #32768
.L806:
str r3, [r4, #20]
ldr r0, [r4, #56]
...
b .L809
This gives the following instruction path lengths:
neither: 10 (no branches)
first: 8 (3 branches)
second: 10 (2 branches)
Any way you look at it, the above has got to be better.
We can probably get something very close to (3) out of gcc by doing
something like:
unsigned long flags;
flags = vma->vm_flags & ~(VM_SEQ_READ|VM_RAND_READ);
switch(behavior) {
case MADV_SEQUENTIAL:
flags |= VM_SEQ_READ;
break;
case MADV_RANDOM:
flags |= VM_RAND_READ;
break;
default:
break;
}
vma->vm_flags = flags;
No gotos required. So, what about VM_ClearReadHint()? It turns out
that its only used in once place - setup_read_behaviour(). Now, with
the above, we end up with the following output from the same version
of gcc with -O2:
ldr r3, [r4, #20]
cmp r6, #1
bic r3, r3, #98304
str r7, [r4, #8]
orreq r3, r3, #65536
beq .L801
cmp r6, #2
orreq r3, r3, #32768
.L801:
ldr r0, [r4, #56]
str r3, [r4, #20]
All I've shown above is how GCC can miss some optimisations its easy
for a human to see, and there are some simple ways to rewrite C code
to allow GCC to make those optimisations. But hey - that's nothing
new.
-- Russell King (rmk@arm.linux.org.uk) The developer of ARM Linux http://www.arm.linux.org.uk/personal/aboutme.html- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
This archive was generated by hypermail 2b29 : Sat Jun 15 2002 - 22:00:23 EST