Umm, while I agree that it might make it a lot easier to debug things on the
alpha, I hope nobody writes a patch like that for gdb. The reason is that I
hope it will be unnecessary. The _correct_ way to handle it is to never
generate those d*mn instructions in the first place. We need an optimizing
linker that gets rid of them and replaces the whole call sequence with a
single "bsr" instruction instead.
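Just to make it concrete, this is roughly the call sequence gcc generates
now (the symbol name is made up and I'm leaving out the relocations, so
this is only a sketch):

	ldq	$27,func($gp)	# fetch the function address from the data segment
	jsr	$26,($27)	# indirect call through $27
	ldgp	$gp,0($26)	# reload gp when we get back

and what the linker could turn it into whenever the target turns out to be
in range:

	bsr	$26,func	# one instruction, no data access, no load stall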
The reasons for doing it with "bsr":
- it's smaller (no data space needed, and only 1 insn)
- it's faster (roughly 10-1000 times faster! The load will stall, and it
will stall a _lot_ if it wasn't in the cache)
- it doesn't have any problems like the above when debugging - you see
the bsr target quite well.
- secondary issues: cache footprints etc.
The reason it's not done with "bsr" now:
- the compiler can't know at compile-time if the target will be within
the 21-bit branch offset (or whatever it was). In 99% of all cases it
_will_ be in range, and the cases that aren't are essentially just calls
into shared libraries, which can't be optimized this way anyway.
- thus the linker needs to do the optimizing, and the linker isn't
clever enough (it should also look at the target, and jump over any
"ldgp" instructions when the caller and callee are within the same
ldgp area).
So if somebody wants to do something worthwhile on the alpha, and doesn't
like the instruction scheduling stuff, _this_ single optimization would
probably make more of a difference.. The easy way to handle it is within
the compiler (make gcc always use "bsr" and never reload "gp"), but that
doesn't work for the generic case, so it's not really worth pursuing (I
wrote patches to gcc that did it, but I threw them away after I noticed
all the problems with shared libs etc).
gcc _does_ produce "bsr" calls, but only for local jumps (static
functions or calls to a function that was defined earlier in the same
source file). Everything else goes through the full sequence, which makes
subroutine calls very inefficient (it's not just the added instructions,
it's the data segment access I don't like either).
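To show what I mean (the function names are obviously made up):

static int add_one(int x)		/* defined before the caller, same file */
{
	return x + 1;
}

extern int somewhere_else(int x);	/* defined in some other file */

int caller(int x)
{
	x = add_one(x);			/* gcc can use a direct "bsr" for this */
	return somewhere_else(x);	/* this one gets the ldq/jsr/ldgp dance */
}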
There are some other optimizations that it would be interesting to see gcc do:
"known alignment byte and word fetch/store". There are cases where you know
the alignment of a byte or short value (when the access is through a
structure or union that has guaranteed alignment due to other larger
members), but gcc creates code for the generic byte load case. For example:
struct example {
	struct example * next;
	unsigned char data;
};
gcc will compile "example->data = 1" as something like this (note the silly
"addq $16,8,$3", even though we are only interested in the low 3 bits of the
address for the byte masking ops, but note even more the whole broken
algorithm):
	ldq_u	$2,8($16)	# load the quadword that contains the byte
	bis	$31,1,$1	# $1 = 1, the value we want to store
	addq	$16,8,$3	# compute the byte address (only the low bits matter)
	insbl	$1,$3,$1	# shift the value into the right byte lane
	mskbl	$2,$3,$2	# clear that byte in the loaded quadword
	bis	$1,$2,$1	# merge the two
	stq_u	$1,8($16)	# store the quadword back
it _should_ be something like this (note that this example happens to be
exactly aligned on an 8-byte boundary, but the same thing works with very
minor modifications for any structure byte access, because we know what
the alignment within the quadword should be):
	ldq	$1,8($16)	# aligned load of the quadword containing "data"
	zap	$1,1,$1		# clear byte 0, where "data" is known to live
	bis	$1,1,$1		# OR in the new value
	stq	$1,8($16)	# store it back
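And to show what I mean by "very minor modifications": if the byte sat one
position higher in the quadword (say the struct had a second char right
after "data"), storing 1 into it would only change the byte mask and the
constant, roughly:

	ldq	$1,8($16)	# same aligned load
	zap	$1,2,$1		# clear byte 1 this time
	lda	$2,256($31)	# the value 1 shifted up into byte 1 (1 << 8)
	bis	$1,$2,$1	# merge
	stq	$1,8($16)	# store it back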
I would really like to get the code density a bit better - alpha binaries are
inherently larger than x86 and most others, but it doesn't exactly help that
gcc is stupid in some ways. And I'd like to work on gcc - I've done small
hacks before and it could be interesting - but I don't honestly think I'll
have the time to do anything real..
Linus