There is no good way to do this.
> since using the unaligned access commands takes much more time
> than a normal access, so the additional test shouldn't be too bad
> and it would give a big performance win for aligned accesses.
False. Branch mispredict penalties are a bear -- 3 cycles on the ev4
and 5 cycles on the ev5:
                          # issue cycle numbers
                          # (at 2: branching shows unaligned/aligned path)
                          #         ev4                  ev5
                          # branching  fallthru  branching  fallthru
    and    addr,7,r1      #     0                    0
    beq    r1,1f          #     1                    0
    ldq_u  r1,0(addr)     #     2         0          1         0
    ldq_u  r2,7(addr)     #     3         1          1         0
    extql  r1,addr,r1     #     5         3          3         2
    extqh  r2,addr,r2     #     6         4          4         3
    or     r1,r2,r1       #     9         7          5         4
    br     2f             #    10                    5
1:  ldq    r1,0(addr)     #     4                    5
2:  /* use r1 */          #   11/7        8         6/7        5
(Assumes data is in L1 cache, otherwise load delays dominate and the
whole issue is moot.)
So on the ev4 fallthru is 1 cycle slower for aligned accesses --
hardly much more time -- and on the ev5 fallthru always wins.
r~