Hi Luke,
Not a problem. I was just curious for my own selfish reasons.
On 12/05/16 16:34, Luke Starrett wrote:
Hi Robin,
I pulled this in to a userspace test app expecting that the __uint128_t
type might cause GCC to emit 'ldp'. Seems like that was your
intent based on your commit note. Instead I see two 64b loads (ldr Xn),
and a single 32b load (ldr Wn) for the trailing 4B. This was with
Linaro GCC 4.9-2015.06.
GCC 5 happily emits ldp there, but indeed I couldn't figure out how to convince GCC 4 to do so. From a quick ferret around in the GCC Git, it looks like the relevant optimisations may have only gone in post-4.9.
Otherwise, the C cycle count looks good enough compared to the asm version.
Yeah, compiling as standalone functions with GCC 5 I get 19 instructions vs. 17 for the asm, but the loop logic gets optimised out completely when ihl is a compile-time constant (e.g. in inet_gro_receive()).
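For anyone wanting to repeat the userspace experiment, here is a minimal sketch of the general shape under discussion: one 128-bit load for the first 16 bytes of the header, then 32-bit loads for the remaining words. It is not the patch itself. The names ip_fast_csum_sketch() and csum_fold32() are made up for illustration, memcpy() stands in for the pointer casts so the test app stays clear of aliasing and alignment problems, and an extra fold is added before the loop so the 64-bit accumulator cannot overflow. It assumes GCC or Clang on a 64-bit target, since __uint128_t is a compiler extension.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fold a 32-bit partial sum to 16 bits and complement it. */
static inline uint16_t csum_fold32(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/*
 * Hypothetical userspace sketch of an IPv4 header checksum.
 * iph points at the header, ihl is its length in 32-bit words.
 */
static inline uint16_t ip_fast_csum_sketch(const void *iph, unsigned int ihl)
{
	const unsigned char *p = iph;
	__uint128_t tmp;
	uint64_t sum;
	uint32_t word;

	/* A valid IPv4 header is at least 5 words; the loop below relies on it. */
	assert(ihl >= 5);

	/* First 16 bytes in one go; this is the load that may become 'ldp'. */
	memcpy(&tmp, p, sizeof(tmp));
	p += 16;
	ihl -= 4;

	/* Add the two 64-bit halves with end-around carry; result is in the top half. */
	tmp += (tmp >> 64) | (tmp << 64);
	sum = (uint64_t)(tmp >> 64);

	/* Fold to 33 bits so that adding the trailing words cannot overflow. */
	sum = (sum >> 32) + (sum & 0xffffffffu);

	/* Remaining 32-bit words (just one for an option-less header). */
	do {
		memcpy(&word, p, sizeof(word));
		sum += word;
		p += 4;
	} while (--ihl);

	/* Fold 64 bits down to 32; the second pass absorbs the carry from the first. */
	sum = (sum >> 32) + (sum & 0xffffffffu);
	sum = (sum >> 32) + (sum & 0xffffffffu);

	return csum_fold32((uint32_t)sum);
}
```

Building with -DNDEBUG drops the assert when counting instructions. Calling it with a literal ihl of 5 from a small wrapper should be enough to see whether the compiler removes the loop, and comparing the disassembly across GCC versions shows whether the 16-byte load comes out as 'ldp' or as two 'ldr Xn'.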