RE: [PATCH 2/6] lib/lzo: enable 64-bit CTZ on Arm
From: Matt Sealey
Date: Wed Nov 21 2018 - 13:30:39 EST
Hi Christoph,
> > #define LZO_USE_CTZ32 1
> > #elif defined(__i386__) || defined(__powerpc__)
> > #define LZO_USE_CTZ32 1
> > -#elif defined(__arm__) && (__LINUX_ARM_ARCH__ >= 5)
> > +#elif defined(__arm__)
> > +#if (__LINUX_ARM_ARCH__ >= 5)
> > #define LZO_USE_CTZ32 1
> > #endif
> > +#if (__LINUX_ARM_ARCH__ >= 6) && (CONFIG_THUMB2_KERNEL)
> > +#define LZO_USE_CTZ64 1
>
> All of this really needs to be driven by Kconfig symbols that
> the architecture selects instead of magic arch ifdefs here.
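For reference, the Kconfig-driven shape would look something like this sketch (the symbol name is my invention, not an existing Kconfig symbol):

```c
/* Hypothetical: the arch Kconfig does "select ARCH_HAS_FAST_CTZ64"
 * where a 64-bit CTZ is cheap, and lib/lzo only tests the symbol
 * instead of open-coding per-arch ifdefs: */
#ifdef CONFIG_ARCH_HAS_FAST_CTZ64
#define LZO_USE_CTZ64 1
#endif
```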
TL;DR summary: Low-hanging fruit.
TL explanation of the thought process:
Rather than using arch ifdefs, it seems we can use __builtin_ctzll()
and __builtin_clzll() on all architectures with no measurable impact
on codegen - libgcc implements them in terms of __builtin_ctz/clz
everywhere I've looked, so gcc at least is going to use that code
for the builtins. On 32-bit Arm it introduces a tiny bit of extra
register pressure in the instruction sequence, but it's not measurable.
That is to say that the difference between architectures or
architecture versions isn't the availability of a count-leading-zeros
optimization or the builtin or an efficient algorithm, but the code
preceding it to collect a 64-bit quantity to count.
That is a whole other cleanup.
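To make the "collect a 64-bit quantity to count" step concrete, here is a minimal sketch (my own illustration, not code from the patch; the helper name is invented, and a little-endian host is assumed) of the one-load, one-xor, one-CTZ pattern:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not from the patch: count how many of the
 * first eight bytes of two buffers match, using one 64-bit load
 * each, an XOR, and a single CTZ. Assumes a little-endian host
 * (as on arm/arm64 Linux), where the first mismatching memory
 * byte is the lowest nonzero byte of the XOR. */
static unsigned int match_bytes64(const void *a, const void *b)
{
	uint64_t va, vb, diff;

	memcpy(&va, a, sizeof(va));	/* unaligned-safe loads */
	memcpy(&vb, b, sizeof(vb));
	diff = va ^ vb;
	if (!diff)
		return 8;	/* all eight bytes match */
	/* __builtin_ctzll is undefined for 0, hence the check above */
	return (unsigned int)__builtin_ctzll(diff) / 8;
}
```

The divide-by-8 is what turns a bit index into a byte count; everything before the builtin is the load/xor preamble whose element size is the real per-arch question.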
Longer:
Just to further the point, though: if we rationalize the use of
builtins to use the 64-bit versions, then we may as well use the
kernel version -- __ffs64 is __builtin_ctzll in asm-generic with
the right magic arch Kconfig, and where it isn't, arches provide
exactly the same code as LZO's fallback when LZO_USE_CxZnn isn't
defined.
Even the distinction between little and big endian is moot, since
__builtin_clzll(cpu_to_be64(n))
amounts to effectively the same disassembly on both arm and arm64
as __builtin_ctzll(n), so we can just go the whole hog, replace
the 32-bit loads and xor with 64-bit ones, and do:
__ffs64(cpu_to_le64(n))
which greatly reduces code complexity in the main LZO code by
removing the distinction.
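That byte-granularity equivalence between clz-of-a-byte-swapped-word and plain ctz can be checked directly. A minimal sketch (my illustration, not the patch; __builtin_bswap64() stands in for a 64-bit byte-order conversion on a little-endian host):

```c
#include <assert.h>
#include <stdint.h>

/* At byte granularity, counting leading zeros of a byte-swapped
 * word gives the same answer as counting trailing zeros of the
 * original word, which is why the endian distinction washes out
 * once the bit count is divided by 8. Both builtins are undefined
 * for an argument of 0, so callers must handle that case first. */
static unsigned int tz_bytes_ctz(uint64_t x)
{
	return (unsigned int)__builtin_ctzll(x) / 8;
}

static unsigned int tz_bytes_clz_swapped(uint64_t x)
{
	return (unsigned int)__builtin_clzll(__builtin_bswap64(x)) / 8;
}
```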
It turns out the compiler knows what it's doing, and where it
doesn't, the arch maintainers do.
Therein lies the rub: the source gets smaller and we lose the
arch/endian distinction, but the efficiency of the preceding 64-bit
load and exclusive-or still needs investigation. We would still have
to figure out, per architecture, what the most efficient element
size is, or whether some arch has a hideous xor((u64) n)
implementation.
I simply don't have confidence in anything except that the magic
arch ifdefs do improve performance without changing too much, and
the mess they introduce here is small relative to the performance
improvement they provide. Against the true scope of the cleanup
needed to do it right, this is a concise, reliable, within-style
improvement.
Would we agree to let it in for now, on the basis that we've
thought about it? I side with MFXJO's urgency.
Ta!
Matt Sealey <matt.sealey@xxxxxxx>
Arm Partner Enablement Group