Re: [PATCH v4 1/2] arm: lib: xor-neon: remove unnecessary GCC < 4.6 warning
From: Nick Desaulniers
Date: Wed Jan 20 2021 - 19:50:56 EST
On Tue, Jan 19, 2021 at 1:35 PM Arnd Bergmann <arnd@xxxxxxxxxx> wrote:
>
> On Tue, Jan 19, 2021 at 10:18 PM 'Nick Desaulniers' via Clang Built
> Linux <clang-built-linux@xxxxxxxxxxxxxxxx> wrote:
> >
> > On Tue, Jan 19, 2021 at 5:17 AM Adrian Ratiu <adrian.ratiu@xxxxxxxxxxxxx> wrote:
> > > diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
> > > index b99dd8e1c93f..f9f3601cc2d1 100644
> > > --- a/arch/arm/lib/xor-neon.c
> > > +++ b/arch/arm/lib/xor-neon.c
> > > @@ -14,20 +14,22 @@ MODULE_LICENSE("GPL");
> > > #error You should compile this file with '-march=armv7-a -mfloat-abi=softfp -mfpu=neon'
> > > #endif
> > >
> > > +/*
> > > + * TODO: Even though -ftree-vectorize is enabled by default in Clang, the
> > > + * compiler does not produce vectorized code due to its cost model.
> > > + * See: https://github.com/ClangBuiltLinux/linux/issues/503
> > > + */
> > > +#ifdef CONFIG_CC_IS_CLANG
> > > +#warning Clang does not vectorize code in this file.
> > > +#endif
> >
> > Arnd, remind me again why it's a bug that the compiler's cost model
> > says it's faster to not produce a vectorized version of these loops?
> > I stand by my previous comment: https://bugs.llvm.org/show_bug.cgi?id=40976#c8
>
> The point is that without vectorizing the code, there is no point in building
> both the default xor code and a "neon" version that has to save/restore
> the neon registers but doesn't actually use them.
Doesn't that already happen today with GCC when the pointer arguments
are overlapping? The loop is "versioned" and may not actually use the
NEON registers at all at runtime for such arguments.
https://godbolt.org/z/q48q8v See also:
https://bugs.llvm.org/show_bug.cgi?id=40976#c11. Or am I missing
something?
So I'm thinking if we extend out this pattern to the rest of the
functions, we can actually avoid calls to
kernel_neon_begin()/kernel_neon_end() for cases in which pointers
would be too close to use the vectorized loop version; meaning for GCC
this would be an optimization (don't save neon registers when you're
not going to use them). I would probably consider moving
include/asm-generic/xor.h somewhere under arch/arm/
perhaps...err...something for the other users of <asm-generic/xor.h>.
diff --git a/arch/arm/include/asm/xor.h b/arch/arm/include/asm/xor.h
index aefddec79286..abd748d317e8 100644
--- a/arch/arm/include/asm/xor.h
+++ b/arch/arm/include/asm/xor.h
@@ -148,12 +148,12 @@ extern struct xor_block_template const
xor_block_neon_inner;
static void
xor_neon_2(unsigned long bytes, unsigned long *p1, unsigned long *p2)
{
- if (in_interrupt()) {
- xor_arm4regs_2(bytes, p1, p2);
- } else {
+ if (!in_interrupt() && abs((uintptr_t)p1, (uintptr_t)p2) >= 8) {
kernel_neon_begin();
xor_block_neon_inner.do_2(bytes, p1, p2);
kernel_neon_end();
+ } else {
+ xor_arm4regs_2(bytes, p1, p2);
}
}
diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
index b99dd8e1c93f..0e8e474c0523 100644
--- a/arch/arm/lib/xor-neon.c
+++ b/arch/arm/lib/xor-neon.c
@@ -14,22 +14,6 @@ MODULE_LICENSE("GPL");
#error You should compile this file with '-march=armv7-a
-mfloat-abi=softfp -mfpu=neon'
#endif
-/*
- * Pull in the reference implementations while instructing GCC (through
- * -ftree-vectorize) to attempt to exploit implicit parallelism and emit
- * NEON instructions.
- */
-#if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 6)
-#pragma GCC optimize "tree-vectorize"
-#else
-/*
- * While older versions of GCC do not generate incorrect code, they fail to
- * recognize the parallel nature of these functions, and emit plain ARM code,
- * which is known to be slower than the optimized ARM code in asm-arm/xor.h.
- */
-#warning This code requires at least version 4.6 of GCC
-#endif
-
#pragma GCC diagnostic ignored "-Wunused-variable"
#include <asm-generic/xor.h>
diff --git a/include/asm-generic/xor.h b/include/asm-generic/xor.h
index b62a2a56a4d4..69df62095c33 100644
--- a/include/asm-generic/xor.h
+++ b/include/asm-generic/xor.h
@@ -8,7 +8,7 @@
#include <linux/prefetch.h>
static void
-xor_8regs_2(unsigned long bytes, unsigned long *p1, unsigned long *p2)
+xor_8regs_2(unsigned long bytes, unsigned long * __restrict p1,
unsigned long * __restrict p2)
{
long lines = bytes / (sizeof (long)) / 8;
--
Thanks,
~Nick Desaulniers