Re: [PATCH] powerpc/32: Remove one insn in __bswapdi2

From: Gabriel Paubert
Date: Thu Aug 11 2016 - 17:35:15 EST


On Wed, Aug 10, 2016 at 12:18:15PM +0200, Christophe Leroy wrote:
>
>
> Le 10/08/2016 à 10:56, Gabriel Paubert a écrit :
> >On Fri, Aug 05, 2016 at 01:28:02PM +0200, Christophe Leroy wrote:
> >>Signed-off-by: Christophe Leroy <christophe.leroy@xxxxxx>
> >>---
> >> arch/powerpc/kernel/misc_32.S | 3 +--
> >> 1 file changed, 1 insertion(+), 2 deletions(-)
> >>
> >>diff --git a/arch/powerpc/kernel/misc_32.S b/arch/powerpc/kernel/misc_32.S
> >>index e025230..e18055c 100644
> >>--- a/arch/powerpc/kernel/misc_32.S
> >>+++ b/arch/powerpc/kernel/misc_32.S
> >>@@ -578,9 +578,8 @@ _GLOBAL(__bswapdi2)
> >> rlwimi r9,r4,24,0,7
> >> rlwimi r10,r3,24,0,7
> >> rlwimi r9,r4,24,16,23
> >>- rlwimi r10,r3,24,16,23
> >>+ rlwimi r4,r3,24,16,23
> >> mr r3,r9
> >>- mr r4,r10
> >> blr
> >>
> >
> >Hmmm, are you sure that it works? rlwimi is a bit special since the
> >first operand is both an input and an output of the instruction.
> >
> >
>
> Oops, you are right ...

I just found this:

http://hardwarebug.org/2010/01/14/beware-the-builtins/

the bswapdi2 suggested sequence only needs a single mr instruction, the
other one is absorbed in a rotlwi.

The scheduling looks poor, but it seems impossible to interleave the
operations between the two halves without adding another instructions,
and the routine is 8 instructions long, which happens to be exactly a
cache line on most 32 bit processors.

On the other hand gcc did at the time a very poor job (quite an
understatement) at bswapdi when compiling for 64 bit processors
(see the example).

But what do modern compilers generate for bswapdi these days? Do they
still call the library or not?

After all, bswapdi on 32 bit processors only takes 6 instructions if the
input and output registers don't overlap.

Gabriel