Re: [PATCH 1/1] riscv: Enable ARCH_HAS_FAST_MULTIPLIER for RV64I

From: Chenxi Mao
Date: Wed Jul 22 2020 - 22:57:46 EST


Hi Palmer:


Did you mean we drop this totally or drop this for __sw_hweight32 only?


Chenxi

On 2020/7/23 äå10:13, Palmer Dabbelt wrote:
> On Wed, 22 Jul 2020 18:59:12 PDT (-0700), maochenxi@xxxxxxxxx wrote:
>> Hi Palmer and Emil:
>>
>> As Emil mentioned in previous E-mail loop, I did the same test on my kernel as well.
>
> Sorry, I guess I crossed up my emails. I think it's best to just drop this for
> now, as it doesn't actually seem to generate better code for our current
> target.
>
>>
>> My kernel is based on Linux 5.8-RC6 with GCC-10.1. (ISA C extension enabled)
>>
>> The disassembly code as below:
>>
>> CONFIG_ARCH_HAS_FAST_MULTIPLIER enabled:
>>
>> 0000000000000000 <__sw_hweight32>:
>> ÂÂ 0:ÂÂÂ 555557b7ÂÂÂÂÂÂÂÂÂ ÂÂÂ luiÂÂÂ a5,0x55555
>> ÂÂ 4:ÂÂÂ 0015571bÂÂÂÂÂÂÂÂÂ ÂÂÂ srliwÂÂÂ a4,a0,0x1
>> ÂÂ 8:ÂÂÂ 55578793ÂÂÂÂÂÂÂÂÂ ÂÂÂ addiÂÂÂ a5,a5,1365 # 55555555 <.LASF5+0x5555509d>
>> ÂÂ c:ÂÂÂ 8ff9ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ andÂÂÂ a5,a5,a4
>> ÂÂ e:ÂÂÂ 9d1dÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ subwÂÂÂ a0,a0,a5
>>
>> 0000000000000010 <.LVL1>:
>> Â 10:ÂÂÂ 333337b7ÂÂÂÂÂÂÂÂÂ ÂÂÂ luiÂÂÂ a5,0x33333
>> Â 14:ÂÂÂ 33378793ÂÂÂÂÂÂÂÂÂ ÂÂÂ addiÂÂÂ a5,a5,819 # 33333333 <.LASF5+0x33332e7b>
>> Â 18:ÂÂÂ 0025571bÂÂÂÂÂÂÂÂÂ ÂÂÂ srliwÂÂÂ a4,a0,0x2
>> Â 1c:ÂÂÂ 8d7dÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ andÂÂÂ a0,a0,a5
>> Â 1e:ÂÂÂ 8ff9ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ andÂÂÂ a5,a5,a4
>> Â 20:ÂÂÂ 9fa9ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ addwÂÂÂ a5,a5,a0
>> Â 22:ÂÂÂ 0047d51bÂÂÂÂÂÂÂÂÂ ÂÂÂ srliwÂÂÂ a0,a5,0x4
>> Â 26:ÂÂÂ 9fa9ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ addwÂÂÂ a5,a5,a0
>> Â 28:ÂÂÂ 0f0f1537ÂÂÂÂÂÂÂÂÂ ÂÂÂ luiÂÂÂ a0,0xf0f1
>> Â 2c:ÂÂÂ 1141ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ addiÂÂÂ sp,sp,-16
>> Â 2e:ÂÂÂ f0f50513ÂÂÂÂÂÂÂÂÂ ÂÂÂ addiÂÂÂ a0,a0,-241 # f0f0f0f <.LASF5+0xf0f0a57>
>> Â 32:ÂÂÂ e422ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ sdÂÂÂ s0,8(sp)
>> Â 34:ÂÂÂ 8fe9ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ andÂÂÂ a5,a5,a0
>> Â 36:ÂÂÂ 0800ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ addiÂÂÂ s0,sp,16
>> Â 38:ÂÂÂ 0087951bÂÂÂÂÂÂÂÂÂ ÂÂÂ slliwÂÂÂ a0,a5,0x8
>> Â 3c:ÂÂÂ 6422ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ ldÂÂÂ s0,8(sp)
>> Â 3e:ÂÂÂ 9d3dÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ addwÂÂÂ a0,a0,a5
>> Â 40:ÂÂÂ 0105179bÂÂÂÂÂÂÂÂÂ ÂÂÂ slliwÂÂÂ a5,a0,0x10
>> Â 44:ÂÂÂ 9d3dÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ addwÂÂÂ a0,a0,a5
>> Â 46:ÂÂÂ 0185551bÂÂÂÂÂÂÂÂÂ ÂÂÂ srliwÂÂÂ a0,a0,0x18
>> Â 4a:ÂÂÂ 0141ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ addiÂÂÂ sp,sp,16
>> Â 4c:ÂÂÂ 8082ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ ret
>>
>> CONFIG_ARCH_HAS_FAST_MULTIPLIER disabled:
>>
>> 000000000000004e <__sw_hweight32_default>:
>> Â 4e:ÂÂÂ 55555737ÂÂÂÂÂÂÂÂÂ ÂÂÂ luiÂÂÂ a4,0x55555
>> Â 52:ÂÂÂ 0015579bÂÂÂÂÂÂÂÂÂ ÂÂÂ srliwÂÂÂ a5,a0,0x1
>> Â 56:ÂÂÂ 55570713ÂÂÂÂÂÂÂÂÂ ÂÂÂ addiÂÂÂ a4,a4,1365 # 55555555 <.LASF5+0x5555509d>
>> Â 5a:ÂÂÂ 8ff9ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ andÂÂÂ a5,a5,a4
>> Â 5c:ÂÂÂ 9d1dÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ subwÂÂÂ a0,a0,a5
>>
>> 000000000000005e <.LVL3>:
>> Â 5e:ÂÂÂ 333337b7ÂÂÂÂÂÂÂÂÂ ÂÂÂ luiÂÂÂ a5,0x33333
>> Â 62:ÂÂÂ 33378793ÂÂÂÂÂÂÂÂÂ ÂÂÂ addiÂÂÂ a5,a5,819 # 33333333 <.LASF5+0x33332e7b>
>> Â 66:ÂÂÂ 0025571bÂÂÂÂÂÂÂÂÂ ÂÂÂ srliwÂÂÂ a4,a0,0x2
>> Â 6a:ÂÂÂ 8d7dÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ andÂÂÂ a0,a0,a5
>> Â 6c:ÂÂÂ 8ff9ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ andÂÂÂ a5,a5,a4
>> Â 6e:ÂÂÂ 9fa9ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ addwÂÂÂ a5,a5,a0
>> Â 70:ÂÂÂ 0047d51bÂÂÂÂÂÂÂÂÂ ÂÂÂ srliwÂÂÂ a0,a5,0x4
>> Â 74:ÂÂÂ 9d3dÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ addwÂÂÂ a0,a0,a5
>> Â 76:ÂÂÂ 0f0f17b7ÂÂÂÂÂÂÂÂÂ ÂÂÂ luiÂÂÂ a5,0xf0f1
>> Â 7a:ÂÂÂ 1141ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ addiÂÂÂ sp,sp,-16
>> Â 7c:ÂÂÂ f0f78793ÂÂÂÂÂÂÂÂÂ ÂÂÂ addiÂÂÂ a5,a5,-241 # f0f0f0f <.LASF5+0xf0f0a57>
>> Â 80:ÂÂÂ e422ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ sdÂÂÂ s0,8(sp)
>> Â 82:ÂÂÂ 8fe9ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ andÂÂÂ a5,a5,a0
>> Â 84:ÂÂÂ 0800ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ addiÂÂÂ s0,sp,16
>> Â 86:ÂÂÂ 0087d51bÂÂÂÂÂÂÂÂÂ ÂÂÂ srliwÂÂÂ a0,a5,0x8
>> Â 8a:ÂÂÂ 6422ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ ldÂÂÂ s0,8(sp)
>> Â 8c:ÂÂÂ 9fa9ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ addwÂÂÂ a5,a5,a0
>> Â 8e:ÂÂÂ 0107d51bÂÂÂÂÂÂÂÂÂ ÂÂÂ srliwÂÂÂ a0,a5,0x10
>> Â 92:ÂÂÂ 9d3dÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ addwÂÂÂ a0,a0,a5
>> Â 94:ÂÂÂ 0ff57513ÂÂÂÂÂÂÂÂÂ ÂÂÂ andiÂÂÂ a0,a0,255
>> Â 98:ÂÂÂ 0141ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ addiÂÂÂ sp,sp,16
>> Â 9a:ÂÂÂ 8082ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ ret
>>
>> This 2 implementations is almost same but small differences.
>>
>> Especially in CONFIG_ARCH_HAS_FAST_MULTIPLIER condition, below code didn't use "mul" instructions.
>>
>> ÂÂÂ " return (w * 0x01010101) >> 24; "
>>
>> So I am trying to translate this code with inline assembly as below:
>>
>> //return (w * 0x01010101) >> 24;
>> __asm__ (
>> " mul %0, %0, %1\n"
>> : "+r" (w)
>> : "r" (w), "r"(0x01010101)
>> :);
>> return w >> 24;
>>
>> After above change, the disassambly as below:
>> 0000000000000000 <__sw_hweight32>:
>> ÂÂ 0:ÂÂ Â555557b7ÂÂÂÂÂÂÂÂ ÂÂÂ ÂluiÂÂ Âa5,0x55555
>> ÂÂ 4:ÂÂ Â0015571bÂÂÂÂÂÂÂÂ ÂÂÂ ÂsrliwÂÂ Âa4,a0,0x1
>> ÂÂ 8:ÂÂ Â55578793ÂÂÂÂÂÂÂÂ ÂÂÂ ÂaddiÂÂ Âa5,a5,1365 # 55555555 <.LASF5+0x55555119>
>> ÂÂ c:ÂÂ Â8ff9ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ ÂandÂÂ Âa5,a5,a4
>> ÂÂ e:ÂÂ Â9d1dÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ ÂsubwÂÂ Âa0,a0,a5
>>
>> 0000000000000010 <.LVL1>:
>> Â 10:ÂÂ Â333337b7ÂÂÂÂÂÂÂÂ ÂÂÂ ÂluiÂÂ Âa5,0x33333
>> Â 14:ÂÂ Â0025571bÂÂÂÂÂÂÂÂ ÂÂÂ ÂsrliwÂÂ Âa4,a0,0x2
>> Â 18:ÂÂ Â33378793ÂÂÂÂÂÂÂÂ ÂÂÂ ÂaddiÂÂ Âa5,a5,819 # 33333333 <.LASF5+0x33332ef7>
>> Â 1c:ÂÂ Â8d7dÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ ÂandÂÂ Âa0,a0,a5
>> Â 1e:ÂÂ Â8ff9ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ ÂandÂÂ Âa5,a5,a4
>> Â 20:ÂÂ Â9fa9ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ ÂaddwÂÂ Âa5,a5,a0
>> Â 22:ÂÂ Â0047d71bÂÂÂÂÂÂÂÂ ÂÂÂ ÂsrliwÂÂ Âa4,a5,0x4
>> Â 26:ÂÂ Â9f3dÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ ÂaddwÂÂ Âa4,a4,a5
>> Â 28:ÂÂ Â0f0f17b7ÂÂÂÂÂÂÂÂ ÂÂÂ ÂluiÂÂ Âa5,0xf0f1
>> Â 2c:ÂÂ Â1141ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ ÂaddiÂÂ Âsp,sp,-16
>> Â 2e:ÂÂ Âf0f78793ÂÂÂÂÂÂÂÂ ÂÂÂ ÂaddiÂÂ Âa5,a5,-241 # f0f0f0f <.LASF5+0xf0f0ad3>
>> Â 32:ÂÂ Âe422ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ ÂsdÂÂ Âs0,8(sp)
>> Â 34:ÂÂ Â8ff9ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ ÂandÂÂ Âa5,a5,a4
>> Â 36:ÂÂ Â0800ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ ÂaddiÂÂ Âs0,sp,16
>> Â 38:ÂÂ Â01010737ÂÂÂÂÂÂÂÂ ÂÂÂ ÂluiÂÂ Âa4,0x1010
>> Â 3c:ÂÂ Â853eÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ ÂmvÂÂ Âa0,a5
>>
>> 000000000000003e <.LVL2>:
>> Â 3e:ÂÂ Â1017071bÂÂÂÂÂÂÂÂ ÂÂÂ ÂaddiwÂÂ Âa4,a4,257
>> Â 42:ÂÂ Â02f50533ÂÂÂÂÂÂÂÂ ÂÂÂ ÂmulÂÂ Âa0,a0,a5
>> Â 46:ÂÂ Â6422ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ ÂldÂÂ Âs0,8(sp)
>> Â 48:ÂÂ Â0185551bÂÂÂÂÂÂÂÂ ÂÂÂ ÂsrliwÂÂ Âa0,a0,0x18
>>
>> "mul" instruction is leveraged as expectation, but 0x01010101 load waste several instructions.
>>
>> Based on this test, force to leverage "mul" instruction might be not faster than current compiler implementations.
>>
>> I am not sure above assembly is the best way to load 0x01010101? I checked the ISA manual, "lui" only
>>
>> load 20 bits per time, is this the best way to load instants?
>>
>>
>> On the other hand, I try to compare ARM64 disassembly code:
>>
>> .....
>>
>> ÂÂ 4:ÂÂÂ 3200c3e2 ÂÂÂ movÂÂÂ w2, #0x1010101ÂÂÂÂÂÂÂÂÂÂÂÂ ÂÂÂ // #16843009
>>
>> ......
>>
>> ÂÂ w =Â (w + (w >> 4)) & 0x0f0f0f0f;
>> Â 20:ÂÂÂ 0b401000 ÂÂÂ addÂÂÂ w0, w0, w0, lsr #4
>> Â 24:ÂÂÂ 1200cc00 ÂÂÂ andÂÂÂ w0, w0, #0xf0f0f0f
>> ÂÂÂ return (w * 0x01010101) >> 24;
>> Â 28:ÂÂÂ 1b027c00 ÂÂÂ mulÂÂÂ w0, w0, w2
>>
>> Only one "mov" instructions to load 0x1010101 and one "mul" instruction for multiply.
>>
>>
>> Let me summary as below:
>>
>> 1. GCC 10.1 cannot generate "mul" instruction when CONFIG_ARCH_HAS_FAST_MULTIPLIER enabled.
>>
>> 2. force to generate "mul" didn't get better because instants load waste instructions.
>>
>> 3. If GCC compiler behavior is best solution for this case, we could have below work around on Riscv.
>>
>> Âunsigned int __sw_hweight32(unsigned int w)
>> Â{
>> -#ifdef CONFIG_ARCH_HAS_FAST_MULTIPLIER
>> +/*
>> + * Risc-V could not generate mul(w) instruction in this case
>> + */
>> +#if defined(CONFIG_ARCH_HAS_FAST_MULTIPLIER) && !defined(CONFIG_RISCV)
>> ÂÂÂÂÂÂÂ w -= (w >> 1) & 0x55555555;
>> ÂÂÂÂÂÂÂ w =Â (w & 0x33333333) + ((w >> 2) & 0x33333333);
>> ÂÂÂÂÂÂÂ w =Â (w + (w >> 4)) & 0x0f0f0f0f;
>>
>>
>> Chenxi
>>
>>
>> On 2020/7/21 äå9:17, Palmer Dabbelt wrote:
>>> On Wed, 08 Jul 2020 22:19:22 PDT (-0700), maochenxi@xxxxxxxxx wrote:
>>>> Enable ARCH_HAS_FAST_MULTIPLIER on RV64I
>>>> which works fine on GCC-9.3 and GCC-10.1
>>>>
>>>> PS2: remove ARCH_SUPPORTS_INT128 because of RV64I already enabled.
>>>>
>>>> Signed-off-by: Chenxi Mao <maochenxi@xxxxxxxxx>
>>>> ---
>>>> Âarch/riscv/Kconfig | 1 +
>>>> Â1 file changed, 1 insertion(+)
>>>>
>>>> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
>>>> index 128192e14ff2..84e6777fecad 100644
>>>> --- a/arch/riscv/Kconfig
>>>> +++ b/arch/riscv/Kconfig
>>>> @@ -202,6 +202,7 @@ config ARCH_RV64I
>>>> ÂÂÂÂ bool "RV64I"
>>>> ÂÂÂÂ select 64BIT
>>>> ÂÂÂÂ select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 && GCC_VERSION >= 50000
>>>> +ÂÂÂ select ARCH_HAS_FAST_MULTIPLIER
>>>> ÂÂÂÂ select HAVE_DYNAMIC_FTRACE if MMU
>>>> ÂÂÂÂ select HAVE_DYNAMIC_FTRACE_WITH_REGS if HAVE_DYNAMIC_FTRACE
>>>> ÂÂÂÂ select HAVE_FTRACE_MCOUNT_RECORD
>>>
>>> Ah, thanks -- this one didn't show up when I was looking at the last one. I
>>> think we can put the fast multiplier on rv32 and rv64, there shouldn't be any
>>> difference there. I guess in theory we should be sticking this all in some
>>> sort of "platform type" optimization flags, but that's probably bit much for
>>> now.